Stratified sampling of a dataset and averaging a variable in the training set

Problem description

I am trying to do a stratified split in R to create training and test datasets. The exercise I am working on is the following:

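Split the data into a train sample and a test sample such that 70% of the data is in the train sample. To ensure a similar distribution of price across the train and test samples, use the caret package. Set groups to 100 and use a seed of 1031. What is the average house price in the train sample?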

The dataset is a set of houses with their prices (along with other data points).


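For some reason, when I run the code below, the output I get is marked as incorrect in the practice problem simulator. Can anyone spot a problem with my code? Any help is much appreciated, as I am trying to avoid learning the language incorrectly.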

dput(head(houses))

library(ISLR); library(caret); library(caTools)
options(scipen=999)

set.seed(1031)
# STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = houses$price, p = 0.7, list = FALSE, groups = 100)

train = houses[split,]
test = houses[-split,]

nrow(train)
nrow(test)
nrow(houses)

mean(train$price)
mean(test$price)

Output

> dput(head(houses))
structure(list(id = c(7129300520,6414100192,5631500400,2487200875,1954400510,7237550310),price = c(221900,538000,180000,604000,510000,1225000),bedrooms = c(3,3,2,4,4),bathrooms = c(1,2.25,1,4.5),sqft_living = c(1180,2570,770,1960,1680,5420),sqft_lot = c(5650,7242,10000,5000,8080,101930),floors = c(1,1),waterfront = c(0,0),view = c(0,condition = c(3,5,3),grade = c(7,7,6,8,11),sqft_above = c(1180,2170,1050,3890),sqft_basement = c(0,400,910,1530),yr_built = c(1955,1951,1933,1965,1987,2001),yr_renovated = c(0,1991,age = c(59,63,82,49,28,13)),row.names = c(NA,-6L
),class = c("tbl_df","tbl","data.frame"))
> 
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
> 
> set.seed(1031)
> # STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
> split = createDataPartition(y = houses$price, p = 0.7, list = FALSE, groups = 100)
> 
> train = houses[split,]
> test = houses[-split,]
> 
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
> 
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
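
As a sanity check on the output above, the printed row counts and means can be combined by plain arithmetic. This is only a sketch that reuses the numbers shown in the console output; it does not touch the houses data. It gives a train share of roughly 0.702 and an implied overall mean price of about 540,088, so the split itself looks internally consistent with a 70/30 stratified partition.

```r
# Numbers copied from the console output above.
n_train <- 15172
n_test <- 6441
n_all <- 21613
mean_train <- 540674.2
mean_test <- 538707.6

# The two samples should account for every row.
stopifnot(n_train + n_test == n_all)

# Share of rows that ended up in the train sample.
train_share <- n_train / n_all

# Overall mean implied by the two sample means (weighted by sample size).
overall_mean <- (n_train * mean_train + n_test * mean_test) / n_all

print(round(train_share, 4))
print(round(overall_mean, 1))
```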

Solution

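I tried to reproduce it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical. It looks like the pseudo-random number generator or some rounding may be responsible. Your code looks correct to me. Is it possible that an earlier step of the exercise asked you to remove some outliers or preprocess the dataset in some way?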
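
One way the pseudo-random number generator could matter, offered as a guess rather than a confirmed diagnosis: R 3.6.0 changed the default algorithm behind sample() from "Rounding" to "Rejection", so the same seed can select different rows on older and newer R versions. If the simulator's expected answer was produced on an older R, that alone could explain a mismatch. The sketch below assumes R >= 3.6.0 (where set.seed() accepts sample.kind) and simply compares the two samplers for seed 1031; the variable names are illustrative.

```r
# Assumes R >= 3.6.0; older versions have no sample.kind argument.
set.seed(1031, sample.kind = "Rejection")   # default sampler since R 3.6.0
draw_default <- sample(21613, 5)

# "Rounding" reproduces the pre-3.6.0 sampler; R warns that it is non-uniform.
suppressWarnings(set.seed(1031, sample.kind = "Rounding"))
draw_rounding <- sample(21613, 5)

# Restore the default sampler so later code is unaffected.
set.seed(1031, sample.kind = "Rejection")

print(draw_default)
print(draw_rounding)
print(identical(draw_default, draw_rounding))
```

If the train mean changes when set.seed(1031, sample.kind = "Rounding") is called before createDataPartition, the R version difference is a plausible explanation for the rejected answer.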

library(caret)
options(scipen=999)

library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)

diamonds$index = 1:nrow(diamonds)

set.seed(1031)

# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups

split = diamonds %>%
  group_by(cut2(price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)

train = diamonds[split,]
test = diamonds[-split,]

> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917

set.seed(1031)
# STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price, p = 0.7, list = TRUE, groups = 100)


train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]

> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572

Either way, this sampling procedure should yield a mean that is close to the mean of the full dataset.
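
The claim that a stratified split keeps the sample mean close to the overall mean can be checked with a minimal base R sketch. It uses synthetic, made-up prices (not the houses or diamonds data) and a hand-rolled split over quantile groups, so it illustrates the idea rather than reproducing what createDataPartition or cut2 do internally.

```r
# Synthetic prices for illustration only.
set.seed(1031)
price <- round(rlnorm(5000, meanlog = 13, sdlog = 0.5))

# Cut price into (up to) 100 quantile-based strata.
breaks <- unique(quantile(price, probs = seq(0, 1, length.out = 101)))
stratum <- cut(price, breaks = breaks, include.lowest = TRUE)

# Draw about 70% of the row indices within each stratum.
pick <- function(idx) {
  idx[sample.int(length(idx), ceiling(0.7 * length(idx)))]
}
groups <- split(seq_along(price), stratum, drop = TRUE)
train_idx <- unlist(lapply(groups, pick), use.names = FALSE)

train <- price[train_idx]
test <- price[-train_idx]

# The three means should be close to each other.
print(c(overall = mean(price), train = mean(train), test = mean(test)))
```

Because every stratum contributes roughly the same fraction of its rows, the train and test means can differ from the overall mean only through variation within each narrow price band.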