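Problem description
I am trying to build a simple model in R that predicts diamond prices from the diamonds dataset. The reported accuracy is very high, 97.1%, even though the predicted values differ from the actual prices by 100. I would like to build a linear regression model that gives better accuracy.
Dataset: https://www.kaggle.com/shivam2503/diamonds
Code: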
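# Set the working directory and load the data
setwd("C:/akash/study videos/virginia")
akash = read.csv("diamonds.csv")
#summary(akash)
# Randomly assign each row to train (1) or test (2) with roughly 80/20 odds
ind = sample(2, nrow(akash), replace = TRUE, prob = c(0.8, 0.2))
train = akash[ind == 1, ]
test = akash[ind == 2, ]
# Fit log(price) on all other columns, then run stepwise selection by AIC
mod = step(lm(log(price) ~ ., data = train))
summary(mod)
# Predict on the test set and convert back from the log scale to price
predicted = predict(mod, newdata = test)
mon = round(exp(predicted), 0)
head(mon)
#head(test)
View(akash)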
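Because sample() draws at random, the train/test split, and therefore the fitted coefficients and predictions, change on every run of the code above. A minimal sketch of a reproducible split follows. The seed value 123 and the row count n are arbitrary assumptions standing in for nrow(akash); neither comes from the original post.

```r
# Fix the random seed so the same rows land in train/test on every run
set.seed(123)  # arbitrary value, chosen only for illustration
n <- 1000      # hypothetical row count standing in for nrow(akash)

# Same approach as the post: each row is assigned independently,
# so the realized split is only approximately 80/20
ind <- sample(2, n, replace = TRUE, prob = c(0.8, 0.2))
print(table(ind) / n)

# Alternative: an exact 80/20 split by sampling row indices without replacement
train_idx <- sample(n, size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
print(c(train = length(train_idx), test = length(test_idx)))
```

With the second approach the rows would be selected as akash[train_idx, ] and akash[test_idx, ].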
The output is as follows.
> summary(mod)

Call:
lm(formula = log(price) ~ X + carat + cut + color + clarity +
    depth + table + x + y + z, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1859 -0.0917  0.0034  0.0915  9.8090

Coefficients:
               Estimate Std. Error  t value Pr(>|t|)
(Intercept)  -2.791e+00  7.432e-02  -37.553  < 2e-16 ***
X            -1.331e-06  6.223e-08  -21.395  < 2e-16 ***
carat        -5.148e-01  8.663e-03  -59.419  < 2e-16 ***
cutGood       8.499e-02  6.110e-03   13.910  < 2e-16 ***
cutIdeal      1.525e-01  6.089e-03   25.055  < 2e-16 ***
cutPremium    1.049e-01  5.870e-03   17.864  < 2e-16 ***
cutVery Good  1.196e-01  5.874e-03   20.366  < 2e-16 ***
colorE       -5.672e-02  3.222e-03  -17.602  < 2e-16 ***
colorF       -8.780e-02  3.267e-03  -26.877  < 2e-16 ***
colorG       -1.563e-01  3.195e-03  -48.901  < 2e-16 ***
colorH       -2.597e-01  3.399e-03  -76.404  < 2e-16 ***
colorI       -3.855e-01  3.810e-03 -101.196  < 2e-16 ***
colorJ       -5.227e-01  4.709e-03 -111.008  < 2e-16 ***
clarityIF     1.093e+00  9.189e-03  118.966  < 2e-16 ***
claritySI1    6.068e-01  7.858e-03   77.223  < 2e-16 ***
claritySI2    4.383e-01  7.899e-03   55.486  < 2e-16 ***
clarityVS1    8.165e-01  8.026e-03  101.733  < 2e-16 ***
clarityVS2    7.491e-01  7.899e-03   94.838  < 2e-16 ***
clarityVVS1   1.002e+00  8.503e-03  117.855  < 2e-16 ***
clarityVVS2   9.359e-01  8.259e-03  113.319  < 2e-16 ***
depth         5.000e-02  8.168e-04   61.216  < 2e-16 ***
table         9.379e-03  5.298e-04   17.704  < 2e-16 ***
x             1.124e+00  5.644e-03  199.152  < 2e-16 ***
y             2.800e-02  3.125e-03    8.962  < 2e-16 ***
z             3.685e-02  5.731e-03    6.429  1.3e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1818 on 42968 degrees of freedom
Multiple R-squared:  0.9679, Adjusted R-squared:  0.9679
F-statistic: 5.398e+04 on 24 and 42968 DF,  p-value: < 2.2e-16
> head(mon)
  9  14  28  31  32  39
475 346 435 479 496 595
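The R-squared reported by summary(mod) is computed on log(price) for the training rows, so it does not say how far the rounded predictions in mon are from the actual prices. The sketch below shows one way to measure test-set error in price units. It uses a small synthetic data frame as a stand-in for diamonds.csv, so the simulated values and the simplified log(carat) model are assumptions for illustration, not the model from the post.

```r
# Synthetic stand-in for the diamonds data (hypothetical values, illustration only)
set.seed(42)
n     <- 1000
carat <- runif(n, 0.2, 2.5)
price <- exp(8.4 + 1.7 * log(carat) + rnorm(n, sd = 0.15))
dat   <- data.frame(carat = carat, price = price)

# Same style of split as in the post
ind   <- sample(2, n, replace = TRUE, prob = c(0.8, 0.2))
train <- dat[ind == 1, ]
test  <- dat[ind == 2, ]

# Simplified log-scale model (an assumption; the post uses log(price) ~ .)
mod <- lm(log(price) ~ log(carat), data = train)

# Back-transform predictions from the log scale to the price scale
pred_price <- exp(predict(mod, newdata = test))

# Errors measured in the same units as price, on held-out rows
err  <- test$price - pred_price
rmse <- sqrt(mean(err^2))                 # root mean squared error
mae  <- mean(abs(err))                    # mean absolute error
mape <- mean(abs(err) / test$price) * 100 # mean absolute percentage error

# R-squared on the held-out prices, for comparison with the log-scale value
r2_price <- 1 - sum(err^2) / sum((test$price - mean(test$price))^2)

print(c(RMSE = rmse, MAE = mae, MAPE = mape, R2_price = r2_price))
```

Applied to the post's objects, the same idea would compare test$price with exp(predicted). Note that exponentiating a log-scale prediction tends to understate the mean price somewhat, so these error figures are a rough guide rather than an exact measure.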
If possible, could you also explain which parameters the model takes into account?
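On the question of which parameters are considered: with the formula log(price) ~ ., lm() uses every column of train other than price, which is why the index column X from the CSV appears in the summary. Each factor column is expanded into one indicator per level except a reference level, so a level that is missing from the coefficient table (for example Fair for cut) is absorbed into the intercept. The sketch below runs on a small synthetic data frame whose values are all hypothetical. It shows how to list the terms a fitted model uses and how to exclude the index column.

```r
# Synthetic stand-in with an index column, one numeric and one factor predictor
set.seed(1)
n <- 200
dat <- data.frame(
  X     = seq_len(n),  # row index, like the first column of the CSV
  carat = runif(n, 0.2, 2.5),
  cut   = factor(sample(c("Fair", "Good", "Ideal"), n, replace = TRUE))
)
dat$price <- exp(7 + 1.5 * log(dat$carat) + 0.1 * (dat$cut == "Ideal") +
                 rnorm(n, sd = 0.2))

# `~ .` uses every column except the response, including the index X
mod_all <- lm(log(price) ~ ., data = dat)
print(attr(terms(mod_all), "term.labels"))  # variables entering the model
print(names(coef(mod_all)))                 # factors expanded into indicators

# Excluding the index column explicitly with `- X`
mod_noidx <- lm(log(price) ~ . - X, data = dat)
print(attr(terms(mod_noidx), "term.labels"))
```

In the post's code, the equivalent change would be log(price) ~ . - X inside lm(), since a row index is unlikely to be a meaningful predictor of price.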