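Problem description
This is my dataset, and Median_Price is my target variable.
The code below includes the RMSE values before and after parameter tuning with GridSearchCV. How can I reduce the RMSE for my dataset?
The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.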
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline
dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')
# strip thousands separators and convert the target to integers
dataset['Median_Price'] = dataset['Median_Price'].str.replace(',','').astype(int)
dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)
# non-numeric entries in Type1/Type2 become NaN
dataset['Type1'] = pd.to_numeric(dataset['Type1'],errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'],errors='coerce')
# replace missing values (the fill value 0 is an assumption)
dataset = dataset.replace(np.nan,0)
X = dataset[['Type1','Type2','Filed Transactions','population','Jr Secure Technology']]
y = dataset['Median_Price']
# hold out a test set (the 80/20 split and random_state are assumptions)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.model_selection import cross_val_score
# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,X_train,y_train,cv=5,scoring='neg_mean_squared_error')
    print('CV Mean: ',np.mean(scores))
    print('STD: ',np.std(scores))
    print('\n')
regressor = LinearRegression()
regressor.fit(X_train,y_train)
# get cross val scores
get_cv_scores(regressor)
from sklearn.linear_model import Ridge
# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train,y_train)
# get cross val scores
get_cv_scores(ridge)
# find optimal alpha with grid search
alpha = [9,10,11,12,13,14,15,100,1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge,param_grid=param_grid,scoring='neg_mean_squared_error',verbose=1,n_jobs=-1)
grid_result = grid.fit(X_train,y_train)
# best_score_ is the negated mean squared error, so values closer to zero are better
print('Best score: ',grid_result.best_score_)
print('Best Params: ',grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438
coeff_df = pd.DataFrame(regressor.coef_,X.columns,columns=['Coefficient'])
coeff_df
# predict on the held-out test set before computing the error metrics
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
Solution
Well, the RMSE does appear to have dropped after using GridSearchCV.
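One detail that may help when comparing the numbers: with scoring='neg_mean_squared_error', the reported best score is a negated MSE rather than an RMSE. The sketch below scores by RMSE directly and searches alpha on a log scale. It runs on synthetic data from make_regression, and the names X_demo, y_demo and the alpha range are assumptions for illustration, not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the asker's dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Search alpha over several orders of magnitude and score by RMSE directly
param_grid = {"alpha": np.logspace(-3, 3, 7)}
grid = GridSearchCV(Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X_demo, y_demo)

# best_score_ is the negated RMSE, so flip the sign to report it
best_rmse = -grid.best_score_
print("Best alpha:", grid.best_params_["alpha"])
print("Cross-validated RMSE:", best_rmse)
```

If the best alpha lands on the edge of the grid, that is usually a hint to widen the range and search again.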
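You can try feature selection, feature engineering, scaling your data, transforming it, or other algorithms; each of these may help reduce the RMSE to some extent.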
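As a rough sketch of how a few of these ideas could be compared, the snippet below evaluates a plain Ridge model, a scaled Ridge model, a scaled Ridge model trained on a log-transformed target, and a random forest, all by cross-validated RMSE. The data is synthetic with a skewed, price-like target, and the helper cv_rmse and the candidate list are assumptions for illustration; which variant wins on the real dataset has to be checked empirically.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a skewed, price-like target (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = np.exp(12 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1]
                + rng.normal(scale=0.2, size=300))

def cv_rmse(model, X, y):
    """Return the mean 5-fold cross-validated RMSE of a model."""
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()

candidates = {
    "ridge": Ridge(alpha=1.0),
    "scaled ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    # Fit on log1p(y) and map predictions back with expm1
    "scaled ridge, log target": TransformedTargetRegressor(
        regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        func=np.log1p, inverse_func=np.expm1),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {name: cv_rmse(model, X_demo, y_demo) for name, model in candidates.items()}
for name, rmse in results.items():
    print(f"{name}: {rmse:,.0f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the preprocessing.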
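Also, whether an RMSE value is high depends entirely on the context of your data. RMSE is expressed in the same units as the target, so a target with large values naturally produces a large RMSE. Your data points also appear to be widely spread out, which gives you a very high RMSE. The techniques I mentioned above can only reduce it to a limited extent.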
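One way to put an RMSE into context is to compare it with a trivial baseline and with the scale of the target. The sketch below does this on synthetic data with a large-valued target; the data, the rmse helper, and the choice of DummyRegressor as the baseline are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a large-valued target (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = 500_000 + 100_000 * X_demo[:, 0] + rng.normal(scale=50_000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

def rmse(model):
    """Fit on the training split and return the RMSE on the test split."""
    model.fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

baseline_rmse = rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
model_rmse = rmse(LinearRegression())

print("Baseline RMSE:", baseline_rmse)
print("Model RMSE:", model_rmse)
# Unitless views of the same error
print("Model RMSE / mean target:", model_rmse / y_te.mean())
print("Model RMSE / target std:", model_rmse / y_te.std())
```

A model whose RMSE is close to the baseline has learned little from the features, regardless of how large or small the absolute number looks.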