Problem description
I have scraped some data from Spotify to see whether I can classify the music genre of different songs. I split the data into a test set and a remaining set, which I then split again into training and validation sets.
When I run the model (I am trying to classify between 112 genres), I get about 30% accuracy on the validation set. That is not great, of course, but it is to be expected with 112 genres and limited data. What really puzzles me is that when I apply the model to the test data, the accuracy drops to 1%.
I am not sure why this happens: as far as I can tell, the validation and test data should be comparable. I train the model on the training data, which should be completely independent of both.
I must be making some mistake, either by leaking the model into the validation data (which would explain the better performance there) or by messing up my test data.
Or does applying the model twice somehow break things?
Do you have any idea what is going on, or how to debug it?
Thanks a lot! Franka
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
# re-read data
track_df = pd.read_csv('track_df_corr.csv')
features = ['acousticness', 'speechiness', 'key', 'liveness', 'instrumentalness',
            'energy', 'tempo', 'loudness', 'danceability', 'valence',
            'duration_mins', 'year', 'genre']
track_df = track_df[features]
# First make a big split of all the data into test and train sets.
train, test = train_test_split(track_df, test_size=0.2, random_state=0)
# Then create training and validation sets from the train data.
# Assign train and test data
# "full" is the data before preprocessing
X_full = train
X_test_full = test
# Select the variable to be predicted
y = X_full.genre  # the target for the TRAINING data
y = pd.factorize(y)[0]  # keep only the integer codes ([0] drops the labels); the classifier needs numbers
# Since we later want to evaluate on the test data, we also need a y_test.
# Select the variable to be predicted
y_test = X_test_full.genre  # the target for the TEST data
y_test = pd.factorize(y_test)[0]  # keep only the integer codes ([0] drops the labels)
# Remove the to-be-predicted variable from the features
X_full.drop(['genre'], axis=1, inplace=True)  # training features; the target is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True)  # not sure if necessary, but it cannot hurt
# Break off a validation set from the training data (X_full).
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
    train_test_split(X_full, y, train_size=0.8, random_state=0)
# General preprocessing steps: take care of categorical data (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns
                    if X_train_full[cname].nunique() < 10
                    and X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
# Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Run our model on the TRAINING data
# FRR: set up input values that are passed to the Bundle below
# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[  # FRR: a Pipeline chains transforms into one estimator
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(  # FRR: applies transformers to columns of an array or pandas DataFrame.
    transformers=[
        # FRR: list of (name, transformer, columns) tuples specifying the transformer
        # objects to be applied to subsets of the data.
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline.
# clf stands for classifier.
# A Pipeline can be used to chain multiple estimators into one.
# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)])
# Calling fit on the pipeline is the same as calling fit on each estimator in turn (here: preprocessor and model).
clf.fit(X_train, y_train)
# --------------------------------------------------------
# Test our model on the VALIDATION data
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Return the mean accuracy on the given data and labels.
clf.score(X_valid, y_valid)  # this is correct!
# The code yields a value around 30%.
# --------------------------------------------------------
# Apply our model to the TEST data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
# The code yields a value around 1%.
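One quick way to check whether the two label encodings line up is to keep the second value that pd.factorize returns: the array of unique labels, in the order they were assigned codes. A minimal sketch (not part of the run above; genre_train and genre_test are hypothetical names for the two genre columns before they are dropped):
import pandas as pd
# pd.factorize returns (codes, uniques); uniques[i] is the label that received code i.
codes_train, uniques_train = pd.factorize(genre_train)  # hypothetical: 'genre' column of the train split
codes_test, uniques_test = pd.factorize(genre_test)     # hypothetical: 'genre' column of the test split
# If the two uniques arrays are ordered differently, code i stands for a
# different genre in y than in y_test, and test accuracy collapses.
print(uniques_train[:5])
print(uniques_test[:5])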
Solution
The problem I see is that you are using pd.factorize to encode the train and test labels. Since you call pd.factorize on y and y_test separately, the resulting encodings will not correspond to one another. You want to use a LabelEncoder instead, so that after you fit the encoder on the train labels, you can transform y_test with the same encoding scheme.
Here is an example that illustrates this:
from sklearn.preprocessing import LabelEncoder
l = [1, 4, 6, 1, 4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([6, 1, 4])
# array([2, 0, 1], dtype=int64)
Here we get consistent encodings. However, if we apply pd.factorize to each list separately, pandas obviously cannot guess which encoding is the correct one:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([6, 1, 4])[0]
# array([0, 1, 2], dtype=int64)
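Applied to the code in the question, the fix could look like this sketch. It assumes you encode the labels before dropping the genre column from X_full and X_test_full, and that every genre in the test set also occurs in the training set, since LabelEncoder.transform raises a ValueError for labels it has not seen:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(X_full['genre'])        # learn the genre -> integer mapping on the TRAIN labels only
y_test = le.transform(X_test_full['genre'])  # reuse the SAME mapping for the test labels
If you would rather keep pd.factorize, you can hold on to the uniques it returns and reuse them to encode the test labels consistently; get_indexer assigns -1 to any label not seen during training:
y, uniques = pd.factorize(X_full['genre'])
y_test = pd.Index(uniques).get_indexer(X_test_full['genre'])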