Large performance gap between validation and test data

Problem description

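I scraped some data from Spotify to see whether I can classify songs by music genre. I split the data into a test set and a remainder, and then split the remainder into training and validation sets.

When I run the model (I am trying to classify between 112 genres), I get 30% accuracy on the validation set. That is not good, of course, but it is to be expected with 112 genres and limited data. What really puzzles me is that when I apply the model to the test data, accuracy drops to 1%.

I am not sure why this happens: as far as I understand, the validation and test data should be comparable. I train the model on the training data, which should be completely independent of both.

I must be making a mistake somewhere, either leaking validation data into the model (hence the better performance there) or messing up my test data.

Or does applying the model twice mess things up?

Do you have any idea what could be going on, or how to debug this?

Thanks a lot! Franka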


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

# re-read data
track_df = pd.read_csv('track_df_corr.csv') 


features = [ 'acousticness','speechiness','key','liveness','instrumentalness','energy','tempo','loudness','danceability','valence','duration_mins','year','genre']


track_df = track_df[features]

#First make a big split of all the data into test and train.
train,test = train_test_split(track_df,test_size=0.2,random_state = 0)

#Then create training and validation data set from the traindata.
# Read the data. Assign train and test data
# "full" is the data before preprocessing
X_full = train 
X_test_full = test 

# select the target to be predicted
y = X_full.genre # the target for the training data
y = pd.factorize(y)[0] # keep only the integer codes ([0]); the classifier needs numbers

# Since we later want to evaluate on the test data, we also need a y_test.
y_test = X_test_full.genre # the target for the test data
y_test = pd.factorize(y_test)[0] # keep only the integer codes ([0]); the classifier needs numbers


# remove the target column from the features
X_full.drop(['genre'],axis=1,inplace=True) # features are now free of the target, which is stored in y
X_test_full.drop(['genre'],axis=1,inplace=True) # same for the test features


# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independent test set.
# Here we just create our training and validation sets from X_full.
X_train_full,X_valid_full,y_train,y_valid = \
            train_test_split(X_full,y,train_size=0.8,random_state=0)
 
# General preprocessing steps: take care of categorical data (does not apply here).

categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64','float64']]



# Keep selected columns only
my_cols = categorical_cols + numerical_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()



#Time to run the model.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below

# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median') 


# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms with a "final estimator", here "categorical_transformer".
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore'))
])


# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
    transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to 
                        #be applied to subsets of the data.
        ('num',numerical_transformer,numerical_cols),('cat',categorical_transformer,categorical_cols)
    ])

# Define model
model = RandomForestClassifier(n_estimators=100,random_state=0)

# Bundle preprocessing and modeling code in a pipeline
# clf stands for classifier.
# Pipeline can be used to chain multiple estimators into one

# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor',preprocessor),('model',model)
                     ])


# Calling fit on the pipeline fits each step in turn (here: the preprocessor, then the model)
clf.fit(X_train,y_train)


# --------------------------------------------------------

#Test our model on the VALIDATION data

# Preprocessing of validation data,get predictions
preds = clf.predict(X_valid)

# Return the mean accuracy on the given test data and labels.
clf.score(X_valid,y_valid) # this is correct! 

# The code yields a value around 30%. 

# --------------------------------------------------------

# Apply our model on the TESTING data
# Preprocessing of test data, get predictions
preds_test = clf.predict(X_test)
clf.score(X_test,y_test)

#The code yields a value around 1%.

Solution

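The problem I see is that you are using pd.factorize to encode both the train and the test labels. Because you call pd.factorize separately on y and on y_test, the resulting encodings will not correspond to each other: pd.factorize numbers the values in order of first appearance, so the same genre can end up with a different integer in each set. You want to use LabelEncoder instead, so that you fit the encoder on the training data and then transform y_test with that same encoding scheme.

Here is an example that illustrates this: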

import pandas as pd
from sklearn.preprocessing import LabelEncoder

l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([6,1,4])
# array([2, 0, 1], dtype=int64)

Here we get the correct encoding. If we apply pd.factorize instead, however, pandas has no way of guessing which encoding is the right one:

pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([6,1,4])[0]
# array([0, 1, 2], dtype=int64)
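
Applied to the setup in the question, the same idea might look roughly like the sketch below. The small toy_df frame and its values are made up for illustration and only stand in for track_df; the split is done by position so that the result does not depend on a random seed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for track_df: one feature plus the genre label.
toy_df = pd.DataFrame({
    "tempo": [120, 90, 140, 100, 128, 95, 150, 110],
    "genre": ["rock", "jazz", "pop", "rock", "pop", "jazz", "rock", "pop"],
})

# Fixed positional split, purely for illustration.
train = toy_df.iloc[:6]
test = toy_df.iloc[6:]

# Fit the encoder on the training labels only ...
le = LabelEncoder()
y = le.fit_transform(train["genre"])

# ... and reuse the same mapping for the test labels.
y_test = le.transform(test["genre"])

# For comparison: factorizing the test labels on their own
# numbers them by order of first appearance in the test set.
y_test_factorized = pd.factorize(test["genre"])[0]

print(list(le.classes_))          # genre behind each integer code
print(list(y_test))               # codes consistent with y
print(list(y_test_factorized))    # codes that only make sense within test
```

In this toy data the test labels are rock and pop. LabelEncoder maps them to the codes those genres also have in y, whereas pd.factorize simply assigns 0 and 1, which mean something else in the training labels. Note that le.transform raises a ValueError for a label it did not see during fit, so with 112 genres it may be safer to fit the encoder on the full genre column before splitting.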

machine-learning python random-forest scikit-learn