Overfitting and k-fold cross-validation when resampling data

Problem description

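I want to reduce overfitting in my model. During feature selection I already ran a multicollinearity test to exclude features from the model. Now I need to apply k-fold cross-validation. This is a text classification problem, specifically detecting spam vs. non-spam emails. I extracted several features; for simplicity I just refer to them as categorical, numerical, and text. I did the following: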

# Definition of X and y

X=df[text_feature + categorical_features + numerical_features]
y=df[['Label']]


# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.20)

# Applying downsampling
import pandas as pd
from sklearn.utils import resample

def downsampling(data):
    # Separating classes
    spam = data[data.Label == 1]
    not_spam = data[data.Label == 0]

    # Downsampling the majority
    downsample = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)

    # Returning the new training set
    downsample_train = pd.concat([not_spam, downsample])
    return downsample_train

# X_train has no Label column, so the target is joined back before resampling
downsample_train = downsampling(pd.concat([X_train, y_train], axis=1))
train_df= downsample_train.copy() 
test_df = pd.concat([X_test,y_test],axis=1)

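# Creating the Bag of Words model and applying other pre-processors
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')

numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])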

# CountVectorizer
text_preprocessing_cv =  Pipeline(steps=[
    ('CV',CountVectorizer())
])

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF',TfidfVectorizer())       
])


preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, 'Text'),
        ('category', categorical_preprocessing, categorical_features),
        ('numeric', numeric_preprocessing, numerical_features)
    ], remainder='passthrough')

clf_lr = Pipeline(steps=[('preprocessor',preprocessing),('classifier',LogisticRegression())])
pipelines(clf_lr,X_train,X_test)

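Examples of the features I am considering are:

Text (e.g., You won an amazing price!!!, Dear John, I hope you are ready for this great news!!!!!!:), ...)
Year (e.g., 2019, 2020, ...)
#_of_characters_Subj (e.g., 34, 67, ...): this value is extracted from the subject
Address (e.g., abc@gmail.com, ghi@yahoo.com, ...)
Spam (e.g., 1, ...): this is a boolean variable, 1 for spam and 0 for non-spam

As far as I understand, resampling should be run only on the training set, to avoid overestimating performance. With k = 5, the k-fold split should use 4 folds as training data and 1 fold as test data. I tried to include cross-validation using a function: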

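import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

def bc_matrix(classifier):

    k_fold = KFold(n_splits=5)
    scores = []
    confusion = np.array([[0, 0], [0, 0]])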

    for train_ind,test_ind in k_fold.split(train_df):
        
        # Train
        train_c = train_df.iloc[train_ind]
        train_y = train_df.iloc[train_ind]['Label']
        
        # Test
        test_c =train_df.iloc[test_ind]
        test_y = train_df.iloc[test_ind]['Label']
        
        classifier.fit(train_c,train_y) # Fit the model
        predictions = classifier.predict(test_c) 
        
        confusion += confusion_matrix(test_y,predictions)
 
    
    return confusion

# K-fold cross-validation for each classifier
bc_matrix(clf_lr)

However, I run into a problem here:

---> 19 classifier.fit(train_feat, train_y)  # Fit the model

IndexError: tuple index out of range

Sample data:

Text                                                           Year  #_of_characters_Subj
You won an amazing price!!!                                    2019  34
Dear John, I hope you are ready for this great news!!!!!!!:)   2020  67
It is awesome                                                  2012  56

Address                 Spam
abc@gmail.com             1
ghi@yahoo.com             0
yes_we_can@live.com       1

where

Text is the text feature
Address is categorical
Year and #_of_characters_Subj are numerical

Spam is my target variable.
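
For reference, the column roles listed above can be sketched as a small, self-contained example. The toy frame just copies the three sample rows; the variable names `sample`, `features`, and `encoded` are hypothetical, and the target column is called `Spam` as in the sample data (the code earlier in the post calls it `Label`). The point it illustrates is that CountVectorizer expects a 1-D column (a plain string selector), while OneHotEncoder and SimpleImputer expect 2-D input (a list of column names).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the sample rows shown above
sample = pd.DataFrame({
    "Text": [
        "You won an amazing price!!!",
        "Dear John, I hope you are ready for this great news!!!!!!!:)",
        "It is awesome",
    ],
    "Year": [2019, 2020, 2012],
    "#_of_characters_Subj": [34, 67, 56],
    "Address": ["abc@gmail.com", "ghi@yahoo.com", "yes_we_can@live.com"],
    "Spam": [1, 0, 1],
})

preprocessing = ColumnTransformer(transformers=[
    # Text vectorizers take a 1-D sequence of strings: select with a plain string
    ("text", CountVectorizer(), "Text"),
    # Encoders and imputers take 2-D input: select with a list of column names
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Address"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Year", "#_of_characters_Subj"]),
])

# The target is kept out of the feature matrix
features = sample.drop(columns="Spam")
encoded = preprocessing.fit_transform(features)

# One row per email; columns = vocabulary size + one-hot addresses + numeric columns
print(encoded.shape)
```

If `categorical_features` or `numerical_features` were single strings rather than lists, the 2-D transformers would receive 1-D input, which may be worth checking when debugging the pipeline.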

I would appreciate any help fixing the error so that I can predict on the test set (hopefully cross-validation will help reduce overfitting).
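
As an illustration of the idea described in the question (resample only the training part, evaluate on untouched held-out data), here is a minimal sketch, not a confirmed fix for the IndexError. It makes several assumptions that are not in the original post: the data is synthetic, the helper names `downsample_to_smallest` and `cv_confusion` are hypothetical, StratifiedKFold with shuffling is swapped in for plain KFold so every fold contains both classes, and each class is sampled without replacement down to the size of the smallest class.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample

# Synthetic stand-in for train_df (hypothetical data: 20 spam, 40 non-spam rows)
rng = np.random.default_rng(0)
n_spam, n_ham = 20, 40
train_df = pd.DataFrame({
    "Text": ["win a free prize now"] * n_spam + ["see you at the meeting"] * n_ham,
    "Address": rng.choice(["abc@gmail.com", "ghi@yahoo.com"], size=n_spam + n_ham),
    "Year": rng.integers(2012, 2021, size=n_spam + n_ham),
    "Label": [1] * n_spam + [0] * n_ham,
})

def downsample_to_smallest(data, target="Label", random_state=42):
    """Sample every class, without replacement, down to the smallest class size."""
    groups = [group for _, group in data.groupby(target)]
    n = min(len(group) for group in groups)
    return pd.concat([
        resample(group, replace=False, n_samples=n, random_state=random_state)
        for group in groups
    ])

clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("text", CountVectorizer(), "Text"),
        ("category", OneHotEncoder(handle_unknown="ignore"), ["Address"]),
    ], remainder="passthrough")),
    ("classifier", LogisticRegression(max_iter=1000)),
])

def cv_confusion(classifier, data, target="Label", n_splits=5):
    k_fold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    confusion = np.zeros((2, 2), dtype=int)
    for train_ind, test_ind in k_fold.split(data, data[target]):
        # Resample only the training part of the fold; the held-out part stays untouched
        fold_train = downsample_to_smallest(data.iloc[train_ind], target)
        fold_test = data.iloc[test_ind]
        # Drop the target so remainder='passthrough' does not leak it as a feature
        classifier.fit(fold_train.drop(columns=target), fold_train[target])
        predictions = classifier.predict(fold_test.drop(columns=target))
        # labels=[0, 1] keeps the matrix 2x2 even if a fold lacks one class
        confusion += confusion_matrix(fold_test[target], predictions, labels=[0, 1])
    return confusion

confusion = cv_confusion(clf, train_df)
print(confusion)
```

Resampling inside the loop means the held-out fold keeps the original class balance, so the accumulated confusion matrix reflects performance on data distributed like the real inbox rather than on an artificially balanced set.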


cross-validation machine-learning python resampling scikit-learn