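Overfitting and k-fold cross-validation when resampling data

Problem description

I want to reduce overfitting in my model. During feature selection I already ran a multicollinearity test to exclude features from the model. Now I need to apply k-fold cross-validation. This is a text classification problem, specifically spam/not-spam detection. I extracted several features; for simplicity I refer to them only as categorical, numerical, and text. I did the following: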

# Definition of X and y

X = df[text_feature + categorical_features + numerical_features]
y = df[['Label']]


# Splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

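# Applying downsampling
import pandas as pd
from sklearn.utils import resample

def downsampling(data):
    # Separating classes
    spam = data[data.Label == 1]
    not_spam = data[data.Label == 0]

    # Downsampling the majority class (assumed here to be spam) to the size of the other class
    downsample = resample(spam, replace=True, n_samples=len(not_spam), random_state=42)

    # Returning the new training set
    downsample_train = pd.concat([not_spam, downsample])
    return downsample_train

# X_train has no Label column, so the target is joined back before resampling
downsample_train = downsampling(pd.concat([X_train, y_train], axis=1))
train_df = downsample_train.copy()
test_df = pd.concat([X_test, y_test], axis=1)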

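# Creating the Bag of Words model and applying the other pre-processors
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')

numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])

# CountVectorizer
text_preprocessing_cv = Pipeline(steps=[
    ('CV', CountVectorizer())
])

# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])


preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, 'Text'),
        ('category', categorical_preprocessing, categorical_features),
        ('numeric', numeric_preprocessing, numerical_features)
    ], remainder='passthrough')

clf_lr = Pipeline(steps=[('preprocessor', preprocessing), ('classifier', LogisticRegression())])
pipelines(clf_lr, X_train, X_test)  # pipelines is a helper that is not shown in the question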

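Examples of the features I am considering are:

  • Text (e.g., You won an amazing price!!!, Dear John,I hope you are ready for this great news!!!!!!!:), ...)
  • Year (e.g., 2019, 2020, ...)
  • #_of_characters_Subj (e.g., 34, 67, ...): this value is computed from the subject
  • Address (e.g., abc@gmail.com, ghi@yahoo.com, ...)
  • Spam (e.g., 1, ...): this is a boolean variable, 1 for spam and 0 for not spam

As far as I understand, resampling should be run only on the training set to avoid overestimating performance. With k=5, the k-fold split should divide the data into training data (e.g., 4 folds) and test data (1 fold). I tried to include cross-validation with a function: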

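import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

def bc_matrix(classifier):

    k_fold = KFold(n_splits=5)
    scores = []
    confusion = np.array([[0, 0], [0, 0]])

    for train_ind, test_ind in k_fold.split(train_df):

        # Train (Label is dropped so it cannot pass through as a feature)
        train_c = train_df.iloc[train_ind].drop(columns='Label')
        train_y = train_df.iloc[train_ind]['Label']

        # Test
        test_c = train_df.iloc[test_ind].drop(columns='Label')
        test_y = train_df.iloc[test_ind]['Label']

        classifier.fit(train_c, train_y)  # Fit the model
        predictions = classifier.predict(test_c)

        # labels=[0, 1] keeps the matrix 2x2 even if a fold holds a single class
        confusion += confusion_matrix(test_y, predictions, labels=[0, 1])

    # The original return statement is cut off; returning the summed matrix is an assumption
    return confusion

# K-fold cross-validation for each classifier
bc_matrix(clf_lr)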

But there is a problem here:

---> 19 classifier.fit(train_feat,train_y) # Fit the model

IndexError: tuple index out of range

Data example:

Text                                                             Year #_of_characters_Subj
You won an amazing price!!!                                      2019  34
Dear John,I hope you are ready for this great news!!!!!!!:)     2020  67
It is awesome                                                    2012  56

Address                 Spam
abc@gmail.com             1
ghi@yahoo.com             0
yes_we_can@live.com       1

where Spam is my target variable.
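For anyone who wants to reproduce the setup, the sample rows above could be loaded roughly as follows. This is only a sketch: it assumes the two blocks of the data example describe the same three rows, it renames Spam to Label because the code in the question reads the target from a Label column, and the grouping of columns into text, categorical, and numerical lists is a guess that the question does not spell out.

```python
import pandas as pd

# Rows copied from the data example in the question.
df = pd.DataFrame({
    "Text": [
        "You won an amazing price!!!",
        "Dear John,I hope you are ready for this great news!!!!!!!:)",
        "It is awesome",
    ],
    "Year": [2019, 2020, 2012],
    "#_of_characters_Subj": [34, 67, 56],
    "Address": ["abc@gmail.com", "ghi@yahoo.com", "yes_we_can@live.com"],
    "Spam": [1, 0, 1],
})

# The code in the question uses a column called Label as the target,
# so the Spam column is renamed here to match (an assumption).
df = df.rename(columns={"Spam": "Label"})

# Assumed grouping of the columns; the question does not state it.
text_feature = ["Text"]
categorical_features = ["Address"]
numerical_features = ["Year", "#_of_characters_Subj"]

X = df[text_feature + categorical_features + numerical_features]
y = df[["Label"]]
print(X.shape, y.shape)
```

Three rows are far too few to train on; the frame is only meant to make the column names and dtypes concrete.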

I would appreciate any help fixing the error so that I can predict on the test set (hopefully cross-validation will help reduce overfitting).
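The idea stated in the question, resampling only the training portion and evaluating on untouched data, can be carried into the cross-validation loop itself. The sketch below is one possible way to do that and is not a confirmed fix for the IndexError. The helper names balance_by_downsampling and cv_confusion, the toy data, and the reduced feature set are all hypothetical. It also swaps KFold for StratifiedKFold so that every fold contains both classes, and it downsamples without replacement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample


def balance_by_downsampling(frame, target="Label", seed=42):
    """Shrink every class to the size of the smallest one (no replacement)."""
    smallest = frame[target].value_counts().min()
    parts = [
        resample(group, replace=False, n_samples=smallest, random_state=seed)
        for _, group in frame.groupby(target)
    ]
    return pd.concat(parts)


def cv_confusion(model, frame, target="Label", n_splits=5, seed=42):
    """Sum the confusion matrices of the untouched validation folds."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    total = np.zeros((2, 2), dtype=np.int64)
    for train_idx, valid_idx in folds.split(frame, frame[target]):
        # Resample the training part only; the validation part keeps
        # its original class balance.
        train_part = balance_by_downsampling(frame.iloc[train_idx], target, seed)
        valid_part = frame.iloc[valid_idx]
        model.fit(train_part.drop(columns=target), train_part[target])
        predicted = model.predict(valid_part.drop(columns=target))
        total += confusion_matrix(valid_part[target], predicted, labels=[0, 1])
    return total


# Hypothetical toy data: 40 non-spam rows and 20 spam rows.
toy = pd.DataFrame({
    "Text": ["see you at the meeting"] * 40 + ["you won a prize"] * 20,
    "Address": ["abc@gmail.com", "ghi@yahoo.com"] * 30,
    "Year": [2019.0, 2020.0, np.nan] * 20,
    "Label": [0] * 40 + [1] * 20,
})

preprocessing = ColumnTransformer([
    # A plain string selects a 1-D column, which is what CountVectorizer expects;
    # the other transformers expect 2-D input, hence the lists.
    ("text", CountVectorizer(), "Text"),
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Address"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Year"]),
])
model = Pipeline([
    ("preprocessor", preprocessing),
    ("classifier", LogisticRegression(max_iter=1000)),
])

matrix = cv_confusion(model, toy)
print(matrix)
```

Because each row lands in exactly one validation fold, the entries of the summed matrix add up to the number of rows, and the row totals reflect the original class counts rather than the balanced ones. Cross-validation used this way gives a more honest estimate of performance; it does not reduce overfitting by itself.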
