Feature selection drastically reduces accuracy

Problem description

I have been using PySwarms, specifically discrete.BinaryPSO, to perform feature selection, since it is an optimization technique that helps with feature subset selection to improve classification performance. (https://pyswarms.readthedocs.io/en/development/examples/feature_subset_selection.html)

My dataset consists of text data with corresponding labels (identified as 1s and 0s). After preprocessing, I applied CountVectorizer and TfidfTransformer to the text data.
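For reference, a minimal sketch of what that vectorization step might look like; the names texts_train and texts_test are placeholders assumed for illustration (they are not in the original post), and .toarray() is used only so that the boolean column indexing in the code below works on a dense array:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vec = CountVectorizer()
tfidf = TfidfTransformer()

# Fit the vocabulary and IDF weights on the training text only,
# then apply the same transforms to the test text
training_data = tfidf.fit_transform(count_vec.fit_transform(texts_train)).toarray()
testing_data = tfidf.transform(count_vec.transform(texts_test)).toarray()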

However, a simple sklearn machine learning classifier predicts with much higher accuracy than the same classifier combined with PySwarms. No matter which dataset, preprocessing techniques, and functions I use when incorporating discrete BinaryPSO, my accuracy, precision, and recall are all lower than with a simple sklearn machine learning classifier.

My code is attached below; any help with this situation is appreciated:

import numpy as np
import pyswarms as ps
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create an instance of the classifier
classifier = LogisticRegression()

# Define objective function
def f_per_particle(m, alpha):
    """Compute the objective for a single particle.

    m is the particle's binary position: a 1 keeps the corresponding
    feature, a 0 drops it. alpha trades off classification
    performance against the size of the selected subset.
    """
    total_features = training_data.shape[1]
    # Get the subset of the features from the binary mask
    if np.count_nonzero(m) == 0:
        X_subset = training_data
    else:
        X_subset = training_data[:, m == 1]
    # Perform classification and store performance in P
    # (note: P is measured on the same data the classifier was fitted
    # on, i.e. training accuracy, as in the PySwarms example)
    classifier.fit(X_subset, y_train)
    P = (classifier.predict(X_subset) == y_train).mean()
    # Compute the objective: weigh misclassification against the
    # fraction of features retained
    j = (alpha * (1.0 - P)
         + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))

    return j

def f(x, alpha=0.88):
    """Higher-level method to do classification in the whole swarm.

    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search
    alpha: float
        Trade-off weight passed on to f_per_particle

    Returns
    -------
    numpy.ndarray of shape (n_particles,)
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)


# c1/c2/w are the usual PSO coefficients; k (number of neighbours)
# and p (Minkowski p-norm) are required by BinaryPSO
options = {'c1': 0.5, 'c2': 0.5, 'w': 0.9, 'k': 10, 'p': 2}

# Call instance of PSO
dimensions = training_data.shape[1]  # dimensions should be the number of features
optimizer = ps.discrete.BinaryPSO(n_particles=10, dimensions=dimensions, options=options)

# Perform optimization
cost,pos = optimizer.optimize(f,iters=10)

print('selected features = ' + str(sum((pos == 1) * 1)) + '/' + str(len(pos)))

# Baseline: fit on all features and evaluate on the held-out test set
classifier.fit(training_data, y_train)
print('accuracy before FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data), normalize=True) * 100))

# Fit on the PSO-selected subset and evaluate on the same columns of the test set
X_subset = training_data[:, pos == 1]
classifier.fit(X_subset, y_train)
print('accuracy after FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data[:, pos == 1]), normalize=True) * 100))

Solution

Since the feature selection is not producing better performance, I would suggest using all the features in your machine learning model and looking at the impact of each feature instead. You may find SHAP (https://shap.readthedocs.io/en/latest/index.html) helpful for explaining the output, and then looking at the importance of each feature for this purpose.
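A minimal sketch of that workflow, assuming the LogisticRegression and the dense training_data/testing_data arrays from the question, plus a fitted CountVectorizer named count_vec (a placeholder name) to label the features:

import shap

# Fit on all features, as suggested
classifier.fit(training_data, y_train)

# LinearExplainer suits linear models such as LogisticRegression
explainer = shap.LinearExplainer(classifier, training_data)
shap_values = explainer.shap_values(testing_data)

# Rank the features (here: vocabulary terms) by their impact on the output
shap.summary_plot(shap_values, testing_data,
                  feature_names=count_vec.get_feature_names_out())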