Python Reuters text classification: MDC and Naive Bayes give bad predictions

Problem description

I have the following code to build a bag-of-words representation of the documents:

It returns an array like this:

[[1 1 1 ... 0 0 0]
[1 1 1 ... 0 0 0]
[0 0 0 ... 0 0 0]
       ...
[1 1 1 ... 0 0 0]
[0 0 0 ... 0 0 0]
[1 1 1 ... 0 0 0]]

This works with my kNN implementation. The problem appears when I try to implement MDC and Naive Bayes: both implementations predict acq acq acq...

Naive Bayes classifier

def fit(self,features,target):
    self.classes = np.unique(target)
    self.count = len(self.classes)
    self.feature_nums = features.shape[1]
    self.rows = features.shape[0]
    
    self.calc_statistics(features,target)
    self.calc_prior(features,target)
    
def predict(self,features):
    preds = [self.calc_posterior(f) for f in features.to_numpy()]
    return preds

Minimum distance classifier

def fit(self,X,y):
    self.class_list = np.unique(y,axis=0)
    
    self.centroids = np.zeros((len(self.class_list),X.shape[1]))# each row is a centroid
    
    for i in range(len(self.class_list)): # for each class,we evaluate its centroid
        temp = np.where(y==self.class_list[i])[0]
        self.centroids[i,:] = np.mean(X[temp],axis=0)
        
        
def predict(self,X):
    temp = np.argmin(
        cdist(X,self.centroids),# distance between each pair of the two collections of inputs
        axis=1
    )
    y_pred = np.array([self.class_list[i] for i in temp])

    return y_pred

Solution

I think your mistake is using Gaussian NB with boolean vectors as the document representation:

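Gaussian NB assumes that the conditional probability of each feature given the class follows a normal distribution. If you represented each document by the frequency of every word in it, that might be acceptable, because a frequency is a numerical variable whose distribution could be close to normal (well, at least it has a fighting chance of being normal).
A boolean representation, however, definitely does not follow a normal distribution. Each feature follows a simple Bernoulli distribution instead. You can easily modify your NB implementation to use a Bernoulli distribution rather than a Gaussian. It is simpler in this case and definitely more appropriate, so that is what I would recommend.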
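To make the suggestion concrete, here is a minimal sketch of a Bernoulli NB. It is not your code: the class name `SimpleBernoulliNB`, the attribute names, and the Laplace smoothing parameter `alpha` are my own assumptions, and the toy data at the bottom is invented for illustration. It assumes `X` is a 0/1 NumPy array like the one you show above.

```python
import numpy as np


class SimpleBernoulliNB:
    """Illustrative Bernoulli Naive Bayes for 0/1 feature vectors (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        # alpha is Laplace smoothing, an assumption added here so that a word
        # never seen in a class does not produce log(0)
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.log_prior_ = np.zeros(n_classes)
        self.log_p_ = np.zeros((n_classes, n_features))  # log P(word present | class)
        self.log_q_ = np.zeros((n_classes, n_features))  # log P(word absent | class)
        for i, c in enumerate(self.classes_):
            Xc = X[y == c]
            self.log_prior_[i] = np.log(Xc.shape[0] / X.shape[0])
            # smoothed fraction of class-c documents that contain each word
            p = (Xc.sum(axis=0) + self.alpha) / (Xc.shape[0] + 2 * self.alpha)
            self.log_p_[i] = np.log(p)
            self.log_q_[i] = np.log(1.0 - p)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # log posterior up to a constant: present words contribute log p,
        # absent words contribute log(1 - p), plus the class log prior
        scores = X @ self.log_p_.T + (1.0 - X) @ self.log_q_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Toy data, invented for illustration only
X_toy = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1]])
y_toy = np.array(["acq", "acq", "earn", "earn"])
toy_preds = SimpleBernoulliNB().fit(X_toy, y_toy).predict(np.array([[1, 1, 0, 0],
                                                                    [0, 0, 1, 1]]))
print(toy_preds)
```

Note that, unlike the Gaussian version, absent words also contribute to the score through the `log(1 - p)` term, which is what makes the model match a boolean representation.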

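Note: I see that you selected the 2000 most frequent words in the corpus. In general, selecting the top N most frequent words is a good idea, but I would expect NB to be quite sensitive to the value of N, so you may need to tune it to get good results.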
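One rough way to tune N is to sweep a few values and compare accuracy on a held-out validation split. The sketch below is only an illustration under my own assumptions: `top_n_columns` and `sweep_n` are hypothetical helper names, "most frequent" is taken to mean "appears in the most documents" (which is all a 0/1 matrix can tell you), and `make_clf` is assumed to return any object with `fit(X, y)` and `predict(X)` methods that accept NumPy arrays.

```python
import numpy as np


def top_n_columns(X, n):
    """Indices of the n columns (words) that occur in the most documents."""
    doc_freq = np.asarray(X).sum(axis=0)
    # stable sort so that ties keep their original column order
    return np.argsort(-doc_freq, kind="stable")[:n]


def sweep_n(X_train, y_train, X_val, y_val, candidates, make_clf):
    """Return {n: validation accuracy} for each candidate vocabulary size n."""
    X_train, X_val = np.asarray(X_train), np.asarray(X_val)
    y_val = np.asarray(y_val)
    results = {}
    for n in candidates:
        # choose the vocabulary from the training data only
        cols = top_n_columns(X_train, n)
        clf = make_clf()  # fresh classifier for every n
        clf.fit(X_train[:, cols], y_train)
        preds = np.asarray(clf.predict(X_val[:, cols]))
        results[n] = float(np.mean(preds == y_val))
    return results
```

You would call it with something like `sweep_n(X_train, y_train, X_val, y_val, [500, 1000, 2000], MyClassifier)`, where `MyClassifier` stands in for your own class, and then keep the N with the best validation accuracy.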

python text-classification