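Sklearn accuracy score does not match the output of a Naive Bayes classifier

Problem description

I have the following situation: given a list of 500,000 strings, I need to tell which ones refer to businesses and which refer to people.

A simplified example of the problem:

Stackoverflow LLC -> Business
John Doe -> Person
John Doe Inc. -> Business

Luckily for me, all 500,000 names are labeled, so this becomes a supervised problem. Yay.

The first model I ran was a simple Naive Bayes (multinomial). Here is the code: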

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(df["CUST_NM_CLEAN"],df["LABEL"],test_size=0.20,random_state=1)

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. 
testing_data = count_vector.transform(X_test)

# In this case we try multinomial; there are two other methods
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
#MultinomialNB(alpha=1.0,class_prior=None,fit_prior=True)

predictions = naive_bayes.predict(testing_data)


from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test,predictions)))
print('Precision score: {}'.format(precision_score(y_test,predictions,pos_label='Org')))
print('Recall score: {}'.format(recall_score(y_test,predictions,pos_label='Org')))
print('F1 score: {}'.format(f1_score(y_test,predictions,pos_label='Org')))

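The results I get:

Accuracy score: 0.9524850665857665
Precision score: 0.9828196680932295
Recall score: 0.8890405236039549
F1 score: 0.9335809546092653

Not too shabby for a start. However, when I export the results to a file and compare the predictions with the labels, I get a very low accuracy, around 60%. That is far from the 95% score reported by sklearn...

Any ideas?

This is how I write the output file, which may be where the problem is: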

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))
mnb_results = pd.DataFrame(mnb_results,columns=['name','predicted','label'])
mnb_results.to_csv('mnb_vectorized.csv',index = False)

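P.S. I am a newbie here, so I apologize if there is an obvious solution.

Solution

One thing to watch is the export to CSV. If you are using the CSV for validation, I think you need to export X_test, y_test, and predictions. The predictions exist only for the 20% test split, which train_test_split samples in shuffled order, while df["CUST_NM_CLEAN"] and df["LABEL"] hold every row in the original order. zip stops at its shortest input, so each prediction ends up paired with an unrelated row from the start of df. You can also run cross-validation to check that the model performs as expected.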
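To see why the exported file disagrees with the sklearn score, here is a minimal sketch on made-up data. The toy names and labels and the `fake_predictions` list are assumptions for illustration only; the split call mirrors the one in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; names and labels are invented for illustration.
df = pd.DataFrame({
    "CUST_NM_CLEAN": ["acme llc", "john doe", "doe inc", "jane roe", "roe llc",
                      "bob ray", "ray inc", "amy lee", "lee llc", "tom fox"],
    "LABEL": ["Org", "Person", "Org", "Person", "Org",
              "Person", "Org", "Person", "Org", "Person"],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["CUST_NM_CLEAN"], df["LABEL"], test_size=0.20, random_state=1
)

# The test split is a shuffled 20% sample: shorter than df, and its
# index shows which original rows it was drawn from.
print(len(df), len(X_test))
print(list(X_test.index))

# Hypothetical predictions: pretend the model is perfect on the test set.
fake_predictions = list(y_test)

# zip() stops at the shortest input, so the full-length columns are silently
# cut down to the first len(X_test) rows of df, not to the test rows.
paired = list(zip(df["CUST_NM_CLEAN"], df["LABEL"], fake_predictions))
print(len(paired))
print([name for name, _, _ in paired])
print(list(X_test))
```

Comparing the last two printed lists shows whether the names written to the file are the same rows the predictions were made for. In general they are not.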

Old:

mnb_results = np.array(list(zip(df["CUST_NM_CLEAN"].values.tolist(),df["LABEL"],predictions)))

Changed:

mnb_results = np.array(list(zip(X_test,y_test,predictions)))
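As a rough end-to-end sketch of the corrected export, the snippet below trains on toy data (invented here, not the asker's data), builds the results table from the test split only, and checks that the accuracy recomputed from the table agrees with `accuracy_score`. Building the DataFrame from a dict rather than from `np.array(list(zip(...)))` is a choice made for this sketch so that each column is named explicitly.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the real data; names and labels are invented for illustration.
df = pd.DataFrame({
    "CUST_NM_CLEAN": ["acme llc", "john doe", "doe inc", "jane roe", "roe llc",
                      "bob ray", "ray inc", "amy lee", "lee llc", "tom fox"],
    "LABEL": ["Org", "Person", "Org", "Person", "Org",
              "Person", "Org", "Person", "Org", "Person"],
})

X_train, X_test, y_train, y_test = train_test_split(
    df["CUST_NM_CLEAN"], df["LABEL"], test_size=0.20, random_state=1
)

# Same steps as in the question: vectorize, fit, predict on the test split.
count_vector = CountVectorizer()
naive_bayes = MultinomialNB()
naive_bayes.fit(count_vector.fit_transform(X_train), y_train)
predictions = naive_bayes.predict(count_vector.transform(X_test))

# Every column comes from the test split, so the rows line up one to one.
mnb_results = pd.DataFrame({
    "name": X_test.values,
    "label": y_test.values,
    "predicted": predictions,
})

# Accuracy recomputed from the table should agree with sklearn's score.
csv_accuracy = (mnb_results["label"] == mnb_results["predicted"]).mean()
print(csv_accuracy, accuracy_score(y_test, predictions))

mnb_results.to_csv("mnb_vectorized.csv", index=False)
```

If predictions for all rows are wanted in the file, one option is to call `count_vector.transform` on the full name column and predict on that, keeping in mind that the score on rows the model was trained on is optimistic.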

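More details:

# Compute the accuracy score with NumPy (other metrics can be computed similarly).
# Both arrays must have the same length so they compare element by element.
import numpy as np
true = np.asarray([1.0,0.0,1.0,1.0])
predictions = np.asarray([1.0,1.0,1.0,1.0])
print("Accuracy:{}".format(np.mean(true==predictions)))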
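The cross-validation check mentioned in the answer could look roughly like this. It is a sketch on invented toy data; wrapping the vectorizer and classifier in a pipeline and using 5 folds are assumptions of this example, not something stated in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data; names and labels are invented for illustration.
names = ["acme llc", "john doe", "doe inc", "jane roe", "roe llc",
         "bob ray", "ray inc", "amy lee", "lee llc", "tom fox"]
labels = ["Org", "Person", "Org", "Person", "Org",
          "Person", "Org", "Person", "Org", "Person"]

# The pipeline refits the vectorizer inside each fold, so no vocabulary
# from the held-out fold leaks into training.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# One accuracy value per fold; a large spread or a mean far from the
# single-split score would be a reason to look closer.
scores = cross_val_score(model, names, labels, cv=5, scoring="accuracy")
print(scores)
print(scores.mean())
```

On the real data, `names` and `labels` would be replaced by `df["CUST_NM_CLEAN"]` and `df["LABEL"]`.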

naivebayes python scikit-learn