How can I view the data after it has passed through a classifier?

Problem description

I have a Bernoulli classifier that I run on a set of randomly categorized data. The model performs reasonably well (~92%), but I would like to know whether there is a way to inspect the data after it has passed through the classifier (i.e. to see which data point the classifier assigned to which class). Here is my current code:

docs_train_n, docs_test_n, y_train_n, y_test_n = train_test_split(x_dt.sent_lemmas, x_dt.iloc[:, 1], test_size=0.33, random_state=12)

dtm_train_n=cv_tdif.fit_transform(docs_train_n)
dtm_test_n=cv_tdif.transform(docs_test_n)


clf_n = BernoulliNB()
clf_n.fit(dtm_train_n,y_train_n)
y_pred_n = clf_n.predict(dtm_test_n)
cm_n= confusion_matrix(y_test_n,y_pred_n)

print(cm_n)


# ROC-AUC curves


y_pred_prob_n = clf_n.predict_proba(dtm_test_n)[:,1]
fpr,tpr,thresholds = roc_curve(y_test_n,y_pred_prob_n)

plt.subplots(1,figsize=(10,10))
plt.title('Receiver Operating Characteristic - TD-IF Vectorizer: BernoulliNB')
plt.plot(fpr,tpr)
plt.plot([0, 1], ls="--")  # chance-level diagonal
plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")  # perfect-classifier reference lines
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

auc = roc_auc_score(y_true=y_test_n,y_score=y_pred_prob_n)
print('Area under curve is {}'.format(round(auc,2)))

Solution

The classifier's output comes in two "modes": `.predict` returns discrete class labels, while `.predict_proba` returns a vector of probabilities of belonging to each class. So for a two-class scenario this output is not binary, which means the request to see

which data point was classified as what by the classifier

is not well defined.
(Rather, you get a continuous score that can be interpreted as the probability of belonging to one class or the other.) You can view this information in several formats, e.g. a scatter plot of the scores colored by the true class label (`plt.scatter`), per-class score distributions (seaborn's `sns.distplot`), and so on.
Alternatively, you can view a binarized classification output by thresholding the scores and then displaying examples of the true/false positive/negative cases.
Different thresholds give different results, which is exactly what your ROC curve shows: performance varies as the threshold changes.
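As a minimal sketch of that thresholding idea (using small hypothetical arrays standing in for your `y_test_n` and `y_pred_prob_n`), you can recover the indices behind each cell of the confusion matrix and then look up those documents in the test set:

```python
import numpy as np

# Hypothetical stand-ins for y_test_n (true labels) and
# y_pred_prob_n (scores from predict_proba)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.7, 0.9, 0.4, 0.1, 0.8])

# Binarize the scores at a chosen threshold
threshold = 0.5
y_bin = (y_score >= threshold).astype(int)

# Indices of each confusion-matrix cell, usable to look up
# the corresponding documents in the test set
tp = np.flatnonzero((y_bin == 1) & (y_true == 1))
fp = np.flatnonzero((y_bin == 1) & (y_true == 0))
fn = np.flatnonzero((y_bin == 0) & (y_true == 1))
tn = np.flatnonzero((y_bin == 0) & (y_true == 0))

print("false positives at indices:", fp)
```

Changing `threshold` moves samples between these four groups, which is the threshold-varying behavior the ROC curve summarizes.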


More generally, I strongly recommend that you understand what your code actually does. Just writing code without knowing what the different functions you use actually mean is not the way to solve the problem...
Here is an annotated version of your code that should help put you on the right track:

# Split your dataset into two distinct subsets, with a train:test ratio of ~2:1.
# Output is train-inputs, test-inputs, train-labels, test-labels, in that order.
docs_train_n, docs_test_n, y_train_n, y_test_n = train_test_split(x_dt.sent_lemmas, x_dt.iloc[:, 1], test_size=0.33, random_state=12)

# No code is shown for cv_tdif, but I'm guessing it is a vectorizer (embedding
# transformation). The output is the input data mapped into the feature space
# used to fit your Bernoulli naive Bayes model. Note that the vectorizer is
# fitted on the training data only, then reused to transform the test data.
dtm_train_n = cv_tdif.fit_transform(docs_train_n)
dtm_test_n = cv_tdif.transform(docs_test_n)

# Create a Bernoulli naive Bayes classifier instance
clf_n = BernoulliNB()

# Fit the model with the training data (inputs and labels)
clf_n.fit(dtm_train_n,y_train_n)

# Use the trained model to predict labels for the test inputs.
# Outputs binary predictions, thresholded at 0.5
# (more generally, for multi-class, by taking the class with maximal probability)
y_pred_n = clf_n.predict(dtm_test_n)

# Produce a confusion matrix comparing true labels and estimated labels
cm_n = confusion_matrix(y_test_n, y_pred_n)

# Print said confusion-matrix
print(cm_n)


# ROC-AUC curves

# Get probability predictions; instead of keeping the probability for each
# class, only return the probabilities of the second class
y_pred_prob_n = clf_n.predict_proba(dtm_test_n)[:, 1]

# Produce the false-positive rate and true-positive rate as the threshold
# varies, along with the thresholds that produce each specific FPR and TPR.
# TPR as a function of FPR is exactly the ROC curve
fpr, tpr, thresholds = roc_curve(y_test_n, y_pred_prob_n)

# Plot ROC curve (with title,defining lines for qualitative comparison,labels,etc.)
plt.subplots(1,figsize=(10,10))
plt.title('Receiver Operating Characteristic - TD-IF Vectorizer: BernoulliNB')
plt.plot(fpr,tpr)
plt.plot([0, 1], ls="--")  # chance-level diagonal
plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")  # perfect-classifier reference lines
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Calculate (and print) the area under the ROC curve.
# This is an aggregate scalar describing the quality of your classifier:
# ROC-AUC = 1 is a perfect classifier, ROC-AUC = 0.5 is as good as random.
auc = roc_auc_score(y_true=y_test_n,y_score=y_pred_prob_n)
print('Area under curve is {}'.format(round(auc,2)))
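To see the two output "modes" side by side, here is a minimal self-contained sketch on a tiny synthetic binary-feature dataset (the data is made up purely for illustration; it stands in for your vectorized documents):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Tiny synthetic dataset with binary features (illustration only)
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

clf = BernoulliNB()
clf.fit(X, y)

labels = clf.predict(X)       # discrete class labels, shape (n_samples,)
probs = clf.predict_proba(X)  # probabilities, shape (n_samples, n_classes)

# .predict is just the argmax over the columns of .predict_proba
assert (labels == clf.classes_[probs.argmax(axis=1)]).all()
```

The assertion at the end is the key point: the discrete labels are a thresholded (argmax) view of the continuous probabilities, so any per-sample inspection you want to do can start from `predict_proba`.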

The estimated labels are in the same order as the inputs, so every input has a specific, exact corresponding output.
This allows inspecting a particular test sample by index:

ind = 17
test_document = docs_test_n.iloc[ind]  # use .iloc: the Series keeps its original (shuffled) index after the split
test_embedded_vector = dtm_test_n[ind]
prob_of_belonging_to_class_2 = y_pred_prob_n[ind]

as well as producing the plots I recommended above.
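Those recommended plots could look something like the following sketch. The scores and labels here are synthetic stand-ins for your `y_pred_prob_n` and `y_test_n`, and I use a plain matplotlib histogram in place of seaborn's `sns.distplot`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic scores/labels standing in for y_pred_prob_n / y_test_n
rng = np.random.default_rng(12)
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.4 * y_true + 0.3 + 0.2 * rng.standard_normal(200), 0, 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter of the scores, colored by the true class label
ax1.scatter(np.arange(len(y_score)), y_score, c=y_true, cmap="coolwarm", s=10)
ax1.set(xlabel="test-sample index", ylabel="predicted P(class 2)")

# Per-class score distributions
for cls in (0, 1):
    ax2.hist(y_score[y_true == cls], bins=20, alpha=0.6, label=f"true class {cls}")
ax2.set(xlabel="predicted P(class 2)")
ax2.legend()

fig.savefig("score_plots.png")
```

The more the two histograms separate, the better the classifier, and the less sensitive your results are to the exact choice of threshold.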