I can't understand the intuition behind the FPR, TPR, threshold, and ROC values computed in my code

Question

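I want to plot an ROC curve for my classification model. Since I am new to this topic, I read up on it, went through several articles, and followed this SO answer to create the ROC curve.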

My data looks like this:

print(Y.shape)
print(predictions.shape)
print(Y)
print(predictions)

(1,400)
(1,400)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1]]
[[0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0
  1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
  0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
  1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0
  0 0 1 0]]

On running this code:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_score
print('Precsion score: '+ str(precision_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import recall_score
print('Recall score: '+ str(recall_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import f1_score
print('F1 score: '+ str(f1_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import roc_auc_score,auc,roc_curve
print('ROC score: ' + str(roc_auc_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import confusion_matrix
print('Confusion matrix: ')
print(confusion_matrix(Y.ravel(),predictions.ravel()))

fpr = dict()
tpr = dict()
threshold = dict()
roc_auc = dict()

for i in range(2):
    fpr[i],tpr[i],threshold[i] = roc_curve(Y.ravel(),predictions.ravel())
    roc_auc[i] = auc(fpr[i],tpr[i])
print(fpr,tpr,threshold,roc_auc)

plt.figure()
plt.plot(fpr[1],tpr[1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic(ROC Curve)')

Output:

Precsion score: 0.9179487179487179
Recall score: 0.895
F1 score: 0.9063291139240507
ROC score: 0.9075
Confusion matrix: 
[[184  16]
 [ 21 179]]
{0: array([0.  , 0.08, 1.  ]), 1: array([0.  , 0.08, 1.  ])} {0: array([0.   , 0.895, 1.   ]), 1: array([0.   , 0.895, 1.   ])} {0: array([2, 1, 0]), 1: array([2, 1, 0])} {0: 0.9075, 1: 0.9075}
Text(0.5,1.0,'Receiver operating characteristic(ROC Curve)')

ROC Curve

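I don't understand why a loop is used. I can see that three values are computed for each of FPR, TPR, and threshold, along with roc_auc. I did read that roc_curve expects probabilities as the target scores (I will work on that). However, I can't work out how these arrays are derived from my input data of shape (1,400).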

Thanks.

Answer

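I don't see why the loop is used either. Both iterations call roc_curve on the same inputs, so you can remove the loop and get the same result with this adjusted code: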

import matplotlib.pyplot as plt
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score,auc,roc_curve
from sklearn.metrics import confusion_matrix

# Y and predictions are the (1,400) arrays from the question, flattened to 1-D
Y = Y.ravel()
predictions = predictions.ravel()

print('Precsion score: '+ str(precision_score(Y,predictions)))
print('Recall score: '+ str(recall_score(Y,predictions)))
print('F1 score: '+ str(f1_score(Y,predictions)))
print('ROC score: ' + str(roc_auc_score(Y,predictions)))
print('Confusion matrix: ')
print(confusion_matrix(Y,predictions))

fpr,tpr,threshold = roc_curve(Y,predictions)
roc_auc = auc(fpr,tpr)

print(fpr,tpr,threshold,roc_auc)

plt.figure()
plt.plot(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic(ROC Curve)')

It produces the output:

Precsion score: 0.9179487179487179
Recall score: 0.895
F1 score: 0.9063291139240507
ROC score: 0.9075
Confusion matrix: 
[[184  16]
 [ 21 179]]
[0.   0.08 1.  ] [0.    0.895 1.   ] [2 1 0] 0.9075
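
As a sanity check, the printed scores can be reproduced by hand from the confusion matrix alone. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix reads [[TN, FP], [FN, TP]]; the variable names are illustrative.

```python
# Counts read off the confusion matrix [[184, 16], [21, 179]]
tn, fp = 184, 16
fn, tp = 21, 179

precision = tp / (tp + fp)   # 179 / 195
recall = tp / (tp + fn)      # 179 / 200, the same quantity as the TPR
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)         # 16 / 200

print(round(precision, 4), round(recall, 4), round(f1, 4), round(fpr, 4))
```

The recall and the false positive rate computed here are exactly the middle entries of the tpr and fpr arrays in the output.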


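You passed 400 data points to roc_curve, but only three points appear in the plot because your predictions contain only two unique values (0 and 1). roc_curve evaluates a threshold at each distinct score and adds one above the maximum so that the curve starts at (0, 0), which gives the thresholds 2, 1, 0 and three (FPR, TPR) points here.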
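
To make the three points concrete, here is a minimal sketch that rebuilds them by hand. It assumes a sample counts as positive when its score is greater than or equal to the threshold, and it takes the thresholds 2, 1, 0 from the printed output. The label lists are reconstructed from the confusion-matrix counts rather than copied from the original arrays, and the helper name rates_at is illustrative.

```python
# Label lists with the same counts as the confusion matrix:
# 184 true negatives, 16 false positives, 21 false negatives, 179 true positives.
y_true = [0] * 184 + [0] * 16 + [1] * 21 + [1] * 179
y_pred = [0] * 184 + [1] * 16 + [0] * 21 + [1] * 179

def rates_at(threshold):
    """Return (FPR, TPR) when score >= threshold counts as positive."""
    fp = sum(1 for t, s in zip(y_true, y_pred) if t == 0 and s >= threshold)
    tp = sum(1 for t, s in zip(y_true, y_pred) if t == 1 and s >= threshold)
    return fp / y_true.count(0), tp / y_true.count(1)

points = [rates_at(th) for th in (2, 1, 0)]
print(points)

# Trapezoidal area under the piecewise-linear curve through the points
area = sum((x1 - x0) * (y0 + y1) / 2
           for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(round(area, 4))
```

Threshold 2 lies above every score, so nothing is predicted positive and the curve starts at (0, 0). Threshold 1 gives the single informative point (0.08, 0.895), and threshold 0 marks every sample positive, which ends the curve at (1, 1). The trapezoidal area through these points matches the reported 0.9075.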

To quote the answer here:

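The number of points depends on the number of unique values in the input. Since the input vector has only 2 unique values, the function gives correct output.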
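
The dependence on unique values can be seen with a small threshold sweep. This is a simplified sketch, not scikit-learn's implementation: it assumes one candidate threshold per distinct score plus a starting point above the maximum, and the function name roc_points and the toy data are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return one (fpr, tpr) pair per distinct score, plus the (0, 0) start."""
    n_neg = y_true.count(0)
    n_pos = y_true.count(1)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for th in sorted(set(scores), reverse=True):
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= th)
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= th)
        points.append((fp / n_neg, tp / n_pos))
    return points

y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]                  # hypothetical model scores
hard_labels = [int(p >= 0.5) for p in probabilities]   # [0, 0, 0, 1]

print(len(roc_points(y_true, hard_labels)))    # 2 unique values -> 3 points
print(len(roc_points(y_true, probabilities)))  # 4 unique values -> 5 points
```

Passing predicted probabilities instead of hard 0/1 labels (for example from a classifier's predict_proba, if your model provides one) therefore gives roc_curve more thresholds to evaluate and a finer curve.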