I can't get an intuition for the FPR, TPR, threshold, and ROC values calculated in my code

Problem description

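I want to plot a ROC curve for my classification model. As I am new to this, I read about it, went through a few articles, and referred to this SO answer to create the ROC curve.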

My data looks like this:

print(Y.shape)
print(predictions.shape)
print(Y)
print(predictions)

(1,400)
(1,400)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1]]
[[0 1 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0 0
  1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
  0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1
  1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0
  0 0 1 0]]

I then ran this code:

from sklearn.metrics import precision_score
print('Precsion score: '+ str(precision_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import recall_score
print('Recall score: '+ str(recall_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import f1_score
print('F1 score: '+ str(f1_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import roc_auc_score,auc,roc_curve
print('ROC score: ' + str(roc_auc_score(Y.ravel(),predictions.ravel())))

from sklearn.metrics import confusion_matrix
print('Confusion matrix: ')
print(confusion_matrix(Y.ravel(),predictions.ravel()))

fpr = dict()
tpr = dict()
threshold = dict()
roc_auc = dict()

for i in range(2):
    fpr[i],tpr[i],threshold[i] = roc_curve(Y.ravel(),predictions.ravel())
    roc_auc[i] = auc(fpr[i],tpr[i])
print(fpr,tpr,threshold,roc_auc)

import matplotlib.pyplot as plt

plt.figure()
plt.plot(fpr[1],tpr[1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic(ROC Curve)')

Output:

Precsion score: 0.9179487179487179
Recall score: 0.895
F1 score: 0.9063291139240507
ROC score: 0.9075
Confusion matrix: 
[[184  16]
 [ 21 179]]
{0: array([0.  , 0.08, 1.  ]), 1: array([0.  , 0.08, 1.  ])} {0: array([0.   , 0.895, 1.   ]), 1: array([0.   , 0.895, 1.   ])} {0: array([2, 1, 0]), 1: array([2, 1, 0])} {0: 0.9075, 1: 0.9075}
Text(0.5,1.0,'Receiver operating characteristic(ROC Curve)')

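I don't understand why the loop is used. I can see that three values are computed in each array for FPR, TPR, and threshold, plus one roc_auc value per key. I did read that roc_curve expects probabilities as the target scores (I will work on that). But I can't work out how these arrays are computed from the (1,400)-dimensional input.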

Thanks.

Solution

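I don't see why the loop is used either: both iterations call roc_curve with exactly the same arguments, so they produce identical results. Removing the loop and adjusting the code gives the same functionality: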

import matplotlib.pyplot as plt
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score,auc,roc_curve
from sklearn.metrics import confusion_matrix

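# Y and predictions are the 400-element arrays from the question, flattened to 1-D
Y = Y.ravel()

predictions = predictions.ravel()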

print('Precsion score: '+ str(precision_score(Y,predictions)))
print('Recall score: '+ str(recall_score(Y,predictions)))
print('F1 score: '+ str(f1_score(Y,predictions)))
print('ROC score: ' + str(roc_auc_score(Y,predictions)))
print('Confusion matrix: ')
print(confusion_matrix(Y,predictions))

fpr,tpr,threshold = roc_curve(Y,predictions)
roc_auc = auc(fpr,tpr)

print(fpr,tpr,threshold,roc_auc)

plt.figure()
plt.plot(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic(ROC Curve)')

It produces the output:

Precsion score: 0.9179487179487179
Recall score: 0.895
F1 score: 0.9063291139240507
ROC score: 0.9075
Confusion matrix: 
[[184  16]
 [ 21 179]]
[0.   0.08 1.  ] [0.    0.895 1.   ] [2 1 0] 0.9075

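You used 400 data points to compute the ROC curve, but only three points appear in the plot because your predictions contain only two unique values (0 and 1). Each unique value acts as one threshold, and one extra threshold above the maximum gives the starting point (0, 0).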
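To see where the three points come from, the middle point can be recomputed by hand from the confusion matrix printed above. This is only a sketch of the arithmetic, not scikit-learn's internal code; the variable names are my own, and it assumes the usual layout of confusion_matrix (rows are true classes, columns are predicted classes).

```python
# Counts read off the confusion matrix in the output:
# [[184  16]
#  [ 21 179]]
tn, fp = 184, 16   # true class 0: predicted 0 / predicted 1
fn, tp = 21, 179   # true class 1: predicted 0 / predicted 1

# Rates at the threshold that separates predictions 0 and 1.
fpr_mid = fp / (fp + tn)   # 16 / 200
tpr_mid = tp / (tp + fn)   # 179 / 200

# The curve starts at (0, 0), where nothing is predicted positive,
# and ends at (1, 1), where everything is predicted positive.
fpr = [0.0, fpr_mid, 1.0]
tpr = [0.0, tpr_mid, 1.0]

# Area under the piecewise-linear curve via the trapezoidal rule.
area = sum(
    (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
    for i in range(len(fpr) - 1)
)

print(fpr_mid, tpr_mid, round(area, 4))
```

The values 0.08, 0.895, and 0.9075 match the fpr, tpr, and roc_auc entries in the output, so with hard 0/1 predictions the ROC curve carries no more information than the confusion matrix itself.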

Quoting the answer here:

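The number of points depends on the number of unique values in the input. Since the input vector has only 2 unique values, the function gives the correct output.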
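The quoted statement can be illustrated with a hand-written threshold sweep. The helper roc_points below is hypothetical and the labels and scores are made up for illustration; it is a simplified sketch of the idea, not scikit-learn's implementation (roc_curve may additionally drop some intermediate points by default, see its drop_intermediate parameter).

```python
def roc_points(y_true, scores):
    """Return one (FPR, TPR) point per distinct score, plus the (0, 0) start."""
    pos = sum(y_true)              # number of actual positives
    neg = len(y_true) - pos        # number of actual negatives
    points = [(0.0, 0.0)]          # threshold above every score: no positives predicted
    # Sweep thresholds from the highest distinct score to the lowest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


# Made-up example data (not the data from the question).
y_true = [0, 0, 0, 1, 1, 1]
hard_labels = [0, 1, 0, 1, 1, 0]                 # only two unique values
probabilities = [0.1, 0.6, 0.3, 0.9, 0.7, 0.4]   # six unique values

print(len(roc_points(y_true, hard_labels)))      # 2 unique values + start point
print(len(roc_points(y_true, probabilities)))    # 6 unique values + start point
```

With hard labels the sweep yields three points; with six distinct probabilities it yields seven. To get a smoother curve for the model in the question, pass predicted probabilities or decision scores to roc_curve instead of thresholded 0/1 predictions.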
