fasttext ROC and AUC problem for binary classification

Problem description

I am trying to compute the ROC and AUC for a trained fasttext model, but I keep getting the error ValueError: Found input variables with inconsistent numbers of samples: [40, 200]

My test code is as follows:

def split_df(data):
    print('Loading data ...')
    labels, texts = ([], [])
    for line in data:
        label, text = line.split(' ', 1)
        labels.append(label)
        texts.append(text)

    trainDF = pd.DataFrame()
    trainDF['label'] = labels
    trainDF['text'] = texts

    # Encode the text before fitting; fit() does not accept raw strings,
    # so the texts go through CountVectorizer and the labels through LabelEncoder.
    count_vect = CountVectorizer()
    matrix = count_vect.fit_transform(trainDF['text'])
    encoder = LabelEncoder()
    targets = encoder.fit_transform(trainDF['label'])

    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(
        matrix, targets, test_size=0.2)

    return trainX, testy

test_sentences = open('testing_proj.valid').readlines()

model = fasttext.load_model("model_testing_proj.bin")
trainX,testy = split_df(test_sentences)

# label the data
labels, probabilities = model.predict([re.sub('\n', ' ', sentence)
                                       for sentence in test_sentences])
auc = roc_auc_score(testy, probabilities)
print('ROC AUC=%.3f' % (auc))

# convert fasttext multilabel results to a binary classifier (probability of TRUE)
labels = list(map(lambda x: x == ['__label__nonsec-report'] or x == ['__label__sec-report'], labels))
probabilities = [probability[0] if label else (1 - probability[0])
                 for label, probability in zip(labels, probabilities)]

auc = roc_auc_score(testy, probabilities)
print('ROC AUC=%.3f' % (auc))

Edit: The part I cannot solve is the ROC/AUC computation itself, because I cannot work out how to represent the data in a dataframe so that the test split has the same size as the list of predicted probabilities. The train_test_split method does not accept a raw .txt file, which is why I convert the validation data into dataframe format. That is where I go wrong, because I need to make sure the test split is the same size as the predicted probabilities (that is my understanding of the error; please correct me if I am wrong).
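For reference, keeping the two arrays the same length can be sketched like this (toy texts and placeholder scores stand in for my real sentences and the fasttext model; the point is only that the labels and the scores come from the same split):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy stand-ins for the real sentences and labels (20 samples, 2 classes).
texts = ['sentence %d' % i for i in range(20)]
targets = [i % 2 for i in range(20)]

# Split the raw texts and the labels *together*, so the held-out pieces
# stay aligned; stratify keeps both classes in the test split.
train_texts, test_texts, trainy, testy = train_test_split(
    texts, targets, test_size=0.2, random_state=0, stratify=targets)

# Score only the held-out texts. A placeholder stands in for
# model.predict(test_texts); with fasttext the prediction call would go here.
probabilities = [0.9 if y == 1 else 0.1 for y in testy]

assert len(testy) == len(probabilities)   # same length by construction
print('ROC AUC=%.3f' % roc_auc_score(testy, probabilities))
```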

The full traceback is as follows:

Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
Loading data ...
Traceback (most recent call last):
  File "/home/sultan/brclassifications/fasttext_classifications/temp_test.py", line 51, in <module>
    auc = roc_auc_score(testy, probabilities)
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 542, in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_base.py", line 77, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 330, in _binary_roc_auc_score
    fpr, tpr, _ = roc_curve(y_true,
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 913, in roc_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 693, in _binary_clf_curve
    check_consistent_length(y_true, sample_weight)
  File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 319, in check_consistent_length
    raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [40, 200]

Solution

At some point in your code you are passing two vectors:

  1. one containing 40 observations
  2. one containing 200 observations

so the function does not know how to reconcile the samples.

The error is most likely raised at the point where you compare the predicted values with the actual values.
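The mismatch is easy to reproduce in isolation (a minimal sketch; the array lengths 40 and 200 mirror the ones in your traceback):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.tile([0, 1], 20)      # 40 actual labels
y_score = np.linspace(0, 1, 200)  # 200 predicted probabilities

try:
    roc_auc_score(y_true, y_score)
except ValueError as e:
    msg = str(e)
    print(msg)  # the same "inconsistent numbers of samples" error
```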

May I suggest using the plot_roc_curve() function.

You can import it with from sklearn.metrics import plot_roc_curve

Then, if you press Shift + Tab in Jupyter, instructions on how to use the function will appear.
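Note that plot_roc_curve was deprecated in scikit-learn 1.0 and removed in 1.2 (RocCurveDisplay replaces it); the underlying roc_curve function gives the same numbers without plotting. A minimal sketch with toy labels and scores standing in for testy and probabilities:

```python
from sklearn.metrics import roc_curve, auc

# Toy aligned labels and scores (stand-ins for testy and probabilities).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve returns the points of the curve; auc integrates under them.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print('ROC AUC=%.3f' % roc_auc)
```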