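Problem description
I am trying to compute the ROC curve and AUC for a model trained with fastText, but I keep getting the error ValueError: Found input variables with inconsistent numbers of samples: [40,200]
My test code is as follows: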
import re

import fasttext
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def split_df(data):
    print('Loading data ...')
    labels, texts = [], []
    for line in data:
        label, text = line.split(' ', 1)
        labels.append(label)
        texts.append(text)
    trainDF = pd.DataFrame()
    trainDF['label'] = labels
    trainDF['text'] = texts
    # Encode the text before fitting, since fit() does not accept raw strings.
    count_vect = CountVectorizer()
    matrix = count_vect.fit_transform(trainDF['text'])
    encoder = LabelEncoder()
    targets = encoder.fit_transform(trainDF['label'])
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(
        matrix, targets, test_size=0.2)
    return trainX, testy


test_sentences = open('testing_proj.valid').readlines()
model = fasttext.load_model("model_testing_proj.bin")
trainX, testy = split_df(test_sentences)
# label the data
labels, probabilities = model.predict([re.sub('\n', ' ', sentence)
                                       for sentence in test_sentences])
# this call raises the ValueError shown in the traceback
auc = roc_auc_score(testy, probabilities)
print('ROC AUC=%.3f' % (auc))
# convert fasttext multilabel results to a binary classifier (probability of TRUE)
labels = list(map(lambda x: x == ['__label__nonsec-report'] or x == ['__label__sec-report'], labels))
probabilities = [probability[0] if label else (1 - probability[0])
                 for label, probability in zip(labels, probabilities)]
auc = roc_auc_score(testy, probabilities)
print('ROC AUC=%.3f' % (auc))
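Edit
The problem I cannot solve is computing ROC and AUC, because I cannot figure out how to represent the data in a DataFrame so that the test split has the same size as the list of predicted probabilities. The train_test_split method does not accept a .txt file for splitting, which is why I convert the validation data to a DataFrame first. This is where I go wrong, because I need to make sure the test split is the same size as the predicted probabilities (that is my understanding of the error; please correct me if I am wrong).
The full traceback is as follows: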
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more,but a `FastText` object which is very similar.
Loading data ...
Traceback (most recent call last):
File "/home/sultan/brclassifications/fasttext_classifications/temp_test.py",line 51,in <module>
auc = roc_auc_score(testy,probabilities)
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py",line 63,in inner_f
return f(*args,**kwargs)
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py",line 542,in roc_auc_score
return _average_binary_score(partial(_binary_roc_auc_score,
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_base.py",line 77,in _average_binary_score
return binary_metric(y_true,y_score,sample_weight=sample_weight)
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py",line 330,in _binary_roc_auc_score
fpr,tpr,_ = roc_curve(y_true,
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py",line 913,in roc_curve
fps,tps,thresholds = _binary_clf_curve(
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/metrics/_ranking.py",line 693,in _binary_clf_curve
check_consistent_length(y_true,sample_weight)
File "/home/sultan/.local/lib/python3.8/site-packages/sklearn/utils/validation.py",line 319,in check_consistent_length
raise ValueError("Found input variables with inconsistent numbers of"
ValueError: Found input variables with inconsistent numbers of samples: [40,200]
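Solution
At some point in your code you are passing two vectors to a function:
- one that contains 40 observations
- one that contains 200 observations
The function therefore does not know how to match these samples up.
The error is raised at the point where you compare the predicted values with the actual values: testy holds only the 20% test split returned by train_test_split (40 labels), while model.predict is run on every sentence in the file (200 probabilities).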
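One way to avoid the mismatch is to skip train_test_split entirely for evaluation and take the true labels from the same lines that are sent to the model, so both vectors have one entry per sentence. The following is a minimal sketch, not a definitive implementation. It assumes that every line has the form "__label__x some text", that the model has exactly two classes, and that '__label__sec-report' is the positive class (that choice is an assumption, as the question does not say which class is positive). The helper names true_label, positive_score, build_roc_inputs and fasttext_roc_auc are hypothetical.

```python
def true_label(line, positive_label):
    """Return 1 if a fastText-formatted line carries positive_label, else 0."""
    label, _text = line.split(' ', 1)
    return int(label == positive_label)


def positive_score(pred_labels, pred_probs, positive_label):
    """Convert a top-1 prediction into P(positive) for a two-class model."""
    p = float(pred_probs[0])
    # If the top label is the negative class, P(positive) is the complement.
    return p if pred_labels[0] == positive_label else 1.0 - p


def build_roc_inputs(model, lines, positive_label):
    """Build y_true and y_score from the SAME lines, so their lengths match."""
    lines = [line for line in lines if line.strip()]  # drop blank lines
    y_true = [true_label(line, positive_label) for line in lines]
    # Strip the label and the trailing newline before predicting.
    texts = [line.split(' ', 1)[1].strip() for line in lines]
    labels, probs = model.predict(texts)
    y_score = [positive_score(l, p, positive_label)
               for l, p in zip(labels, probs)]
    return y_true, y_score


def fasttext_roc_auc(model_path, data_path,
                     positive_label='__label__sec-report'):
    """Load a model and a labelled file, then return the ROC AUC."""
    # Imported here so the helpers above work without these packages.
    import fasttext
    from sklearn.metrics import roc_auc_score

    model = fasttext.load_model(model_path)
    with open(data_path) as handle:
        lines = handle.readlines()
    y_true, y_score = build_roc_inputs(model, lines, positive_label)
    return roc_auc_score(y_true, y_score)
```

With the files from the question, this would be called as fasttext_roc_auc('model_testing_proj.bin', 'testing_proj.valid'). Note that roc_auc_score needs both classes to be present in y_true.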
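May I suggest the plot_roc_curve() function?
You can import it with from sklearn.metrics import plot_roc_curve. (It was deprecated in scikit-learn 1.0 and removed in 1.2, where RocCurveDisplay takes its place.)
Then, if you press Shift + Tab in Jupyter, the documentation on how to use the function will appear.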
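Because a fastText model is not a scikit-learn estimator, a curve can also be drawn directly from precomputed labels and scores. The sketch below uses roc_curve and auc from sklearn.metrics together with matplotlib, and is offered as one possible approach under these assumptions: y_true holds 0/1 labels, y_score holds the probability of the positive class, and both were built from the same sentences. The function name plot_roc and the default output file 'roc.png' are hypothetical.

```python
def plot_roc(y_true, y_score, out_path='roc.png'):
    """Plot a ROC curve from labels and scores, save it, and return the AUC."""
    # Check lengths first: this is the condition behind the ValueError
    # in the question.
    if len(y_true) != len(y_score):
        raise ValueError(
            'y_true has %d samples but y_score has %d'
            % (len(y_true), len(y_score)))

    # Imported inside the function so the length check needs no extra packages.
    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve

    fpr, tpr, _thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, label='ROC curve (AUC = %.3f)' % roc_auc)
    ax.plot([0, 1], [0, 1], linestyle='--', label='chance')
    ax.set_xlabel('False positive rate')
    ax.set_ylabel('True positive rate')
    ax.legend(loc='lower right')
    fig.savefig(out_path)
    plt.close(fig)
    return roc_auc
```

Calling plot_roc(y_true, y_score) with vectors of equal length writes the figure to roc.png and returns the area under the curve.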