TF-IDF tokenizer with two-word phrases always returns the first value

Problem description

I am trying to build a tokenizer from this corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [["ALZHEIMER'S disEASE"],["LFACTORY"],["AGING"],["EEG"],["COGNITIVE CONTROL"]]

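The corpus contains both single words and two-word phrases. TfidfVectorizer's default tokenization splits two-word phrases into separate words, so I tried the following: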

def identity_tokenizer(text): return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer,lowercase=False)
txt_fitted = tfidf.fit(corpus)

Even though I look up words taken from the corpus itself, it always returns the first value.

i = 2
print('index: ' + str(i))
feature_name = tfidf.get_feature_names()[i]
print('value in index: ' + feature_name)

a = txt_fitted.transform([feature_name]).toarray()

print('argmax: ' + str(a.argmax()))
print('argmax value: ' + tfidf.get_feature_names()[a.argmax()])

Result:

index: 2
value in index: COGNITIVE CONTROL
argmax: 0
argmax value: AGING

What should I do?

Solution

