How can I extend the stop-word list with bigrams?

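Problem description

I want to use TfidfVectorizer to extract bigrams, but extending the stop-word list with bigrams has no effect: the bigrams I add are not filtered out. How can I fix this?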

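from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
import pandas as pd

content = CORPUS  # the collection of documents (not shown in the question)
# Built-in English stop words plus three bigrams that should also be ignored;
# passed as a list, which is the type the stop_words parameter documents
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(['don know', 'good morning', 'happy birthday']))

vectorizer = TfidfVectorizer(stop_words=my_stop_words, max_features=25, ngram_range=(2, 2))
X = vectorizer.fit_transform(content).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
df.to_csv('test.csv')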

I get this warning, and the output does not change:

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['birthday','don',...] not in stop_words.

Solution


nlp python stop-words tf-idf tfidfvectorizer