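How to resolve CountVectorizer ValueError: empty vocabulary; perhaps the documents only contain stop words?

Problem description

I am trying to compute tf-idf using CountVectorizer, but I get the following error: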

ValueError                                Traceback (most recent call last)
     13     return tf_idf, count
     14
---> 15 tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))

in c_tf_idf(documents, m, ngram_range)
      3
      4 def c_tf_idf(documents, m, ngram_range=(1, 1)):
----> 5     count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
      6     t = count.transform(documents).toarray()
      7     w = t.sum(axis=1)

Here is what I have done so far: in the previous step, I joined all documents that belong to the same cluster using the code below.

docs_df = pd.DataFrame(data,columns=["Doc"])
docs_df['Topic'] = cluster.labels_
docs_df['Doc_ID'] = range(len(docs_df))
docs_per_topic = docs_df.dropna(subset=['Doc']).groupby(['Topic'],as_index = False).agg({'Doc': ' '.join})
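
To make the shape of docs_per_topic concrete, here is a minimal sketch of the same join on toy data. The `data` and `labels` values are hypothetical stand-ins (`labels` plays the role of `cluster.labels_`); the real inputs are not shown in the question.

```python
import pandas as pd

# Hypothetical toy inputs; None stands for a missing document
data = ["cats purr", "dogs bark", "cats meow", None]
labels = [0, 1, 0, 1]  # stand-in for cluster.labels_

docs_df = pd.DataFrame(data, columns=["Doc"])
docs_df["Topic"] = labels
docs_df["Doc_ID"] = range(len(docs_df))

# Drop missing documents, then concatenate the texts of each topic into one string
docs_per_topic = (
    docs_df.dropna(subset=["Doc"])
    .groupby(["Topic"], as_index=False)
    .agg({"Doc": " ".join})
)
print(docs_per_topic)
```

Each row of the result holds one topic and a single long string, so CountVectorizer later sees one "document" per topic rather than one per original text.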

Now I want to compute tf-idf using CountVectorizer. For that, I am using the following code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(documents,m,ngram_range=(1,1)):
    count = CountVectorizer(ngram_range=ngram_range,stop_words="english").fit(documents)
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)
    tf = np.divide(t.T,w)
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m,sum_t)).reshape(-1,1)
    tf_idf = np.multiply(tf,idf)

    return tf_idf,count
  
tf_idf,count = c_tf_idf(docs_per_topic.Doc.values,m=len(data))
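
One way to narrow this down is to check what survives tokenization and stop-word removal. scikit-learn raises this ValueError when the fitted vocabulary is empty, that is, when no document contributes a single token. The sketch below uses `CountVectorizer.build_analyzer()` to list documents that produce no tokens; the helper name `empty_after_analysis` and the toy strings are assumptions for illustration, not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

def empty_after_analysis(documents, ngram_range=(1, 1)):
    """Return indices of documents that yield no tokens after stop-word removal."""
    analyzer = CountVectorizer(
        ngram_range=ngram_range, stop_words="english"
    ).build_analyzer()
    return [i for i, doc in enumerate(documents) if len(analyzer(doc)) == 0]

# Hypothetical toy input: only the last string contains non-stop-word tokens
toy_docs = ["the and of", "", "clustering groups similar documents"]
print(empty_after_analysis(toy_docs))

# Fitting on documents that all come out empty reproduces the error
try:
    CountVectorizer(stop_words="english").fit(toy_docs[:2])
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

If every index of docs_per_topic.Doc.values is reported as empty, the input text itself is the likely culprit (for example empty strings, or text the default token pattern does not match), rather than the tf-idf arithmetic.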

I also looked at a similar question, "Python TfidfVectorizer throwing : empty vocabulary; perhaps the documents only contain stop words", and tried to apply the solution given there (using split), but that produces another error.

count = CountVectorizer(ngram_range=ngram_range,stop_words="english").fit(documents.split('\n'))

AttributeError: 'numpy.ndarray' object has no attribute 'split'

Any solution or suggestion for this problem would be greatly appreciated.

Thanks in advance!
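
The AttributeError follows from the types involved: `split` is a method of `str`, while `documents` here is a NumPy array of strings. `fit` accepts any iterable of strings, so the array can be passed as it is; splitting only makes sense per element. A minimal sketch, with a hypothetical helper `split_into_lines` and toy input:

```python
import numpy as np

def split_into_lines(documents):
    """Split every document on newlines and flatten the result into one list."""
    return [line for doc in documents for line in doc.split("\n")]

# Hypothetical toy array; the real docs_per_topic.Doc.values is not shown
documents = np.array(["first line\nsecond line", "third line"])
lines = split_into_lines(documents)
print(lines)
```

Note that splitting changes what counts as a document, so the rows of the resulting count matrix would no longer correspond to topics. It explains the second error but is unlikely to fix the empty vocabulary on its own.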


countvectorizer python tensorflow tf-idf