sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided - TF-IDF Python problem

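Problem description

I need to write a Python program that extracts keywords from a message I receive from a user, so that I can suggest music albums based on the characteristics described in the message itself.

My intention is to compute tf-idf on the user's message using a .csv file as the dataset; the file contains reviews of many music albums (in the form "album", "artist", "review").

To do this, I created a function that returns a word_count_vector, defined as follows: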

def get_word_c_vec():
    df = pd.read_csv("reviews.csv",delimiter=',')
    df['review'] = df['review'].apply(lambda x: preprocessing.clear_text(x))
    stopwd = stopwords.words('english')
    docs = df['review'].tolist()
    
    cv = CountVectorizer(max_df=0.85,stop_words=stopwd)

    # cv.fit_transform() learns the vocabulary and returns a document-term matrix
    word_c_vec = cv.fit_transform(docs)

    return word_c_vec

As you can see, this method returns an object of type <class 'scipy.sparse.csr.csr_matrix'>, which looks like this:

(0,99991)  15
(0,97120)  7
(0,8916)   8
(0,170276) 1
(0,50353)  2
(0,170380) 3
(0,107333) 3
(0,168593) 2
 ..  ..
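
Each line of that printout is one non-zero cell of the document-term matrix: a (row, column) pair followed by a count, where the row is the index of a review and the column is the index of a term in the vectorizer's vocabulary. A minimal sketch on a made-up two-document corpus (the documents below are illustrative and not taken from reviews.csv) shows one way to map the column indices back to terms:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus, NOT the data from reviews.csv
docs = ["jazz jazz piano", "rock guitar piano"]

cv = CountVectorizer()
matrix = cv.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps term -> column index; invert it to look terms up by column
index_to_term = {int(col): term for term, col in cv.vocabulary_.items()}

# Walk over the non-zero cells as (row, column, count) triples
coo = matrix.tocoo()
cells = [
    (int(row), index_to_term[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for row, term, count in cells:
    print(f"review {row}: {term!r} appears {count} time(s)")
```

Note that the mapping lives on the fitted CountVectorizer, not on the matrix, so the matrix alone cannot tell which word a column stands for.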

As you can see, I preprocess the data with the clear_text method. In case it is useful, here is its implementation:

def clear_text(txt):
    # Tokenization
    tokens = word_tokenize(txt)

    # Lowercase conversion
    tokens = [w.lower() for w in tokens]

    # Removing punctuation
    table = str.maketrans('','',string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    # deleting all non-words
    final_wds = [w for w in stripped if w.isalpha()]

    # removing stopwords
    stop_wd = set(stopwords.words('english'))
    final_wds = [w for w in final_wds if w not in stop_wd]

    # lemmatizer
    lemtz = WordNetLemmatizer()
    final_wds = [lemtz.lemmatize(w) for w in final_wds]
    final_text = []

    for term in final_wds:
        final_text.append(term + " ")

    last = ''.join(map(str,final_text))

    return last

At this point I defined the function that computes the tf-idf of the user's message:

def compute_tf_idf(word_c_vec,message):
    tf_id_transform = TfidfTransformer(smooth_idf=True,use_idf=True)
    tf_id_transform.fit(word_c_vec)

    stopwd = stopwords.words('english')
    cv = CountVectorizer(max_df=0.85,stop_words=stopwd)
    feature_names = cv.get_feature_names()
    print(feature_names)
    message = preprocessing.clear_text(message)
    message = message.tolist()
    tf_idf_vector = tf_id_transform.transform(message)

    sorted_items = sort_coo(tf_idf_vector.tocoo())

    keywords = get_topn(feature_names,sorted_items,10)

    # now print the results
    print("\nMessage:")
    print(message)
    print("\nKeywords:")
    for k in keywords:
        print(k,keywords[k])

To test the code I wrote:

vec = get_word_c_vec()

compute_tf_idf(vec,message="Hello I'd like to have an experimental jazz album")

But running it gives me this error:

Traceback (most recent call last):
  File ".../SongsBot/tf_idf.py",line 93,in <module>
    compute_tf_idf(vec,message="Hello I'd like to have an experimental jazz album")
  File ".../SongsBot/tf_idf.py",line 73,in compute_tf_idf
    feature_names = cv.get_feature_names()
  File "...\SongsBot\venv\lib\site-packages\sklearn\feature_extraction\text.py",line 1295,in get_feature_names
    self._check_vocabulary()
  File "...\SongsBot\venv\lib\site-packages\sklearn\feature_extraction\text.py",line 467,in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

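Unfortunately, I have only recently started working in NLP, so I apologize in advance if I have made some big mistakes. Thank you in advance for your kindness and your help.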

Solution
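
The traceback points at cv.get_feature_names() inside compute_tf_idf. That function builds a brand-new CountVectorizer and never calls fit or fit_transform on it, so it has no vocabulary, which is exactly what NotFittedError reports. The vectorizer that was fitted inside get_word_c_vec is discarded when that function returns. Two further lines would likely fail once this is fixed: a Python str has no tolist() method, and TfidfTransformer.transform expects a count matrix rather than raw text.

What follows is a minimal sketch of one possible fix, not a definitive implementation. Assumptions: three made-up reviews stand in for reviews.csv, scikit-learn's built-in stop_words="english" replaces the NLTK list, the clear_text preprocessing is left out, and fit_corpus, feature_names_of and top_keywords are hypothetical names (top_keywords stands in for sort_coo and get_topn, which the question does not show).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def fit_corpus(docs):
    """Fit ONE CountVectorizer and ONE TfidfTransformer and return both."""
    cv = CountVectorizer(max_df=0.85, stop_words="english")
    counts = cv.fit_transform(docs)  # learns the vocabulary
    tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
    tfidf.fit(counts)                # learns the IDF weights
    return cv, tfidf


def feature_names_of(cv):
    """Return the fitted terms on both newer and older scikit-learn releases."""
    if hasattr(cv, "get_feature_names_out"):
        return cv.get_feature_names_out()
    return cv.get_feature_names()  # older releases, as in the traceback


def top_keywords(cv, tfidf, message, n=10):
    """Score a single message with the ALREADY FITTED vectorizer and transformer."""
    # transform() expects an iterable of documents, so wrap the string in a list
    counts = cv.transform([message])
    scores = tfidf.transform(counts).tocoo()
    names = feature_names_of(cv)
    # Highest score first; ties are broken by column index
    ranked = sorted(zip(scores.col, scores.data), key=lambda p: (-p[1], p[0]))
    return {names[col]: round(float(score), 3) for col, score in ranked[:n]}


# Illustrative stand-in for the 'review' column of reviews.csv
docs = [
    "an experimental jazz album with long improvised solos",
    "a pop album full of catchy hooks",
    "a heavy rock album with loud guitars",
]

cv, tfidf = fit_corpus(docs)
keywords = top_keywords(cv, tfidf, "Hello I'd like to have an experimental jazz album")
print(keywords)
```

In terms of the original code, this would correspond to having get_word_c_vec return cv together with word_c_vec and passing that same cv into compute_tf_idf, instead of constructing a second CountVectorizer there. Words in the message that never occur in the reviews are simply ignored, because the vocabulary is fixed at fit time.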


countvectorizer nlp python tf-idf