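Problem description
I am supposed to write a Python program that extracts keywords from a message I receive from a user, so that I can suggest music albums based on the characteristics described in the message itself.
My idea is to compute tf-idf on the user's message, using as the dataset a .csv file that contains many music album reviews (in the form "album", "artist", "review").
To do this, I created a function that returns the word_count_vector, defined as follows: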
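import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import preprocessing  # the author's own module, which defines clear_text

def get_word_c_vec():
    df = pd.read_csv("reviews.csv", delimiter=',')
    df['review'] = df['review'].apply(lambda x: preprocessing.clear_text(x))
    stopwd = stopwords.words('english')
    docs = df['review'].tolist()
    cv = CountVectorizer(max_df=0.85, stop_words=stopwd)
    # cv.fit_transform() generates a document-term matrix (one row per review)
    word_c_vec = cv.fit_transform(docs)
    return word_c_vec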
The function returns an object of type
<class 'scipy.sparse.csr.csr_matrix'>
whose format is:
(0,99991) 15
(0,97120) 7
(0,8916) 8
(0,170276) 1
(0,50353) 2
(0,170380) 3
(0,107333) 3
(0,168593) 2
.. ..
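Each line of that output is a (row, column) pair followed by a count: the row is the index of the review and the column is the index of a term in the vectorizer's vocabulary. As a minimal sketch of how the column indices map back to words, here is a made-up two-review corpus (the corpus and the variable names are assumptions for illustration, not the real reviews.csv):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["jazz jazz album", "rock album"]  # hypothetical two-review corpus
cv = CountVectorizer()
matrix = cv.fit_transform(docs)  # sparse matrix, one row per document

# Invert the fitted vocabulary: column index -> term.
index_to_term = {idx: term for term, idx in cv.vocabulary_.items()}

# Walk the non-zero entries as (document index, term, count) triples.
coo = matrix.tocoo()
entries = sorted(
    (int(r), index_to_term[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
)
for doc_idx, term, count in entries:
    print(doc_idx, term, count)
```

Note that the mapping lives on the fitted CountVectorizer object (cv), not on the matrix it returns.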
As you can see, I preprocess the data with the clear_text method. In case it is useful, here is its implementation:
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def clear_text(txt):
    # Tokenization
    tokens = word_tokenize(txt)
    # Lowercase conversion
    tokens = [w.lower() for w in tokens]
    # Removing punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # Deleting all non-words
    final_wds = [w for w in stripped if w.isalpha()]
    # Removing stopwords
    stop_wd = set(stopwords.words('english'))
    final_wds = [w for w in final_wds if w not in stop_wd]
    # Lemmatization
    lemtz = WordNetLemmatizer()
    final_wds = [lemtz.lemmatize(w) for w in final_wds]
    # Rebuild a single string, one space after each term
    final_text = []
    for term in final_wds:
        final_text.append(term + " ")
    last = ''.join(map(str, final_text))
    return last
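To make the effect of these preprocessing steps easier to see without installing the NLTK corpora, here is a rough standard-library approximation. It is only a sketch: the regex tokenizer, the tiny STOPWORDS set and the name clear_text_simple are assumptions for illustration, and there is no lemmatization step.

```python
import re

# Tiny illustrative stopword set (an assumption; the question uses NLTK's English list).
STOPWORDS = {"i", "d", "to", "have", "an", "a", "the"}

def clear_text_simple(txt):
    """Lowercase, keep alphabetic runs only, drop stopwords, join with spaces."""
    # The regex both tokenizes and discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", txt.lower())
    words = [w for w in tokens if w not in STOPWORDS]
    return " ".join(words)

cleaned = clear_text_simple("Hello I'd like to have an experimental jazz album")
print(cleaned)
```

Whichever version is used, the result is a plain str, not a list or a pandas object.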
from sklearn.feature_extraction.text import TfidfTransformer

def compute_tf_idf(word_c_vec, message):
    tf_id_transform = TfidfTransformer(smooth_idf=True, use_idf=True)
    tf_id_transform.fit(word_c_vec)
    stopwd = stopwords.words('english')
    cv = CountVectorizer(max_df=0.85, stop_words=stopwd)
    feature_names = cv.get_feature_names()
    print(feature_names)
    message = preprocessing.clear_text(message)
    message = message.tolist()
    tf_idf_vector = tf_id_transform.transform(message)
    sorted_items = sort_coo(tf_idf_vector.tocoo())
    keywords = get_topn(feature_names, sorted_items, 10)
    # Now print the results
    print("\nMessage:")
    print(message)
    print("\nKeywords:")
    for k in keywords:
        print(k, keywords[k])
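The helpers sort_coo and get_topn are called above but not shown in the question. The following is only a guess at what they might look like, assuming sort_coo receives a SciPy COO matrix and get_topn returns a dict from term to score; the stand-in object at the bottom exists purely so the sketch runs without SciPy.

```python
from types import SimpleNamespace

def sort_coo(coo_matrix):
    """Return (column, score) pairs sorted by score, highest first."""
    pairs = zip(coo_matrix.col, coo_matrix.data)
    return sorted(pairs, key=lambda p: (p[1], p[0]), reverse=True)

def get_topn(feature_names, sorted_items, topn=10):
    """Map the top-n (column, score) pairs to {term: rounded score}."""
    return {
        feature_names[col]: round(score, 3)
        for col, score in sorted_items[:topn]
    }

# Stand-in for a COO matrix: only the .col and .data attributes are used.
fake = SimpleNamespace(col=[0, 2, 1], data=[0.2, 0.9, 0.5])
names = ["album", "experimental", "jazz"]
top = get_topn(names, sort_coo(fake), topn=2)
print(top)
```

Both helpers depend on feature_names being indexed by the same columns as the tf-idf vector, which is why the names have to come from the vectorizer that was actually fitted.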
To test the code I wrote:
vec = get_word_c_vec()
compute_tf_idf(vec, message="Hello I'd like to have an experimental jazz album")
But running it gave me this error:
Traceback (most recent call last):
  File ".../SongsBot/tf_idf.py", line 93, in <module>
    compute_tf_idf(vec, message="Hello I'd like to have an experimental jazz album")
  File ".../SongsBot/tf_idf.py", line 73, in compute_tf_idf
    feature_names = cv.get_feature_names()
  File "...\SongsBot\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1295, in get_feature_names
    self._check_vocabulary()
  File "...\SongsBot\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 467, in _check_vocabulary
    raise NotFittedError("Vocabulary not fitted or provided")
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
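Unfortunately, I have only recently started working in NLP, so I apologize in advance if I have made any big mistakes. Thank you in advance for your courtesy and help.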
Solution
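The traceback points at cv.get_feature_names(): compute_tf_idf builds a brand-new CountVectorizer that has never been fitted, so it has no vocabulary, while the vectorizer that was fitted inside get_word_c_vec is discarded when that function returns. Two further problems would likely surface next: clear_text returns a str, which has no tolist method, and the TfidfTransformer expects a count matrix rather than raw text. One possible way around all three, sketched on a tiny made-up corpus (the corpus and the names fit_corpus and extract_keywords are assumptions, not the author's code), is to fit once, return the fitted objects, and reuse them for the message:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tiny stand-in corpus (hypothetical); the question loads reviews.csv instead.
docs = [
    "experimental jazz album with free improvisation",
    "classic rock album with loud guitars",
    "smooth jazz record for quiet evenings",
]

def fit_corpus(docs):
    """Fit the vectorizer and the tf-idf transformer once and return both."""
    cv = CountVectorizer(max_df=0.85, stop_words="english")
    counts = cv.fit_transform(docs)
    tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
    tfidf.fit(counts)
    return cv, tfidf

def extract_keywords(cv, tfidf, message, topn=10):
    """Score the message's terms with the already-fitted cv and tfidf."""
    # Reuse the fitted vectorizer; transform() expects a list of documents.
    counts = cv.transform([message])
    weights = tfidf.transform(counts).tocoo()
    # Newer scikit-learn releases expose get_feature_names_out();
    # fall back to get_feature_names() on older ones.
    if hasattr(cv, "get_feature_names_out"):
        names = cv.get_feature_names_out()
    else:
        names = cv.get_feature_names()
    ranked = sorted(zip(weights.col, weights.data),
                    key=lambda p: p[1], reverse=True)
    return {names[col]: round(float(score), 3) for col, score in ranked[:topn]}

cv, tfidf = fit_corpus(docs)
keywords = extract_keywords(
    cv, tfidf, "Hello I'd like to have an experimental jazz album"
)
print(keywords)
```

In the real program the message would first go through clear_text, and fit_corpus would be fed the preprocessed reviews. Words in the message that never occur in the reviews are ignored, since they are not in the fitted vocabulary.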