TF-IDF: how to return the top five related articles after computing cosine similarity

Question

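I have a data frame sample_df (4 columns: paper_id, title, abstract, body_text). I extracted the abstract column (each abstract is about 1000 words) and applied a text cleaning process. Here is my question: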

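After computing the cosine similarity between the question and the abstracts, how can I return the scores of the top 5 articles together with the corresponding information (for example paper_id, title, body_text)? My goal is to do TF-IDF question answering.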

I'm sorry for my poor English, and I'm new to NLP. I would appreciate any help.

from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity  

txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()

tfidf = tfidf_vector.fit_transform(txt_cleaned)

tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()

related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities[related_docs_indices]

#output([0.18986527,0.18339485,0.14951123,0.13441914]) 

Answer

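First: if you want to get 5 articles, you have to use [:-6:-1] instead of [:-5:-1], because slicing with negative values works a little differently. The stop index is excluded, so [:-5:-1] returns only 4 elements, which is why your output has 4 scores.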

Or use [::-1][:5]. The [::-1] reverses all values, and then you can use a normal [:5].
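
A quick way to check the slicing behaviour, as a small sketch. The scores array here is made up for illustration and is not the data from the question:

```python
import numpy as np

# Hypothetical similarity scores, for illustration only.
scores = np.array([0.10, 0.50, 0.30, 0.90, 0.20, 0.70, 0.40])

order = scores.argsort()      # indices sorted by ascending score

four = order[:-5:-1]          # stops before position -5, so only 4 indices
five = order[:-6:-1]          # stops before position -6, so 5 indices
five_alt = order[::-1][:5]    # reverse everything, then take the first 5

print(len(four), len(five), len(five_alt))
print(scores[five])
```

Both five and five_alt hold the same indices; the second form is usually easier to read because the number in the slice matches the number of items you want.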


Once you have related_docs_indices, you can use .iloc[] to get the matching rows from the DataFrame

 sample_df.iloc[ related_docs_indices ]

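If elements have the same similarity, this gives them in reverse order, because reversing the argsort result also reverses the order of tied elements.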
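
If tied elements should come back in their original order instead, one possible approach (not part of the original answer) is a stable argsort on the negated scores. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores with a tie between positions 0 and 3.
scores = np.array([0.6, 0.0, 0.0, 0.6, 0.5])

# Reversing an ascending argsort also reverses the order of tied elements.
reversed_top = scores.argsort(kind='stable')[::-1][:2]

# Sorting the negated scores with a stable sort keeps ties in original order.
stable_top = np.argsort(-scores, kind='stable')[:2]

print(reversed_top)
print(stable_top)
```

Either index array can be passed to sample_df.iloc[...] in the same way.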


By the way:

You can also add the similarities to the DataFrame

sample_df['similarity'] = cosine_similarities

and then sort in descending order and take the first 5 items.

sample_df.sort_values('similarity',ascending=False)[:5]

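If elements have the same similarity, this gives them in their original order in the example below. The default sort algorithm is not guaranteed to be stable, so pass kind='mergesort' to sort_values if that order matters.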
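
To return the scores together with paper_id, title and body_text in one step, the idea above could be wrapped in a small helper. This is only a sketch: top_k_related is a hypothetical name, and the demo data is made up.

```python
import numpy as np
import pandas as pd

def top_k_related(df, similarities, k=5,
                  columns=('paper_id', 'title', 'body_text')):
    """Return the k highest-scoring rows with their scores (hypothetical helper)."""
    # Work on a copy of the selected columns so the original df is not modified.
    scored = df.loc[:, list(columns)].assign(similarity=similarities)
    # mergesort is stable, so tied rows keep their original order.
    return scored.sort_values('similarity', ascending=False,
                              kind='mergesort').head(k)

# Made-up data, for illustration only.
demo_df = pd.DataFrame({
    'paper_id': [1, 2, 3],
    'title': ['A', 'B', 'C'],
    'abstract': ['a', 'b', 'c'],
    'body_text': ['text a', 'text b', 'text c'],
})
demo_scores = np.array([0.2, 0.9, 0.2])

result = top_k_related(demo_df, demo_scores, k=2)
print(result)
```

With the real data this would be called as top_k_related(sample_df, cosine_similarities), assuming cosine_similarities has one score per row of sample_df.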


Minimal working code with some data, so everyone can copy and test it.

Because I have only 5 elements in the DataFrame, I search for the top 2 elements.

from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity  

import pandas as pd

sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
})

# placeholder for the real cleaning function: returns the column unchanged
def get_cleaned_text(df,row):
    return row

txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()

tfidf = tfidf_vector.fit_transform(txt_cleaned)

tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()

sample_df['similarity'] = cosine_similarities

number = 2
#related_docs_indices = cosine_similarities.argsort()[:-(number+1):-1]
related_docs_indices = cosine_similarities.argsort()[::-1][:number]

print('index:',related_docs_indices)

print('similarity:',cosine_similarities[related_docs_indices])

print('\n--- related_docs_indices ---\n')

print(sample_df.iloc[related_docs_indices])

print('\n--- sort_values ---\n')

print( sample_df.sort_values('similarity',ascending=False)[:number] )

Result:

index: [3 0]
similarity: [0.62791376 0.62791376]

--- related_docs_indices ---

   paper_id          title abstract            body_text  similarity
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
0         1        Covid19  covid19        Hello covid19    0.627914

--- sort_values ---

   paper_id          title abstract            body_text  similarity
0         1        Covid19  covid19        Hello covid19    0.627914
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
