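Problem description
I have a DataFrame sample_df with 4 columns: paper_id, title, abstract, body_text. I extracted the abstract column (each abstract is about 1000 words) and applied a text-cleaning step. Here is my question:
After computing the cosine similarity between the question and the abstracts, how can I return the scores of the top 5 articles together with the corresponding information (for example paper_id, title, body_text)? My goal is TF-IDF question answering.
I'm sorry for my poor English, and I'm new to NLP. I would appreciate any help.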
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']
tfidf_vector = TfidfVectorizer()
tfidf = tfidf_vector.fit_transform(txt_cleaned)
tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()
related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities[related_docs_indices]
#output([0.18986527,0.18339485,0.14951123,0.13441914])
Solution
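First: if you want 5 articles, you have to use [:-6:-1] instead of [:-5:-1]. With a negative step the stop index is still exclusive, so [:-5:-1] stops one element early and returns only 4 items (which is why the output above has 4 scores).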
Alternatively, use [::-1][:5]: [::-1] reverses all the values, and then you can apply an ordinary [:5].
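The difference between the two slices is easy to check on a small array. The following is a minimal sketch with made-up scores (the values and variable names are illustrative, not the poster's data):

```python
import numpy as np

# Made-up similarity scores for 7 documents (illustrative only).
scores = np.array([0.10, 0.50, 0.30, 0.90, 0.20, 0.70, 0.40])

order = scores.argsort()        # indices sorted by ascending score

four = order[:-5:-1]            # stop index -5 is exclusive: only 4 items
five = order[:-6:-1]            # stop index -6 is exclusive: 5 items
also_five = order[::-1][:5]     # reverse everything, then take the first 5

print(len(four), len(five), len(also_five))
print(five.tolist())
print(also_five.tolist())
```

Both five and also_five hold the same indices, so the second form is mostly a readability choice: the number in the slice matches the number of items wanted.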
Once you have related_docs_indices, you can use .iloc[] to get the matching rows from the DataFrame:
sample_df.iloc[ related_docs_indices ]
If several elements have the same similarity, this returns them in reverse of their original order.
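The tie ordering can be checked on a small array. This is a sketch with made-up scores; it passes kind="stable" to argsort, which is an assumption added here so that the order of tied elements is deterministic rather than dependent on the default sort algorithm. Negating the scores before sorting is one way to keep ties in their original order instead.

```python
import numpy as np

# Made-up scores: documents 0 and 3 tie for first place.
scores = np.array([0.6, 0.0, 0.0, 0.6, 0.3])
n = 2

# Ascending stable sort, then reversed: ties come out in reversed order.
reversed_ties = np.argsort(scores, kind="stable")[::-1][:n]

# Stable sort of the negated scores: ties keep their original order.
original_ties = np.argsort(-scores, kind="stable")[:n]

print(reversed_ties.tolist())
print(original_ties.tolist())
```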
By the way: you can also add the similarities to the DataFrame
sample_df['similarity'] = cosine_similarities
and then sort in descending order and take the first 5 rows.
sample_df.sort_values('similarity',ascending=False)[:5]
If several elements have the same similarity, this returns them in their original order.
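A small sketch of the sort_values route, using a made-up DataFrame rather than the poster's data. The kind="stable" argument is an addition that is not in the answer's code: the default sort algorithm does not promise any particular order for ties, so a stable sort is requested explicitly here. The column selection at the end is just one way to show the score next to the article fields.

```python
import pandas as pd

# Made-up data: papers 1 and 3 have the same similarity.
df = pd.DataFrame({
    "paper_id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "similarity": [0.6, 0.0, 0.6, 0.3],
})

n = 2
# Descending stable sort: tied rows stay in their original order.
top = df.sort_values("similarity", ascending=False, kind="stable").head(n)

# Keep only the columns of interest, score included.
print(top[["paper_id", "title", "similarity"]])
```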
Minimal working code with some data, so everyone can copy and test it.
Because I have only 5 elements in the DataFrame, I search for the top 2.
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
})
def get_cleaned_text(df, row):
    # stand-in for the real cleaning step: returns the column unchanged
    return row
txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']
tfidf_vector = TfidfVectorizer()
tfidf = tfidf_vector.fit_transform(txt_cleaned)
tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()
sample_df['similarity'] = cosine_similarities
number = 2
#related_docs_indices = cosine_similarities.argsort()[:-(number+1):-1]
related_docs_indices = cosine_similarities.argsort()[::-1][:number]
print('index:',related_docs_indices)
print('similarity:',cosine_similarities[related_docs_indices])
print('\n--- related_docs_indices ---\n')
print(sample_df.iloc[related_docs_indices])
print('\n--- sort_values ---\n')
print( sample_df.sort_values('similarity',ascending=False)[:number] )
Result:
index: [3 0]
similarity: [0.62791376 0.62791376]

--- related_docs_indices ---

   paper_id          title abstract            body_text  similarity
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
0         1        Covid19  covid19        Hello covid19    0.627914

--- sort_values ---

   paper_id          title abstract            body_text  similarity
0         1        Covid19  covid19        Hello covid19    0.627914
3         4  Covid19 Again  covid19  Hello covid19 again    0.627914