NLP Calculate the similarity of any two articles resume version

 

 

Calculate the similarity of any two course

-Implement input a courSara course to find the top 10 courses similar to it by using python 

-Calculated the similarity from the text content of the course by using NLP technology Topic model LSI.

 

-Utilized :Python,NLTK,genius

 

 

 

基于 语料库当年为了保研究生 向tsinghua university apply the master degree 

 

就是将两个"article"的文本内容映射到topic的维度,然后再计算其相似度。

 

 

download: https://www.aminer.cn/citation

 

Calculate the similarity of any two articles

基于关键词 :https://www.jiqizhixin.com/articles/2018-11-14-17

有很多种方法可以提取出关键词

tf-idf算出词频 然后作为文本的向量

 

 

 

基于全量的英文维基百科(400多万文章,压缩后9个多G的语料)

基于LSI模型的课程索引建立完毕,我们以Andrew Ng教授的机器学习公开课为例,这门课程在我们的coursera_corpus文件的第211行,也就是:

>>> print courses_name[210]
Machine Learning

现在我们就可以通过lsi模型将这门课程映射到10个topic主题模型空间上,然后和其他课程计算相似度:
>>> ml_course = texts[210]
>>> ml_bow = dicionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, 8.3270084238788673), (1, 0.91295652151975082), (2, -0.28296075112669405), (3, 0.0011599008827843801), (4, -4.1820134980024255), (5, -0.37889856481054851), (6, 2.0446999575052125), (7, 2.3297944485200031), (8, -0.32875594265388536), (9, -0.30389668455507612)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])

取按相似度排序的前10门课程:
>>> print sort_sims[0:10]
[(210, 1.0), (174, 0.97812241), (238, 0.96428639), (203, 0.96283489), (63, 0.9605484), (189, 0.95390636), (141, 0.94975704), (184, 0.94269753), (111, 0.93654782), (236, 0.93601125)]

第一门课程是它自己:
>>> print courses_name[210]
Machine Learning

第二门课是Coursera上另一位大牛Pedro Domingos机器学习公开课
>>> print courses_name[174]
Machine Learning

第三门课是Coursera的另一位创始人,同样是大牛的daphne Koller教授的概率图模型公开课
>>> print courses_name[238]
Probabilistic Graphical Models

第四门课是另一位超级大牛Geoffrey Hinton的神经网络公开课,有同学评价是Deep Learning的必修课。
>>> print courses_name[203]
Neural Networks for Machine Learning

输入到住lsi主题模型中

# -*- coding:gb2312 -*-
from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize
from nltk.corpus import brown




//文本做一些格式上的处理 courses=[] temp="" for line in file('aaa'): if(line!="\n"): temp =temp+line.strip()+"\t" else: courses.append(temp) temp="" courses_name = [] for course in courses: x=course.strip().split('\t') courses_name.append(x[0].strip('#*')) print courses_name[0:3] document=['#*AD','ADdd'] document=document[0].decode('utf-8').lower() print document
//将文本分词 texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses] print texts_tokenized[0]
//去掉停用词 from nltk.corpus import stopwords english_stopwords = stopwords.words('english') texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized] print texts_filtered_stopwords[0] english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%'] //词根
from nltk.stem.lancaster import Lancasterstemmer texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords] print texts_filtered[0] st = Lancasterstemmer() texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered] print texts_stemmed[0] all_stems = sum(texts_stemmed, []) stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 2)#去掉次数为2的低频词汇 texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed] print texts f
rom gensim import corpora, models, similarities import logging logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) // 构造词典 dictionary = corpora.Dictionary(texts) corpus = [dictionary.doc2bow(text) for text in texts]
//计算tf-idf tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]‘
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10) index = similarities.MatrixSimilarity(lsi[corpus]) ml_course = texts[15] ml_bow = dictionary.doc2bow(ml_course) ml_lsi = lsi[ml_bow] print ml_lsi sims = index[ml_lsi] sort_sims = sorted(enumerate(sims), key=lambda item: -item[1]) print sort_sims[1:11] #print courses_name[23

  

 

相关文章

python方向·数据分析   ·自然语言处理nlp   案例:中...
原文地址http://blog.sina.com.cn/s/blog_574a437f01019poo....
ptb数据集是语言模型学习中应用最广泛的数据集,常用该数据集...
 Newtonsoft.JsonNewtonsoft.Json是.Net平台操作Json的工具...
NLP(NaturalLanguageProcessing)自然语言处理是人工智能的一...
做一个中文文本分类任务,首先要做的是文本的预处理,对文本...