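Problem description
I am working with a corpus of 180 movie review documents written by two reviewers. Each document is one reviewer's review of a single movie. The first 80 reviews were written by Berardinelli and the remaining 100 by Schwartz. I have already computed the mutual information between one specific word and the two authors. Now I need to find the ten words that have the highest mutual information with the document author, together with their mutual information values, and explain (in Python comments) what it means for a word to have high mutual information with the document author. Can anyone help? The code below computes the mutual information between the word "director" and the two authors.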
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/xniu2/Desktop/PyData/MovieReviews'
filelists = PlaintextCorpusReader(corpus_root,'.*',encoding='latin-1')
filelists.fileids()
reviews = []
for fileid in filelists.fileids():
    reviews.append(filelists.raw(fileid))
import shorttext
preprocess = shorttext.utils.standard_text_preprocessor_1()
corpus = [preprocess(article).split(' ') for article in reviews]
dtm = shorttext.utils.DocumentTermMatrix(corpus,docids = filelists.fileids())
corpus
dtm.get_token_occurences('director')
import numpy as np
import math
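def entropy(p):
    # Shannon entropy (in bits) of a distribution given as raw counts
    p = np.asarray(p, dtype=float)  # accept plain lists as well as arrays
    if p.sum() == 0:
        return 0.0
    p = p / p.sum()
    p = p[p > 0]  # drop zero entries so log2 is defined
    H = -np.sum(p * np.log2(p))
    return H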
dtm.get_token_occurences('director').values()
director_dis = list(dtm.get_token_occurences('director').values())
entropy(director_dis)
director_docs = list(dtm.get_token_occurences('director').keys())
director_docs
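import re
# Count the reviews containing "director" whose file name has four digits (counted as Berardinelli's)
count_B = 0
for item in director_docs:
    m = re.search(r'^\d{4}\.txt$', item)
    if m:
        count_B += 1
print(count_B)
# Count the reviews containing "director" whose file name has five digits (counted as Schwartz's)
count_S = 0
for item in director_docs:
    m = re.search(r'^\d{5}\.txt$', item)
    if m:
        count_S += 1
print(count_S)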
# Build a 2x2 array. Rows represent "Berardinelli" and "Schwartz" respectively.
# Columns represent the number of reviews that contain the word "director"
# and the number of reviews that do NOT contain it.
array = np.reshape((count_B,80-count_B,count_S,100-count_S),(2,2))
array
np.sum(array,axis = 0)
np.sum(array,axis = 1)
marginal_entropy = entropy(np.sum(array,axis = 1))
column_probs = np.sum(array,axis = 0)/180
column_probs
# Entropy of the author distribution within each column (axis=0 passes each column to entropy)
column_entropy = np.apply_along_axis(entropy, 0, array)
column_entropy
conditional_entropy = sum(column_probs*column_entropy)
conditional_entropy
# Calculate the mutual information between the word "director" and the two authors
MI_director_authors = marginal_entropy - conditional_entropy
MI_director_authors
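The calculation above relies on the identity I(word; author) = H(author) - H(author | word). As a sanity check, the same value can also be computed directly from the joint counts as the sum of p(a, w) * log2(p(a, w) / (p(a) p(w))). The following is a minimal standard-library sketch of that check; the function names and the counts in `table` are hypothetical and are not taken from the corpus.

```python
import math

def entropy_bits(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def mi_from_entropies(table):
    """I(word; author) = H(author) - H(author | word) for a count table.

    Rows are authors; columns are (contains word, does not contain word).
    """
    n = sum(sum(row) for row in table)
    h_author = entropy_bits([sum(row) for row in table])
    h_cond = 0.0
    for col in zip(*table):  # one column per word outcome
        h_cond += (sum(col) / n) * entropy_bits(col)
    return h_author - h_cond

def mi_direct(table):
    """The same quantity computed from the joint and marginal counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            if c > 0:  # empty cells contribute nothing
                mi += (c / n) * math.log2(c * n / (row_tot[i] * col_tot[j]))
    return mi

# Hypothetical counts for illustration only: 60 of 80 reviews by the first
# author and 40 of 100 reviews by the second author contain the word.
table = [[60, 20], [40, 60]]
print(mi_from_entropies(table), mi_direct(table))
```

If the two printed numbers disagree beyond floating-point noise, the contingency table or one of the entropy steps is likely wrong.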
Solution
No working solution to this problem has been found yet.
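As a possible starting point, the single-word calculation from the question can be repeated for every word in the vocabulary and the results sorted. The sketch below is not a confirmed answer; it assumes each review is available as a list of tokens with a parallel list of author labels. The names `mi_bits`, `top_words_by_mi`, `docs`, and `authors` are hypothetical, and the four toy documents merely stand in for the 180 reviews.

```python
import math
from collections import Counter

def mi_bits(n11, n10, n01, n00):
    """Mutual information (bits) of a 2x2 table.

    Rows are authors; columns are (word present, word absent).
    """
    table = [[n11, n10], [n01, n00]]
    n = n11 + n10 + n01 + n00
    rows = [n11 + n10, n01 + n00]
    cols = [n11 + n01, n10 + n00]
    return sum((c / n) * math.log2(c * n / (rows[i] * cols[j]))
               for i, row in enumerate(table)
               for j, c in enumerate(row) if c > 0)

def top_words_by_mi(docs, authors, k=10):
    """Rank words by mutual information with the author label.

    docs: list of token lists; authors: parallel list holding two distinct labels.
    """
    label_a = authors[0]
    n_a = sum(1 for a in authors if a == label_a)
    n_b = len(authors) - n_a
    df_a, df_b = Counter(), Counter()  # number of documents containing each word, per author
    for tokens, author in zip(docs, authors):
        (df_a if author == label_a else df_b).update(set(tokens))
    scores = {}
    for w in set(df_a) | set(df_b):
        scores[w] = mi_bits(df_a[w], n_a - df_a[w], df_b[w], n_b - df_b[w])
    # Highest score first; ties are broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# A high score means that knowing whether the word occurs in a review
# removes much of the uncertainty about who wrote it, i.e. the word appears
# in a clearly larger share of one author's reviews than of the other's.
# The score alone does not say which author favours the word.

# Toy stand-in data, not the real corpus
docs = [["the", "plot", "director"], ["the", "director", "cast"],
        ["the", "noir", "plot"], ["the", "noir", "camera"]]
authors = ["Berardinelli", "Berardinelli", "Schwartz", "Schwartz"]
for word, score in top_words_by_mi(docs, authors, k=3):
    print(word, round(score, 3))
```

In the question's setting, `docs` would presumably correspond to the preprocessed `corpus` list and `authors` would be derived from the file ids. Words that occur in every review or in none carry no information and score 0.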