Defining a mutual information function in Python

Problem description

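I am working with a corpus of 180 movie review documents written by two reviewers. Each document is one reviewer's review of a single movie. The first 80 reviews were written by Berardinelli and the remaining 100 by Schwartz. I have already computed the mutual information between one specific word and the two authors. Now I need to find the ten words that have the highest mutual information with the document author, together with their mutual information values, and to explain (in Python comments) what it means for a word to have high mutual information with the document author. Can anyone help? The code below computes the mutual information between the word "director" and the two authors.
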
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/xniu2/Desktop/PyData/MovieReviews' 
filelists = PlaintextCorpusReader(corpus_root,'.*',encoding='latin-1')
filelists.fileids()




reviews = []
for fileid in filelists.fileids():
    reviews.append(filelists.raw(fileid))




import shorttext
preprocess = shorttext.utils.standard_text_preprocessor_1()
corpus = [preprocess(article).split(' ') for article in reviews]


dtm = shorttext.utils.DocumentTermMatrix(corpus,docids = filelists.fileids())



corpus

dtm.get_token_occurences('director')

import numpy as np
import math

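def entropy(p):
    # Shannon entropy (in bits) of a vector of non-negative counts or probabilities.
    p = np.asarray(p, dtype=float)  # accept plain lists as well as arrays
    if p.sum() == 0:
        return 0.0

    p = p / p.sum()  # normalise to a probability distribution

    p = p[p > 0]  # drop zeros so that log2 is defined

    H = -np.sum(p * np.log2(p))

    return H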
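
A quick sanity check for an entropy function of this kind can catch mistakes before it is used on the corpus. The sketch below assumes the function should accept plain Python lists of counts (hence the conversion to a NumPy array) and compares it with three cases whose entropy is known: a uniform split over two outcomes (1 bit), all mass on one outcome (0 bits), and a uniform split over four outcomes (2 bits).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a vector of non-negative counts."""
    p = np.asarray(p, dtype=float)
    if p.sum() == 0:
        return 0.0
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Known reference values, checked with a tolerance rather than exact equality.
assert np.isclose(entropy([1, 1]), 1.0)        # uniform over two outcomes
assert np.isclose(entropy([4, 0]), 0.0)        # a single certain outcome
assert np.isclose(entropy([1, 1, 1, 1]), 2.0)  # uniform over four outcomes
assert entropy([0, 0]) == 0.0                  # empty counts are treated as zero entropy
```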



dtm.get_token_occurences('director').values()


director_dis = list(dtm.get_token_occurences('director').values())


entropy(director_dis)




director_docs = list(dtm.get_token_occurences('director').keys())


director_docs


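import re

# Count the reviews containing "director" whose file name has four digits
# (these are treated as Berardinelli's reviews below).
count_B = 0
for item in director_docs:
    m = re.search(r'^\d{4}\.txt$', item)
    if m:
        count_B += 1
print(count_B)


# Count the reviews containing "director" whose file name has five digits
# (these are treated as Schwartz's reviews below).
count_S = 0
for item in director_docs:
    m = re.search(r'^\d{5}\.txt$', item)
    if m:
        count_S += 1
print(count_S)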
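
The two counting loops can be folded into one helper, which is convenient once the same count is needed for every word rather than only for "director". This is only a sketch: the helper name `count_by_author` and the example file names are hypothetical, and the mapping of four-digit names to Berardinelli and five-digit names to Schwartz is inferred from how the counts are used in the question.

```python
import re

def count_by_author(doc_ids):
    """Count document ids per author, based on the file name pattern.

    Assumption: NNNN.txt belongs to Berardinelli, NNNNN.txt to Schwartz.
    Names matching neither pattern are ignored.
    """
    counts = {"Berardinelli": 0, "Schwartz": 0}
    for name in doc_ids:
        if re.fullmatch(r"\d{4}\.txt", name):
            counts["Berardinelli"] += 1
        elif re.fullmatch(r"\d{5}\.txt", name):
            counts["Schwartz"] += 1
    return counts

# Hypothetical file names, for illustration only.
example_ids = ["0123.txt", "4567.txt", "12345.txt", "notes.txt"]
print(count_by_author(example_ids))
```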




# Build a 2x2 contingency table. Rows are the authors (Berardinelli, Schwartz);
# columns are the number of reviews that contain "director" and the number that do not.
array = np.array([[count_B, 80 - count_B], [count_S, 100 - count_S]])




array


np.sum(array,axis = 0)


np.sum(array,axis = 1)


marginal_entropy = entropy(np.sum(array,axis = 1))


column_probs = np.sum(array,axis = 0)/180





column_probs


column_entropy = np.apply_along_axis(entropy, 0, array)  # entropy of the author split within each column





column_entropy





conditional_entropy = sum(column_probs*column_entropy)
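
For reference, the quantities in this computation can be written out as follows. The symbols are introduced here for illustration: $A$ is the author of a review and $W$ indicates whether the word occurs in it. The row sums of the table give the distribution of $A$, the column sums divided by 180 give the distribution of $W$, and each column gives the distribution of $A$ for one value of $W$.

```latex
\begin{align}
H(A) &= -\sum_{a} P(a)\,\log_2 P(a) \\
H(A \mid W) &= \sum_{w} P(w)\, H(A \mid W = w) \\
I(A; W) &= H(A) - H(A \mid W)
\end{align}
```

With 80 and 100 reviews per author, $H(A)$ is about 0.99 bits, which is also the largest mutual information any word can reach in this setting.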


conditional_entropy


# Calculate the mutual information between the word "director" and the two authors
MI_director_authors = marginal_entropy - conditional_entropy




MI_director_authors
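
One way to extend the single-word calculation to the ten most informative words is to wrap the same steps in a function and apply it to every word in the vocabulary. The sketch below is not tied to shorttext: it assumes the tokenized reviews are available as a list of token lists (like `corpus` above) and that a parallel list of author labels has been built, for example from the file name patterns. The names `mutual_information` and `top_words` are hypothetical.

```python
from collections import Counter

import numpy as np

def entropy(counts):
    """Shannon entropy in bits of a vector of non-negative counts."""
    p = np.asarray(counts, dtype=float)
    total = p.sum()
    if total == 0:
        return 0.0
    p = p[p > 0] / total
    return float(-np.sum(p * np.log2(p)))

def mutual_information(table):
    """I(author; word presence) from a table of document counts.

    Rows are authors; columns are (word present, word absent).
    """
    table = np.asarray(table, dtype=float)
    marginal = entropy(table.sum(axis=1))                    # H(author)
    column_probs = table.sum(axis=0) / table.sum()           # P(present), P(absent)
    column_entropy = np.apply_along_axis(entropy, 0, table)  # H(author | column)
    return marginal - float(np.sum(column_probs * column_entropy))

def top_words(docs, authors, k=10):
    """Return the k words with the highest mutual information with the author."""
    labels = sorted(set(authors))
    totals = np.array([authors.count(lab) for lab in labels])
    # Per author, count in how many documents each word appears at least once.
    doc_freq = {lab: Counter() for lab in labels}
    for doc, lab in zip(docs, authors):
        doc_freq[lab].update(set(doc))
    vocab = {word for counter in doc_freq.values() for word in counter}
    scores = {}
    for word in vocab:
        present = np.array([doc_freq[lab][word] for lab in labels])
        scores[word] = mutual_information(np.column_stack([present, totals - present]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy data for illustration only; "B" and "S" stand in for the two reviewers.
toy_docs = [["director", "good"], ["director", "bad"], ["plot", "good"], ["plot", "bad"]]
toy_authors = ["B", "B", "S", "S"]
print(top_words(toy_docs, toy_authors, k=2))
```

In the toy data, "director" and "plot" each appear in every review by one author and in none by the other, so knowing whether the word is present removes all uncertainty about the author and the mutual information equals the full author entropy of 1 bit. "good" and "bad" are split evenly between the authors and score 0. The same reading applies to the real corpus: a word with high mutual information is one whose presence or absence in a review says a lot about which of the two reviewers wrote it.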
