Why does the Biterm Topic Model (BTM) return coherence scores around -100?

Problem description

I am using the biterm.cbtm library to train a topic model on roughly 2,500 short posts. Once BTM finishes, I get the following 10 topics, together with the topic coherence values shown in this screenshot: https://ibb.co/Kqy992H

I am trying to understand what these negative coherence values mean and why they are so low. I have read a lot of the related research, but I could not find a paper that explains the expected range of coherence values. Moreover, most papers discuss coherence for LDA, since BTM is not well documented.

Does anyone know the range and meaning of the coherence values I am getting? Why do the coherence values fall between -76 and -111?
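From what I can tell (an assumption on my part, since the biterm package does not document it), the score printed by topic_summuary looks like the UMass coherence of Mimno et al. (2011): for each topic it sums log((D(wi, wj) + 1) / D(wj)) over the pairs of top words, where D(w) counts the documents containing w. Each term is the log of a ratio that is usually well below 1, so the sum is always negative, and its magnitude grows with the number of top words. A minimal sketch of that formula, with a hypothetical umass_coherence helper:

import numpy as np

def umass_coherence(X, top_word_ids):
    """UMass-style coherence for one topic: sum of log((D(wi, wj) + 1) / D(wj)).

    X is a document-term count matrix and top_word_ids holds the column
    indices of the topic's top words, most probable first. This is a sketch
    of the metric, not the biterm library's exact implementation.
    """
    D = X > 0  # document-word incidence matrix
    score = 0.0
    for i in range(1, len(top_word_ids)):
        for j in range(i):
            wi, wj = top_word_ids[i], top_word_ids[j]
            co_docs = np.sum(D[:, wi] & D[:, wj])  # documents containing both words
            wj_docs = np.sum(D[:, wj])             # documents containing wj
            score += np.log((co_docs + 1) / wj_docs)
    return score

With 10 top words there are 45 word pairs, so pairwise terms of around -2 each already put the total near -90; if this is indeed the metric, scores between -76 and -111 would not be unusual.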

My code is below.

from sklearn.feature_extraction.text import CountVectorizer
from biterm.cbtm import oBTM
from biterm.utility import vec_to_biterms, topic_summuary  # helper functions

import pickle
import re
import warnings

import numpy as np
import pandas as pd
import pyLDAvis
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

warnings.filterwarnings('ignore')  # suppress warnings to keep the output readable


def docs_preprocessor(docs):
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(docs)):
        # Strip URLs, then lowercase.
        docs[idx] = re.sub(r'(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*', '', docs[idx])
        docs[idx] = docs[idx].lower()
        if len(docs[idx]) < 50:
            docs[idx] = ''  # drop very short posts so they tokenize to an empty list
        docs[idx] = tokenizer.tokenize(docs[idx])  # split into words
    # Remove numbers, but not words that contain numbers.
    docs = [[token for token in doc if not token.isdigit()] for doc in docs]
    # Remove words shorter than four characters.
    docs = [[token for token in doc if len(token) > 3] for doc in docs]
    # Lemmatize all words in the documents.
    lemmatizer = WordNetLemmatizer()
    docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]
    return docs

with open('cleantext.p', 'rb') as handle:
    data = pickle.load(handle)  # pickled dict with a 'text' key holding the raw posts

data['text'] = list(filter(None.__ne__, data['text']))  # drop None entries
print("Total posts: " + str(len(data['text'])))
p_df = pd.DataFrame.from_dict(data)
docs = np.array(p_df['text'])

print("ALL DOCUMENTS: " + str(len(docs)))
docs = docs_preprocessor(docs)
total_docs = 0
with open("posts.txt", "w+") as outfile:
    for sentence in docs:
        if len(sentence) < 3:  # skip documents with fewer than three tokens
            continue
        total_docs += 1
        for word in sentence:
            # Strip any digits still embedded inside tokens.
            result = ''.join([ch for ch in word if not ch.isdigit()])
            outfile.write(result + " ")
        outfile.write("\n")

print("Total docs: " + str(total_docs))
print("Reading sentences. . .")
texts = open('posts.txt', 'r').read().splitlines()

vec = CountVectorizer(stop_words='english')
print("Building Vectors. . .")
X = vec.fit_transform(texts).toarray()  # document-term count matrix
print("Building Vocabulary. . .")
vocab = np.array(vec.get_feature_names())
biterms = vec_to_biterms(X)  # all word pairs per document

print("BTM modelling. . .")
btm = oBTM(num_topics=10, V=vocab)

print("\n\n Train Online BTM ..")
btm.fit(biterms, iterations=100)
topics = btm.transform(biterms)

print("\n\n Topic coherence ..")
topic_summuary(btm.phi_wz.T, X, vocab, 10)  # prints top-10 words and coherence per topic

# I was getting a weird error from pyLDAvis here: prepare() was missing the
# vocab argument; its signature is (topic_term_dists, doc_topic_dists,
# doc_lengths, vocab, term_frequency).
print("\n\n Visualize Topics ..")
vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(X, axis=1), vocab, np.sum(X, axis=0))
pyLDAvis.save_html(vis, 'btm.html')
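
For what it's worth, the coherence could be cross-checked with gensim, which the script already depends on. The sketch below is an assumption on my part, not part of the biterm API, and it presumes the tokens in texts line up with the CountVectorizer vocabulary:

# Hypothetical sanity check: recompute u_mass coherence for the same top-10
# words with gensim's CoherenceModel. Topic words missing from the gensim
# dictionary would raise an error, so this assumes the vocabularies match.
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

tokenized = [line.split() for line in texts]
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# phi_wz is the word-topic matrix (V x K); take the top-10 words per topic.
top_words = [[vocab[w] for w in np.argsort(btm.phi_wz[:, z])[::-1][:10]]
             for z in range(10)]

cm = CoherenceModel(topics=top_words, corpus=corpus, dictionary=dictionary,
                    coherence='u_mass')
print("gensim u_mass per topic:", cm.get_coherence_per_topic())

The gensim numbers will also be negative but much closer to zero, since, as far as I know, gensim averages the per-pair log scores instead of summing them.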
