New docnum column after fitNewDocuments in the stm package

Problem description

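I have a corpus of 5,354 news articles, many of which are duplicates. Using the stm package, I fit an stm model on the 906 unique articles, then used alignCorpus and fitNewDocuments to apply the model to the rest of the corpus. Finally, I used make.dt to build a data table of theta values for the whole corpus. This produced a new column called "docnum". I expected it to hold a distinct number for each document. Instead, it contains the numbers 1-906, each appearing 5-6 times and each mapped to the same set of theta values (see screenshot). I don't think it should behave this way, and I don't understand why it does. I could not find much help online for alignCorpus and fitNewDocuments (both part of the stm package), so I would appreciate any ideas or suggestions about what might be going on. It is hard to give a reproducible example for this situation, so my full code and a screenshot of the resulting Excel file are below.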

library(stm)

# Preprocess the 906 unique articles
temp <- textProcessor(documents = NCA4_Data_3$text[1:906], metadata = NCA4_Data_3[1:906, ],
                      lowercase = FALSE, removestopwords = TRUE, removenumbers = TRUE,
                      removepunctuation = TRUE, ucp = TRUE, stem = FALSE, wordLengths = c(3, Inf),
                      sparselevel = 1, language = "en", verbose = TRUE, onlycharacter = FALSE,
                      striphtml = TRUE,
                      customstopwords = c("https","ads","info","privacy","com","gov","via","email","print","embedded","said","will","says","can","like","also","photo","photograph","video","credit","sen","rep","dr","mr","ms","mrs","professor","prof"),
                      v1 = FALSE)

out <- prepDocuments(temp$documents, temp$vocab, temp$meta, lower.thresh = 1, upper.thresh = 815, subsample = NULL, verbose = TRUE)

STM.17 <- stm(documents = out$documents, vocab = out$vocab, K = 17, data = out$meta, prevalence = ~media_type, max.em.its = 1000, init.type = "Spectral", verbose = TRUE)

# Now we process the remaining documents
temp <- textProcessor(documents = NCA4_Data_3$text[907:nrow(NCA4_Data_3)], metadata = NCA4_Data_3[907:nrow(NCA4_Data_3), ])

# Note we don't run prepDocuments here because we don't want to drop any words; we want
# every word that showed up in the old documents.
newdocs <- alignCorpus(new = temp, old.vocab = STM.17$vocab)

# We get some helpful feedback on what has been retained and lost in the printout,
# and now we can fit our new held-out documents.
fitNewDocuments(model = STM.17, documents = newdocs$documents, newData = newdocs$meta, origData = out$meta, prevalencePrior = "Covariate")

# Export Excel file with theta values
stm.17.datatable <- make.dt(STM.17, meta = NCA4_Data_3)
view(stm.17.datatable)
write.xlsx(stm.17.datatable,"~/Desktop/Oct.23.2020/stm.17.datatable.xlsx")

[Screenshot of the resulting Excel file]
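One possible explanation, offered as a sketch and not a confirmed diagnosis: the code above never stores the return value of fitNewDocuments, so STM.17 still holds theta for only the 906 fitted articles. Passing all 5,354 metadata rows to make.dt would then pair a 906-row theta matrix with a longer table, and the 5-6 repeats per docnum are consistent with the shorter block being recycled. The base-R sketch below uses simulated matrices as stand-ins: theta_fit plays the role of STM.17$theta and theta_new the role of the theta element returned by fitNewDocuments. The names theta_fit, theta_new and sim_theta are hypothetical and used only for illustration.

```r
# Simulated stand-ins (assumptions, not real model output):
#   theta_fit ~ STM.17$theta                 (one row per fitted article)
#   theta_new ~ fitNewDocuments(...)$theta   (one row per held-out article)
set.seed(1)
K     <- 17      # number of topics, as in the question
n_fit <- 906     # articles used to fit the model
n_all <- 5354    # articles in the full corpus
n_new <- n_all - n_fit

# Hypothetical helper: random rows that each sum to 1, like topic proportions
sim_theta <- function(n, k) {
  m <- matrix(rexp(n * k), nrow = n)
  m / rowSums(m)
}
theta_fit <- sim_theta(n_fit, K)
theta_new <- sim_theta(n_new, K)

# What recycling 906 row numbers against 5354 metadata rows looks like:
# every docnum shows up either 5 or 6 times.
recycled_docnum <- rep_len(seq_len(n_fit), n_all)
print(table(table(recycled_docnum)))

# Stacking both theta matrices instead gives one row per document.
theta_all <- rbind(theta_fit, theta_new)
colnames(theta_all) <- paste0("Topic", seq_len(K))
result <- data.frame(docnum = seq_len(nrow(theta_all)), theta_all)
```

In the real workflow this would mean assigning the fitNewDocuments result to a variable, row-binding its theta to STM.17$theta, and attaching metadata built from out$meta and newdocs$meta rather than from NCA4_Data_3 directly, since preprocessing can drop documents and the row counts should be checked before combining. Separately, the second textProcessor call uses default settings while the first does not, which may affect how well the new vocabulary aligns with the old one.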
