When using the "comparison" option, can quanteda include all n-grams in the word cloud rather than just a random subset? [R] [quanteda]

Problem description

I am using quanteda's "comparison" option in R to generate a word cloud of n-grams grouped by predefined subsamples.

Each time I run the code, the resulting word cloud contains a different subset of the n-grams. The word cloud also shows fewer n-grams than what topfeatures() returns.

My hypothesis is that a random subset of the n-grams is included each time the word cloud is created.

Can anyone suggest a way to include all n-grams in my word cloud, rather than just a random subset?

My code is included below. Thanks for any suggestions!

library(quanteda)
library(quanteda.textplots) # textplot_wordcloud() lives here in quanteda >= 3
library(plyr)

# Example data frame: 12 texts with a treatment code for each.
# Note: Tre must have one value per text; the vector in the original post
# was shorter than Txt, so the codes below are assigned for illustration.
df <- data.frame(
  Tre = c(0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Txt = c("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.",
          "Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem.",
          "Cum sociis natoque penatibus et magnis dis parturient montes, sem. Nulla consequat massa quis enim.",
          "Donec quam felis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.",
          "Nulla consequat massa quis enim. Donec pede justo, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.",
          "In enim justo, justo. Nullam dictum felis eu pede mollis pretium.",
          "Nullam dictum felis eu pede mollis pretium. Integer tincidunt",
          "Integer tincidunt. Cras dapibus",
          "Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus.",
          "Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim.",
          "Aenean leo ligula, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus.",
          "Aliquam lorem ante, tellus. Phasellus viverra nulla ut metus varius laoreet."))

## Create variables to identify subsamples by treatment group
df$Treatments <- mapvalues(df$Tre, from = c(0, 1, 2, 3), to = c("Baseline", "T1", "T2", "T3"))


## Create a corpus and identify text field
corp <- corpus(df, text_field = "Txt")

## Identify words as tokens and clean up
doc.tokens <- tokens(corp, what = "word", remove_punct = TRUE)

doc.tokens <- tokens_tolower(doc.tokens) # convert to lower case

doc.tokens <- tokens_remove(doc.tokens, pattern = c("custom word 1", "custom word 2", stopwords('en')), padding = TRUE) # remove custom words and stop words

doc.tokens <- tokens_ngrams(doc.tokens, n = 2, concatenator = " ") # form 2-word phrases (bigrams)
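For reference, a minimal sketch (not part of the original post) of what tokens_ngrams() produces with n = 2 and a space concatenator:

```r
library(quanteda)

# A three-word text yields two overlapping bigrams
toks <- tokens("alpha beta gamma")
tokens_ngrams(toks, n = 2, concatenator = " ")
# The document now contains the bigrams "alpha beta" and "beta gamma"
```

Note that padding = TRUE in the tokens_remove() call above leaves empty placeholders where stop words were removed, which prevents bigrams from being formed across those gaps.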

## Create a document feature matrix (dfm)
doc.dfm <- dfm(doc.tokens)

## Wordcloud - by treatment
dfm_treat <- dfm_group(doc.dfm, groups = Treatments) # group documents by treatment; dfm(x, groups = ...) is deprecated in quanteda v3

topfeatures(dfm_treat,100)
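A diagnostic sketch (my addition, not in the original post): count how many n-grams actually meet the min_count = 2 threshold used by the word cloud below, since anything below that threshold can never appear in the cloud regardless of layout.

```r
## How many n-grams clear the frequency threshold used by the word cloud?
trimmed <- dfm_trim(doc.dfm, min_termfreq = 2)
nfeat(trimmed)   # n-grams occurring at least twice across the corpus
nfeat(doc.dfm)   # total distinct n-grams, for comparison
```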

WC_Tre <- textplot_wordcloud(dfm_treat,
                             min_count = 2,      # minimum number of times an n-gram must appear to be included
                             random_order = FALSE,
                             rotation = 0,       # proportion of words rotated 90 degrees
                             comparison = TRUE,  # group words by "Treatments" categories
                             color = c("grey", "#ffa600", "#163182", "#d92f70"), # grey, yellow, blue, pink
                             labelsize = .65,
                             labelcolor = "#343438",
                             fixed_aspect = FALSE,
                             labeloffset = 0)

You can see that the resulting word clouds differ. For example: word cloud 1 has 14 yellow bigrams and 5 pink bigrams, while word cloud 2 has 16 yellow bigrams and 4 pink bigrams.
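One likely explanation (my reading, not confirmed in the post): the word-cloud layout places words at randomized positions and silently drops any word it cannot fit in the plotting area, so the visible subset depends on the RNG state at plotting time. A common mitigation, sketched here as an assumption rather than a confirmed fix, is to fix the seed immediately before plotting:

```r
## Sketch: fixing the RNG seed before each plot makes the layout, and hence
## the set of words that fit, identical across runs (assumes the layout
## randomness is the only source of run-to-run variation)
set.seed(42)  # arbitrary, but fixed, seed
textplot_wordcloud(dfm_treat, min_count = 2, comparison = TRUE)
```

This makes the cloud reproducible but does not force every n-gram to appear; enlarging the graphics device or reducing max_size can also help more words fit.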

Word cloud 1

Word cloud 2

Workaround

No effective workaround for this problem has been found yet.

