Getting clusters of words rather than clusters of rows (documents)

Question

I tried the following approach:

library(quanteda)

# stringsAsFactors = FALSE keeps `text` as character on R versions before 4.0
dataset1 <- data.frame(
  anumber = c(1, 2, 3),
  text = c(
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
    "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum",
    "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."
  ),
  stringsAsFactors = FALSE
)

myDfm <- dataset1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 1)

tstat_dist <- textstat_simil(myDfm, method = "cosine")

# hierarchical clustering of the distance object
pres_cluster <- hclust(as.dist(tstat_dist))
# label with document names
pres_cluster$labels <- docnames(myDfm)
# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "", main = "Cosine distance on Token Frequency")

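to extract clusters of words, but in the final plot I get the document names, i.e. one leaf for every row I have. Is there any change I can make so that the clusters show the words from the text rather than the document names?

I would like to see words such as these: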

textstat_frequency(myDfm,n = 5)
  feature frequency rank docfreq group
1     the        10    1       3   all
2      of         7    2       3   all
3   lorem         6    3       3   all
4   ipsum         6    3       3   all
5       a         5    5       2   all

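Answer

Yes: you need the margin = "features" argument when computing the distances. (And you can remove the label assignment.) So the last part of your code should be: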

# compute the distance on features, not documents
tstat_dist <- textstat_simil(myDfm, method = "cosine", margin = "features")
# hierarchical clustering of the distance object
pres_cluster <- hclust(as.dist(tstat_dist))
# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "", main = "Cosine Distance on Token Frequency")

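However, for hierarchical clustering you should compute a distance measure rather than cosine similarity. hclust() interprets its input as dissimilarities, so passing similarities through as.dist() makes the most similar words look the furthest apart.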
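
To illustrate that last point, here is a minimal base R sketch of one common way to do it: compute the cosine similarity between features and cluster on 1 minus that similarity. The small count matrix `m` and the names `cos_sim`, `cos_dist` and `feat_cluster` are made up for this example and are not part of the question's data; treating 1 - cosine similarity as the dissimilarity is an assumption about what is wanted, not the only option.

```r
# Toy document-feature counts (hypothetical data, 3 documents x 4 features)
m <- matrix(c(2, 2, 0, 1,
              1, 1, 0, 0,
              0, 0, 3, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("text", 1:3),
                            c("lorem", "ipsum", "latin", "text")))

# Cosine similarity between features (columns of m)
norms <- sqrt(colSums(m^2))
cos_sim <- crossprod(m) / (norms %o% norms)

# Convert the similarity into a dissimilarity before clustering
cos_dist <- as.dist(1 - cos_sim)

# Hierarchical clustering; leaf labels are the feature names
feat_cluster <- hclust(cos_dist)
plot(feat_cluster, xlab = "", sub = "", main = "Cosine distance between features")
```

With quanteda itself, textstat_dist() takes the same margin argument as textstat_simil() and returns distances directly, so it may be the simpler choice if one of its distance methods suits your data.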