Table of n-grams that identifies the row in which each text appears

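Problem description

I want to build a table in which the n-grams appear in one column, together with the row number of the data frame from which each n-gram was built.

For example, the following code builds the n-grams (quadgrams in this case):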

# Libraries
library(quanteda)
library(data.table)
library(tidyverse)
library(stringr)

# Dataframe
Data <- data.frame(
  Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302),
  Column2 = c(654231, 12347, -2365, 90000, 12897),
  Column3 = c('A1', 'B2', 'E3', 'C1', 'F5'),
  Column4 = c('I bought it', 'The flower has a beautiful fragrance', 'It was bought by me', 'I have bought it', 'The flower smells good'),
  Column5 = c('Good', 'Bad', 'Ok', 'Moderate', 'Perfect')
)

# Text column of interest
TextColumn <- Data$Column4

# Corpus
Content <-  corpus(TextColumn)

# Tokenization
Tokens <- tokens(
  Content,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = FALSE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  include_docvars = TRUE,
  padding = FALSE
)

Tokens <- tokens_tolower(Tokens)

# n-grams

quadgrams <- dfm(tokens_ngrams(Tokens,n = 4))
quadgrams_freq <- textstat_frequency(quadgrams)                  # quadgram frequency
quadgrs <- subset(quadgrams_freq,select=c(feature,frequency))
names(quadgrs) <- c("ngram","freq")
quadgrs <- as.data.table(quadgrs)

The result is a table with the columns ngram and freq (shown as an image in the original post).

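Is there a way to extract the row number of Column4 from which the words of each n-gram came? For example, the table above should have a column containing 2 (the row number) for "the_flower_has_a", 2 again for "flower_has_a_beautiful", and so on.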

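Solution

You can pass a groups argument to textstat_frequency() that corresponds to the document names, which gives you a reference back to the original "row number".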

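library("quanteda")
## Package version: 2.1.2
# Note: from quanteda 3.0 onward, textstat_frequency() is provided by the
# separate quanteda.textstats package, which must be loaded as well.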

library("data.table")

# Dataframe
Data <- data.frame(
  Column1 = c(1.222, 3.445, 5.621, 8.501, 9.302),
  Column2 = c(654231, 12347, -2365, 90000, 12897),
  Column3 = c("A1", "B2", "E3", "C1", "F5"),
  Column4 = c("I bought it", "The flower has a beautiful fragrance", "It was bought by me", "I have bought it", "The flower smells good"),
  Column5 = c("Good", "Bad", "Ok", "Moderate", "Perfect")
)

# Corpus: name each document after its row number in Data
Content <- corpus(Data, text_field = "Column4")
docnames(Content) <- seq_len(nrow(Data))

# Tokenization and ngrams
Tokens <- tokens(
  Content,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
) %>%
  tokens_tolower() %>%
  tokens_ngrams(n = 4)

Now for the grouping part:

# form the result
quadgrs <- textstat_frequency(dfm(Tokens), groups = docnames(Tokens)) %>%
  as.data.table()
setnames(quadgrs, "group", "rownumber")

quadgrs[, c("feature", "frequency", "rownumber")]
##                      feature frequency rownumber
## 1:          the_flower_has_a         1         2
## 2:    flower_has_a_beautiful         1         2
## 3: has_a_beautiful_fragrance         1         2
## 4:          it_was_bought_by         1         3
## 5:          was_bought_by_me         1         3
## 6:          i_have_bought_it         1         4
## 7:    the_flower_smells_good         1         5

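Notes:

  1. I simplified your code a little, since some of it was unnecessary or could be shortened.
  2. The frequency counts are now computed within each row (document), so an n-gram that occurs in several rows appears more than once in the output table, each time with its within-row frequency. If you would rather have the overall frequency of n-grams that occur in multiple rows, this code is easy to modify to show that. (Let me know if you want that.)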
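
One possible way to get the overall frequency mentioned in note 2 is to aggregate the per-row table by n-gram with data.table. This is only a sketch, not part of the original answer: the small quadgrs table below is a hypothetical stand-in for the real output (it adds an invented row 6 so that one n-gram repeats), and the names overall, freq and rows are chosen here purely for illustration.

```r
library(data.table)

# Hypothetical stand-in for the per-row table built above; row "6" is invented
# so that "the_flower_has_a" occurs in more than one row.
quadgrs <- data.table(
  feature   = c("the_flower_has_a", "i_have_bought_it", "the_flower_has_a"),
  frequency = c(1L, 1L, 2L),
  rownumber = c("2", "4", "6")
)

# One line per n-gram: total frequency plus every row it came from
overall <- quadgrs[, .(
  freq = sum(frequency),
  rows = paste(rownumber, collapse = ",")
), by = feature]

overall
```

Each n-gram then appears once, with the summed frequency and a comma-separated list of the rows that contain it. Keeping the rows as a single string is just one choice; a list column would work too if the row numbers are needed for further processing.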
