R: Find words from a lexicon in tweets, count them, and save the counts in a data frame with the tweets

Problem Description

I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built a custom lexicon (formal_lexicon) of roughly one million words, all in a formal language style. I now want to write code that counts, for each tweet, how many words (if any) from that lexicon appear in it.

tweets_data:

                   Content            
1                 "Blablabla"               
2                 "Hi my name is"               
3                 "Yes I need"                 
.  
.
. 
50176            "TEXT50176" 

formal_lexicon:

                       X            
1                 "admittedly"               
2                 "Consequently"               
3                 "Furthermore"                 
.  
.
. 
1000000            "meanwhile"   

So the output should look like this:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0 
.  
.
. 
50176            "TEXT50176"             2

This should be a simple for loop, something like:

for (sentence in tweets_data$Content) {
  for (word in sentence) {
    if (word %in% formal_lexicon) {
      ...
    }
  }
}
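(I assume I would first need to split each tweet into individual words, e.g. with something like the following, so the inner loop actually sees words:)

# split one tweet into a character vector of words (split on whitespace):
words <- strsplit(sentence, "\\s+")[[1]]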

I don't think "word" works here, though, and I'm not sure how to count the matches into a dedicated column when a word is in the lexicon. Can anyone help?

Sample data (via dput):

structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")

c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ","happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art","2017 resolution: to embody authenticity!","Happy Holidays! Sending love and light to every corner of the earth \U0001f381","damn,it's hard to wrap presents when you're drunk. cc @santa","When my whole fam tryna have a peaceful holiday " )

Solution

You can try something like this:

library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc @santa", "When my whole fam tryna have a peaceful holiday ")

# put your tweets in a data.frame with an id
tweets_data_df <- data.frame(Content = tweets_data, id = 1:length(tweets_data))


tweets_data_df %>%
  # split each tweet into one word per row
  unnest_tokens(txt, Content) %>%
  # flag whether the word is in the lexicon - keep the 0s -
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # group by tweet
  group_by(id) %>%
  # count the flagged words per tweet
  summarise(cnt = sum(pres)) %>%
  # put back the texts
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

Result:

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn,it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0

Hope this helps:

library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response helps you to do your task',
    'If it is that case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)


# Counting the words in each tweet that are present in the lexicon
in_lexicon <- tweets %>%
  # separate every word in your tweets into its own row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # determine whether each word is in the lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Bind the counts back to the original data
dplyr::left_join(tweets, in_lexicon)
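If you'd rather avoid tidytext, here is a minimal base-R sketch of the same idea (assuming the full data sit in tweets_data$Content and formal_lexicon$X as in the question; %in% builds a hash table internally, so a lexicon of around a million words is not a problem):

# lowercase the lexicon once, up front
lexicon_set <- tolower(formal_lexicon$X)

# count the lexicon words in a single tweet: lowercase it, split on
# non-letter characters (keeping apostrophes), and test membership
count_lexicon_words <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(words %in% lexicon_set)
}

# apply to every tweet and store the counts next to the texts
tweets_data$Lexicon <- vapply(tweets_data$Content, count_lexicon_words,
                              integer(1), USE.NAMES = FALSE)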