R: Find words from a lexicon in tweets, count them, and save the counts in the data frame with the tweets


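I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built my own lexicon (formal_lexicon) of roughly 1 million words, all in a formal register. Now I want to write code that counts, for each tweet, how many words (if any) from that lexicon it contains.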


tweets_data:

                  Content
1                 "Blablabla"
2                 "Hi my name is"
3                 "Yes I need"
50176             "TEXT50176"


formal_lexicon:

1                 "admittedly"
2                 "Consequently"
3                 "Furthermore"
1000000           "meanwhile"


Desired output:

                  Content             Lexicon
1                 "TEXT1"                1
2                 "TEXT2"                3
3                 "TEXT3"                0
50176             "TEXT50176"            2

It should be a simple for loop, something like:

for(sentence in tweets_data$Content){
  for(word in sentence){
    if(word %in% formal_lexicon){
      # count the match for this tweet
    }
  }
}


library(dplyr)
library(tidytext)

# some fake phrases and lexicon
formal_lexicon <- structure(
  list(X = c("admittedly", "consequently", "conversely",
             "considerably", "essentially", "furthermore")),
  row.names = c(NA, 6L),
  class = "data.frame"
)
tweets_data <- c(
  "@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ",
  "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art",
  "2017 resolution: to embody authenticity!",
  "Happy Holidays! Sending love and light to every corner of the earth \U0001f381",
  "Damn,it's hard to wrap presents when you're drunk. cc @santa",
  "When my whole fam tryna have a peaceful holiday "
)

# put your tweets in a data.frame, with an id per tweet
tweets_data_df <- data.frame(Content = tweets_data, id = seq_along(tweets_data))

tweets_data_df %>%
  # get the words, one per row
  unnest_tokens(txt, Content) %>%
  # add a field that flags whether the word is in the lexicon - keep the 0s -
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # grouping
  group_by(id) %>%
  # summarise
  summarise(cnt = sum(pres)) %>%
  # put back the texts
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)

Joining, by = "id"
# A tibble: 6 x 3
     id Content                                                              cnt
  <int> <chr>                                                              <dbl>
1     1 "@barackobama Thank you for your incredible grace in leadership a~     0
2     2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~     0
3     3 "2017 resolution: to embody authenticity!"                             0
4     4 "Happy Holidays! Sending love and light to every corner of the ea~     0
5     5 "Damn,it's hard to wrap presents when you're drunk. cc @santa"        0
6     6 "When my whole fam tryna have a peaceful holiday "                     0
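
Every count above is 0 simply because none of the sample tweets contains a word from the sample lexicon. To check that matches are actually counted, here is a minimal base R sketch of the same idea, using strsplit() instead of tidytext. The two texts, the three-word lexicon, and the names tweets, lexicon, lexicon_lc, tokens, counts and result are made up for illustration. Note that unnest_tokens() lowercases tokens by default, so lexicon entries such as "Consequently" only match if the lexicon is lowercased too; the sketch does that explicitly.

```r
# Hypothetical sample data: two short texts and a tiny lexicon
tweets <- c("Admittedly, this is hard. Furthermore, it is slow.",
            "Nothing formal here")
lexicon <- c("admittedly", "Consequently", "furthermore")

# Lowercase the lexicon so that "Consequently" can match "consequently"
lexicon_lc <- tolower(lexicon)

# Lowercase each tweet and split it on anything that is not
# a letter, a digit or an apostrophe (a rough tokenizer, an assumption)
tokens <- strsplit(tolower(tweets), "[^a-z0-9']+")

# Count, per tweet, how many tokens appear in the lexicon
counts <- vapply(tokens, function(w) sum(w %in% lexicon_lc), integer(1))

# Store the counts next to the texts
result <- data.frame(Content = tweets, Lexicon = counts)
print(result)
```

The first text gets a count of 2 and the second a count of 0. Repeated lexicon words are counted each time they occur; wrap the tokens in unique() if each word should count only once per tweet.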



library(dplyr)  # provides %>%

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response help you to do your task',
    'If it is the case, please upvote and mark as the correct answer'
  )
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)

# Counting words in tweets present in your lexicon
in_lexicon <- tweets %>%
  # To put every word of your tweets on its own row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # Determining if a word is in your lexicon
  dplyr::mutate(
    in_lexicon = words %in% lexicon$word
  ) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Binding count and the original data