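Problem description
I have a dataset of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also built a custom lexicon (formal_lexicon) of about 1 million words, all in a formal register. I now want to write a short piece of code that counts, for each tweet, how many words (if any) appear in that lexicon.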
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
So the output should look like this:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
It should be a simple for loop, for example:
for(sentence in tweets_data$Content){
for(word in sentence){
if(word %in% formal_lexicon){
...
}
}
}
I don't think "word" works as written, and if a word is in the lexicon I'm not sure how to record the count in a specific column. Can anyone help?
structure(list(X = c("admittedly","consequently","conversely","considerably","essentially","furthermore")),row.names = c(NA,6L),class = "data.frame")
c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ","happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art","2017 resolution: to embody authenticity!","Happy Holidays! Sending love and light to every corner of the earth \U0001f381","damn,it's hard to wrap presents when you're drunk. cc @santa","When my whole fam tryna have a peaceful holiday " )
Solution
You could try something like this:
library(tidytext)
library(dplyr)

# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly","consequently","conversely","considerably","essentially","furthermore")),row.names = c(NA,6L),class = "data.frame")
tweets_data <- c("@barackobama Thank you for your incredible grace in leadership and for being an exceptional… ","happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles County Museum of Art","2017 resolution: to embody authenticity!","Happy Holidays! Sending love and light to every corner of the earth \U0001f381","Damn,it's hard to wrap presents when you're drunk. cc @santa","When my whole fam tryna have a peaceful holiday " )

# put your tweets in a data.frame, with an id to group on
tweets_data_df <- data.frame(Content = tweets_data,
                             id = seq_along(tweets_data),
                             stringsAsFactors = FALSE)

tweets_data_df %>%
  # one row per word; unnest_tokens() lower-cases the tokens by default,
  # so the lexicon entries should be lower case as well
  unnest_tokens(txt, Content) %>%
  # flag each word: 1 if it is in the lexicon, 0 otherwise (keeps the zeros)
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # grouping by tweet
  group_by(id) %>%
  # sum the flags to get the count per tweet
  summarise(cnt = sum(pres)) %>%
  # put back the texts (joins on the shared column, id)
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)
Result:
Joining, by = "id"
# A tibble: 6 x 3
id Content cnt
<int> <chr> <dbl>
1 1 "@barackobama Thank you for your incredible grace in leadership a~ 0
2 2 "happy 96th gma #fourmoreyears! \U0001f388 @ LACMA Los Angeles Co~ 0
3 3 "2017 resolution: to embody authenticity!" 0
4 4 "Happy Holidays! Sending love and light to every corner of the ea~ 0
5 5 "Damn,it's hard to wrap presents when you're drunk. cc @santa" 0
6 6 "When my whole fam tryna have a peaceful holiday " 0
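Every count above is 0 simply because none of the six sample tweets contains a lexicon word. As a minimal sketch of the same idea on data that does produce matches, the toy tweets and lexicon below are made up for illustration (tweets_df, lexicon, lexicon_words, counts and result are hypothetical names). It assumes that the lexicon may contain capitalised entries such as "Consequently", as in the question, and therefore lower-cases it to match the tokens, which unnest_tokens() lower-cases by default. It also joins from the tweets to the counts, so a tweet with no matches still gets a row with 0.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data, chosen so that some counts are non-zero
tweets_df <- data.frame(
  id = 1:3,
  Content = c("Admittedly, this is hard. Furthermore, it is slow.",
              "Consequently we stopped",
              "!!!"),
  stringsAsFactors = FALSE
)
lexicon <- data.frame(X = c("admittedly", "Consequently", "Furthermore"),
                      stringsAsFactors = FALSE)

# Lower-case the lexicon once so it matches the lower-cased tokens
lexicon_words <- unique(tolower(lexicon$X))

# One row per word, keep only lexicon words, then count them per tweet
counts <- tweets_df %>%
  unnest_tokens(word, Content) %>%
  filter(word %in% lexicon_words) %>%
  count(id, name = "Lexicon")

# Join from the tweets so that tweets without any match are kept,
# and turn their missing count into 0
result <- tweets_df %>%
  left_join(counts, by = "id") %>%
  mutate(Lexicon = coalesce(Lexicon, 0L))

result
```

Filtering before counting keeps the intermediate table small, which may matter with 50,176 tweets; whether it is noticeably faster on the real data has not been measured here.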

I hope this is useful for you:
library(magrittr)
library(dplyr)
library(tidytext)

# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response helps you to do your task',
    'If it is the case, please upvote and mark as the correct answer'
  ),
  stringsAsFactors = FALSE
)

lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote'),
  stringsAsFactors = FALSE
)

# Counting the words in each tweet that are present in your lexicon
in_lexicon <- tweets %>%
  # Split every tweet into one row per word
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # Flag whether each word is in your lexicon
  dplyr::mutate(in_lexicon = words %in% lexicon$word) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Join the counts back to the original data
dplyr::left_join(tweets, in_lexicon, by = "id")
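For comparison, here is a rough base R sketch that stays close to the for loop in the question. The loop there fails because each element of tweets_data$Content is one whole string, so the inner loop sees the entire tweet as a single "word"; the text has to be split into words first. The splitting rule used here (lower-case, then split on anything that is not a letter or an apostrophe) is an assumption and is cruder than the tokenizer in tidytext, and the data and the names tokens and counts are made up for illustration.

```r
# Hypothetical toy data
tweets <- c("Hello, this is the first tweet",
            "No match here",
            "Correct task, correct response")
lexicon <- c("hello", "first", "response", "task", "correct", "upvote")

# Split each lower-cased tweet into words (crude tokenizer, see assumption above)
tokens <- strsplit(tolower(tweets), "[^[:alpha:]']+")

# Loop over the tweets and count how many of their words are in the lexicon;
# a word that occurs twice in a tweet is counted twice
counts <- integer(length(tweets))
for (i in seq_along(tweets)) {
  counts[i] <- sum(tokens[[i]] %in% lexicon)
}

data.frame(Content = tweets, Lexicon = counts, stringsAsFactors = FALSE)
```

Here %in% is applied to all words of a tweet at once, so no inner loop over words is needed.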