Counting the number of exactly matching words in a string

Problem description

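I have a tibble with an id column and a text_entry column that captures what each person typed.
Goal: compare each person's text_entry against key and count the number of words that were typed perfectly.
For example, if my input is: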

df <- tribble(
  ~id, ~text_entry,
  1, "It was a Saturday night in December.",
  2, " It was a Saturday night",
  3, "It wuz a Sturday nite in",
  4, "IT WAS A SATURDAY",
  5, "was a Saturday"
)
df

key <- "It was a Saturday night in December."

Then I need the following:

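df2 <- tribble(
  ~id, ~text_entry, ~words_correct,
  1, "It was a Saturday night in December.", 7,  # whole string perfect
  2, " It was a Saturday night", 5,              # first 5 words perfect
  3, "It wuz a Sturday nite in", 3,              # misspelled "was", "Saturday" and "night"
  4, "IT WAS A SATURDAY", 0,                     # case-sensitive
  5, "was a Saturday", 3                         # ok to start several words into the key
)
df2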

I'm completely open to stringr / stringi solutions. tidyverse is always preferred, but I'm desperate for any solution.

Thank you very much in advance for your help and insight!

Solution

One approach is to split each string on whitespace and count how many of the resulting words also appear in key.

library(tidyverse)

keywords <- strsplit(key,'\\s+')[[1]]

df %>%
  mutate(text = str_split(text_entry,'\\s+'),words_correct = map_dbl(text,~sum(.x %in% keywords)))

# A tibble: 5 x 3
#     id text_entry                             words_correct
#  <dbl> <chr>                                          <dbl>
#1     1 "It was a Saturday night in December."             7
#2     2 " It was a Saturday night"                         5
#3     3 "It wuz a Sturday nite in"                         3
#4     4 "IT WAS A SATURDAY"                                0
#5     5 "was a Saturday"                                   3

We can also do this in base R:

# Split each entry on whitespace, then count the tokens that also occur in keywords
df$words_correct <- sapply(strsplit(df$text_entry, '\\s+'), function(x) sum(x %in% keywords))
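
For reuse, the base R approach above could be wrapped in a small function. This is only a sketch: the name count_key_words is hypothetical and not from the answer, and the trimws() call is an added assumption that keeps leading or trailing whitespace from producing an empty token.

```r
# Hypothetical helper (not from the original answer): for each entry, count
# the whitespace-separated tokens that also occur among the words of the key.
count_key_words <- function(entries, key) {
  keywords <- strsplit(key, "\\s+")[[1]]
  # trimws() is an assumption: it avoids an empty first token for " It was ..."
  tokens <- strsplit(trimws(entries), "\\s+")
  vapply(tokens, function(x) sum(x %in% keywords), integer(1))
}

key <- "It was a Saturday night in December."
entries <- c(
  "It was a Saturday night in December.",
  " It was a Saturday night",
  "It wuz a Sturday nite in",
  "IT WAS A SATURDAY",
  "was a Saturday"
)

result <- count_key_words(entries, key)
print(result)
# [1] 7 5 3 0 3
```

The comparison is case-sensitive and treats punctuation as part of the word, so "December." only matches when the trailing period is typed as well.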

Alternatively, you can extract the non-whitespace chunks and pass them to str_detect():

library(tidyverse)

df %>%
  mutate(words_correct = map_dbl(str_extract_all(text_entry,"[^\\s]+"),~ sum(str_detect(key,.))))

# # A tibble: 5 x 3
#      id text_entry                             words_correct
#   <dbl> <chr>                                          <dbl>
# 1     1 "It was a Saturday night in December."             7
# 2     2 " It was a Saturday night"                         5
# 3     3 "It wuz a Sturday nite in"                         3
# 4     4 "IT WAS A SATURDAY"                                0
# 5     5 "was a Saturday"                                   3
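
One thing worth checking before relying on the str_detect() version: each extracted chunk is used as a regular-expression pattern and searched anywhere inside key, so a chunk can count as correct when it is only part of a word. The sketch below uses made-up tokens (not from the question's data) to compare it with whole-word matching via %in%.

```r
library(stringr)

key <- "It was a Saturday night in December."
keywords <- str_split(key, "\\s+")[[1]]

# Hypothetical tokens: none of them is a complete word of the key.
tokens <- c("at", "Sat", "nigh.")

# Regex detection: "at" and "Sat" are found inside "Saturday",
# and the "." in "nigh." matches the "t" of "night".
regex_count <- sum(str_detect(key, tokens))

# Whole-word comparison: a token must equal one of the key's words exactly.
exact_count <- sum(tokens %in% keywords)

print(c(regex = regex_count, exact = exact_count))
# regex exact
#     3     0
```

On the question's sample data both answers give the same counts, so the difference only matters if entries can contain fragments of words or regex metacharacters. A chunk such as an unbalanced "(" would be an invalid pattern for str_detect().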
