Problem description
I have this data frame (DF1):
structure(list(ID = 1:3,Text = c("there was not clostridium","clostridium difficile positive","test was OK")),class = "data.frame",row.names = c(NA,-3L))
ID TEXT
1 "there was not clostridium"
2 "clostridium difficile positive"
3 "test was OK"
and this data frame (DF2):
structure(list(ID = 1:3,Microorganisms = c("ESCHERICHIA COLI","CLOSTRIDIUM DIFFICILE","FUNGI")),class = "data.frame",row.names = c(NA,-3L))
ID Microorganisms
1 ESCHERICHIA COLI
2 CLOSTRIDIUM DIFFICILE
3 FUNGI
I want to use a regular expression to find the matches between DF1 and DF2 and put them in a new column, like this:
ID TEXT Microorganism
1 "there was not clostridium" CLOSTRIDIUM DIFFICILE
2 "clostridium difficile positive" CLOSTRIDIUM DIFFICILE
3 "test was OK" no
I tried something like this:
DF1 %>% mutate(Mikroorganism = ifelse(grepl(DF2$Microorganisms,TEXT),str_extract(TEXT,DF2$Microorganisms),"no"))
but it does not work.
Solution
One approach is to use the fuzzyjoin package.
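library(dplyr)  # provides %>%, transmute() and select()
DF1 %>%
  fuzzyjoin::regex_left_join(
    # Build one pattern per organism: whitespace becomes "|", so either word can match
    transmute(DF2, Microorganisms, ptn = gsub("\\s+", "|", Microorganisms)),
    by = c("Text" = "ptn"), ignore_case = TRUE) %>%
  select(-ptn)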
# ID Text Microorganisms
# 1 1 there was not clostridium CLOSTRIDIUM DIFFICILE
# 2 2 clostridium difficile positive CLOSTRIDIUM DIFFICILE
# 3 3 test was OK <NA>
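If installing fuzzyjoin is not an option, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the helper `first_match` and the new column name `Microorganism` are made up here, only the first matching organism is kept per row, and unmatched rows get "no" (as in the desired output) instead of `<NA>`.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps text as character on R < 4.0)
DF1 <- data.frame(ID = 1:3,
                  Text = c("there was not clostridium",
                           "clostridium difficile positive",
                           "test was OK"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(ID = 1:3,
                  Microorganisms = c("ESCHERICHIA COLI", "CLOSTRIDIUM DIFFICILE", "FUNGI"),
                  stringsAsFactors = FALSE)

# Same pattern idea as the join above: "CLOSTRIDIUM DIFFICILE" -> "CLOSTRIDIUM|DIFFICILE"
ptn <- gsub("\\s+", "|", DF2$Microorganisms)

# Hypothetical helper: return the first organism whose pattern matches the text, else "no"
first_match <- function(txt) {
  hit <- which(vapply(ptn, grepl, logical(1), x = txt, ignore.case = TRUE))
  if (length(hit) > 0) DF2$Microorganisms[hit[1]] else "no"
}

DF1$Microorganism <- vapply(DF1$Text, first_match, character(1), USE.NAMES = FALSE)
DF1
```

With the sample data, rows 1 and 2 get "CLOSTRIDIUM DIFFICILE" and row 3 gets "no". Note that in both versions the alternation matches on any single word of the organism name, so a text containing only "coli" would match ESCHERICHIA COLI; tighten the pattern if that is too permissive.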