Merge two data frames in R based on the maximum number of common words

Problem

I have two data frames, one containing partial names and the other containing full names, as follows:

partial <- data.frame("partial.name" = c("Apple","Apple","WWF","wizz air","WeMove.eu","ILU"))
full <- data.frame("full.name" = c("Apple Inc","wizzair","We Move Europe","World Wide Fundation (WWF)","(ILU)","Ilusion"))

Ideally, I would like to end up with a table like this (my real partial df has 12,794 rows):

print(partial)
partial    full
Apple      Apple Inc
Apple      Apple Inc
WWF        World Wide Fundation (WWF)
wizz air   wizzair
WeMove.eu  We Move Europe
... 12 794 total rows

Every row without a match should get NA.

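I have tried many things: fuzzyjoin, regex, regex_left_join, and even the sqldf package. I got some results, but I think it would work better if regex_left_join understood that I am looking for whole words. I know that stringr has boundary(type = c("word")), but I don't know how to apply it here.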

So far, I have only prepared the partial df by stripping non-alphanumeric characters and converting it to lowercase.

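library(stringr)

# Replace runs of non-word characters with a space, collapse whitespace, then lowercase
partial$regex <- str_squish(str_replace_all(partial$partial.name, regex("\\W+"), " "))
partial$regex <- tolower(partial$regex)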

How can I match partial$partial.name to full$full.name based on the maximum number of common words?

Solution

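Partial string matching takes a long time to get right. I believe the Jaro-Winkler distance is a good candidate, but you will need to spend time tuning the parameters. Here is an example to get you started.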

library(stringdist)

partial <- data.frame("partial.name" = c("Apple","Apple","WWF","wizz air","WeMove.eu","ILU","None"), stringsAsFactors = FALSE)
full <- data.frame("full.name" = c("Apple Inc","wizzair","We Move Europe","World Wide Foundation (WWF)","(ILU)","Ilusion"), stringsAsFactors = FALSE)

# Return the entry of list_of_fulls closest to `partial`,
# or NA if even the closest one is further away than `threshold`.
mydist <- function(partial, list_of_fulls, method = 'jw', p = 0, threshold = 0.4) {
    # stringdist() is vectorised over b, so one call compares `partial`
    # with every full name. method and p must be passed on explicitly;
    # otherwise stringdist() falls back to its default method.
    distances <- stringdist(a = partial, b = list_of_fulls, method = method, p = p)
    # If the distance is too great, assume NA
    if (min(distances) > threshold) {
        NA_character_
    } else {
        list_of_fulls[which.min(distances)]
    }
}

partial$match <- unlist(lapply(partial$partial.name,function(partial) mydist(partial = partial,list_of_fulls = full$full.name,method = 'jw')))

partial
#  partial.name                       match
#1        Apple                   Apple Inc
#2        Apple                   Apple Inc
#3          WWF World Wide Foundation (WWF)
#4     wizz air                     wizzair
#5    WeMove.eu              We Move Europe
#6          ILU                       (ILU)
#7         None                        <NA>
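
Since the question asks specifically about the number of common words, here is a rough base-R sketch of that idea as a complement to the distance-based approach above. The helpers tokenize() and match_by_common_words() are hypothetical names introduced for illustration only. The sketch assumes that a "word" is any run of alphanumeric characters, that comparison is case-insensitive, and that ties are resolved in favour of the first candidate.

```r
# Hypothetical helper: split a name into lowercase alphanumeric words.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words[nzchar(words)]  # drop the empty string left by a leading separator
}

# Hypothetical helper: return the candidate sharing the most words with
# `name`, or NA when no word is shared. Ties go to the first candidate.
match_by_common_words <- function(name, candidates) {
  name_words <- tokenize(name)
  shared <- vapply(
    candidates,
    function(candidate) length(intersect(name_words, tokenize(candidate))),
    integer(1)
  )
  if (length(shared) == 0 || max(shared) == 0) {
    NA_character_
  } else {
    candidates[which.max(shared)]
  }
}

partial <- data.frame(
  partial.name = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", "None"),
  stringsAsFactors = FALSE
)
full <- data.frame(
  full.name = c("Apple Inc", "wizzair", "We Move Europe",
                "World Wide Foundation (WWF)", "(ILU)", "Ilusion"),
  stringsAsFactors = FALSE
)

partial$word_match <- vapply(
  partial$partial.name,
  match_by_common_words,
  character(1),
  candidates = full$full.name,
  USE.NAMES = FALSE
)
```

Note that pure word overlap leaves "wizz air" and "WeMove.eu" as NA, because neither shares a whole word with its intended full name. One option is to use word overlap first and fall back to a string distance such as Jaro-Winkler for the rows that remain unmatched.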