String matching with stringdist

Question

I have two data frames with department names similar to these:

d1 <- data.frame(depto=c("antioquia","arauca","cauca","popayan cauca","guayabal cundinamarca","cundinamarca","fresno - tolima","tolima","santander","norte santander"))
d2 <- data.frame(depto=c("Antioquia","arauca","Cauca","Cundinamarca","Vichada","Tolima","norte de Santander","Valle del Cauca","Santander"),id=c(1,2,3,4,5,6,7,8,9))

The variable "depto" should be the same in both, but there are some differences. I tried using stringdist to match the two data frames.

library(fuzzyjoin)  # provides stringdist_left_join
stringdist_left_join(d1, d2, by = "depto", distance_col = NULL)

The result is:

1              antioquia    Antioquia  1
2                 arauca       arauca  2
3                 arauca        Cauca  3
4                 arauca       arauca  2
5                 arauca        Cauca  3
6                  cauca       arauca  2
7                  cauca        Cauca  3
8          popayan cauca         <NA> NA
9  guayabal cundinamarca         <NA> NA
10          cundinamarca Cundinamarca  4
11          cundinamarca Cundinamarca  4
12       fresno - tolima         <NA> NA
13                tolima       Tolima  6
14             santander    Santander  9
15       norte santander         <NA> NA

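I would like to know how to improve this. The first problem is that the departments Cauca and Arauca always get matched to each other. The second problem is that some entries in d1 include the municipality (e.g. "guayabal cundinamarca"); I would like the matching method to recognise that, despite the extra words, the department name is contained in the string, and match on it (Cundinamarca in this case). The third problem is that some department names share a word but are different departments (e.g. Norte de Santander and Santander), and sometimes they do not match. Thanks.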

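Answer

If the depto column is generally reliable (that is, you don't have major data-entry problems such as typos), you can use regex_left_join from the fuzzyjoin package, which treats each d2 name as a regular expression and looks for it inside the d1 value: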

library(fuzzyjoin)
d1 <- data.frame(depto=c("antioquia","arauca","cauca","popayan cauca","guayabal cundinamarca","cundinamarca","fresno - tolima","tolima","santander","norte santander"))
d2 <- data.frame(depto=c("Antioquia","Arauca","Cauca","Cundinamarca","Vichada","Tolima","Norte de Santander","Valle del Cauca","Santander"),id=c(1,2,3,4,5,6,7,8,9))
# d2$depto is used as the pattern; ignore_case = TRUE makes "Cauca" match "cauca"
fuzzyjoin::regex_left_join(d1, d2, by = "depto", ignore_case = TRUE)

Output:

                 depto.x      depto.y id
1              antioquia    Antioquia  1
2                 arauca       Arauca  2
3                 arauca       Arauca  2
4                  cauca        Cauca  3
5          popayan cauca        Cauca  3
6  guayabal cundinamarca Cundinamarca  4
7           cundinamarca Cundinamarca  4
8           cundinamarca Cundinamarca  4
9        fresno - tolima       Tolima  6
10                tolima       Tolima  6
11             santander    Santander  9
12       norte santander    Santander  9
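
To see why the two joins behave differently, here is a minimal base R sketch of the two matching rules. It is only an illustration, not fuzzyjoin's internals: utils::adist computes plain Levenshtein distance, which I am assuming is close enough to the join's default metric for these strings, and contains() is a hypothetical helper wrapping grepl. As far as I know, stringdist_left_join accepts pairs up to max_dist = 2 by default, which is why "cauca" and "arauca" were paired in the question.

```r
# Base R only; a sketch of the two matching rules, not fuzzyjoin's own code.

# Rule 1: edit distance. "cauca" and "arauca" are 2 edits apart
# (insert "a", substitute "c" -> "r"), so a threshold of 2 pairs them up.
d_cauca_arauca <- utils::adist("cauca", "arauca")[1, 1]

# Rule 2: regex containment. Hypothetical helper: is the d2 name (pattern)
# found anywhere inside the d1 string (text), ignoring case?
contains <- function(text, pattern) grepl(pattern, text, ignore.case = TRUE)

# "arauca" does not contain "cauca", so the false match disappears.
arauca_has_cauca <- contains("arauca", "Cauca")

# Extra words such as the municipality do not prevent a match.
guayabal_has_cund <- contains("guayabal cundinamarca", "Cundinamarca")

# "norte santander" lacks the "de", so only the shorter name is found in it.
norte_has_full  <- contains("norte santander", "Norte de Santander")
norte_has_short <- contains("norte santander", "Santander")
```

This also explains the last row of the output: "norte santander" is joined to Santander (id 9) rather than Norte de Santander (id 7), so the third problem from the question still needs separate handling, for example by cleaning those names before the join.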