Problem description
d1 <- data.frame(depto=c("antioquia","arauca","cauca","popayan cauca","guayabal cundinamarca","cundinamarca","fresno - tolima","tolima","santander","norte santander"))
d2 <- data.frame(depto=c("Antioquia","arauca","Cauca","Cundinamarca","Vichada","Tolima","norte de Santander","Valle del Cauca","Santander"),id=c(1,2,3,4,5,6,7,8,9))
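The variable "depto" should be the same in both data frames, but there are some differences. I tried to match the two data frames using string distance, with stringdist_left_join from the fuzzyjoin package.
library(fuzzyjoin)
stringdist_left_join(d1,d2,by ="depto",distance_col = NULL)
The result is: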
1 antioquia Antioquia 1
2 arauca arauca 2
3 arauca Cauca 3
4 arauca arauca 2
5 arauca Cauca 3
6 cauca arauca 2
7 cauca Cauca 3
8 popayan cauca <NA> NA
9 guayabal cundinamarca <NA> NA
10 cundinamarca Cundinamarca 4
11 cundinamarca Cundinamarca 4
12 fresno - tolima <NA> NA
13 tolima Tolima 6
14 santander Santander 9
15 norte santander <NA> NA
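I would like to know a way to improve this. The first problem is that the departments Cauca and arauca are always matched to each other. The second problem is that some entries in d1 include the municipality (e.g. "guayabal cundinamarca"); I want the matching method to recognize that the string contains a few extra words but still includes the department name, and to match on that (Cundinamarca in this case). The third problem is that some department names share a word but are different departments (e.g. norte de Santander and Santander), and these sometimes fail to match. Thanks.
Solution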
If the depto column is generally reliable (i.e. there are no major data-entry problems such as typos), you can use regex_left_join from the fuzzyjoin package:
library(fuzzyjoin)
d1 <- data.frame(depto=c("antioquia","arauca","cauca","popayan cauca","guayabal cundinamarca","cundinamarca","fresno - tolima","tolima","santander","norte santander"))
d2 <- data.frame(depto=c("Antioquia","Arauca","Cauca","Cundinamarca","Vichada","Tolima","Norte de Santander","Valle del Cauca","Santander"),id=c(1,2,3,4,5,6,7,8,9))
fuzzyjoin::regex_left_join(d1,d2,by ="depto",ignore_case = TRUE)
Output:
depto.x depto.y id
1 antioquia Antioquia 1
2 arauca Arauca 2
3 arauca Arauca 2
4 cauca Cauca 3
5 popayan cauca Cauca 3
6 guayabal cundinamarca Cundinamarca 4
7 cundinamarca Cundinamarca 4
8 cundinamarca Cundinamarca 4
9 fresno - tolima Tolima 6
10 tolima Tolima 6
11 santander Santander 9
12 norte santander Santander 9
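Note that in the output above "norte santander" is matched to Santander (id 9) rather than Norte de Santander (id 7). The regex join uses each d2 name as a pattern, and the literal text "Norte de Santander" does not occur in "norte santander". One possible way around this, given here only as a rough sketch in base R and not as part of fuzzyjoin, is to make the connector words "de"/"del" optional in each pattern and, when several departments match, keep the longest name. The names patterns and match_one are made up for this illustration.

```r
# Sketch only: base R, no fuzzyjoin. "patterns" and "match_one" are hypothetical helpers.
d1 <- data.frame(depto = c("antioquia", "arauca", "cauca", "popayan cauca",
                           "guayabal cundinamarca", "cundinamarca",
                           "fresno - tolima", "tolima", "santander",
                           "norte santander"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(depto = c("Antioquia", "Arauca", "Cauca", "Cundinamarca",
                           "Vichada", "Tolima", "Norte de Santander",
                           "Valle del Cauca", "Santander"),
                 id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 stringsAsFactors = FALSE)

# One pattern per department: lower-cased, connector "de"/"del" made optional,
# wrapped in word boundaries, e.g. "\\bnorte (de )?santander\\b".
patterns <- paste0("\\b",
                   gsub(" (de|del) ", " (\\1 )?", tolower(d2$depto)),
                   "\\b")

# For one d1 string, return the row of d2 whose pattern matches;
# if several match, prefer the longest department name.
match_one <- function(x) {
  hits <- which(vapply(patterns, grepl, logical(1), x = x, perl = TRUE,
                       USE.NAMES = FALSE))
  if (length(hits) == 0) return(NA_integer_)
  hits[which.max(nchar(d2$depto[hits]))]
}

idx <- vapply(tolower(d1$depto), match_one, integer(1), USE.NAMES = FALSE)

result <- data.frame(depto.x = d1$depto,
                     depto.y = d2$depto[idx],
                     id      = d2$id[idx],
                     stringsAsFactors = FALSE)
print(result)
```

With this rule "norte santander" matches both the Santander and the Norte de Santander patterns, and the longer name is kept. It still assumes the department names themselves are spelled correctly; typos would need a distance-based step on top.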