Fuzzy left join on people's full names in R: handling tricky edge cases

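Problem description

Here is sample data in which people's full names in two tables need to be left-joined, with df1 as the left table and df2 as the right: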

library(dplyr)  # provides the %>% pipe used below

df1 <- data.frame(fullName = 'Michael Gadson',age = 53) %>%
  rbind(data.frame(fullName = 'Mike Gardnero',age = 43)) %>%
  rbind(data.frame(fullName = 'Nicholas Richards',age = 13)) %>%
  rbind(data.frame(fullName = 'Mikey Richards',age = 53)) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison',age = 28)) %>%
  rbind(data.frame(fullName = 'Anthony Allison',age = 21)) %>%
  rbind(data.frame(fullName = 'Ricky Smith',age = 82)) %>%
  rbind(data.frame(fullName = 'Smith Rickie',age = 60)) %>%
  rbind(data.frame(fullName = 'Johnny Williams',age = 60))

df2 <- data.frame(playerName = 'Mike Gadson',color = 'red') %>%
  rbind(data.frame(playerName = 'Anthony Allison',color = 'green')) %>%
  rbind(data.frame(playerName = 'Mike Gardnero',color = 'purple')) %>%
  rbind(data.frame(playerName = "De Andre' Jamison",color = 'orange')) %>%
  rbind(data.frame(playerName = 'Nicholas Richards III',color = 'yellow')) %>%
  rbind(data.frame(playerName = 'John Kind',color = 'grey')) %>%
  rbind(data.frame(playerName = 'Mike Richards',color = 'white')) %>%
  rbind(data.frame(playerName = 'Rick Smith',color = 'blue')) %>%
  rbind(data.frame(playerName = 'Smith Rickie',color = 'black')) %>%
  rbind(data.frame(playerName = 'Anthony Albados',color = 'violet'))

output_df <- data.frame(fullName = 'Michael Gadson',age = 53,playerName = 'Mike Gadson',color = 'red') %>%
  rbind(data.frame(fullName = 'Mike Gardnero',age = 43,playerName = 'Mike Gardnero',color = 'purple')) %>%
  rbind(data.frame(fullName = 'Nicholas Richards',age = 13,playerName = 'Nicholas Richards III',color = 'yellow')) %>%
  rbind(data.frame(fullName = 'Mikey Richards',age = 53,playerName = 'Mike Richards',color = 'white')) %>%
  rbind(data.frame(fullName = 'DeAndre Jamison',age = 28,playerName = "De Andre' Jamison",color = 'orange')) %>%
  rbind(data.frame(fullName = 'Anthony Allison',age = 21,playerName = 'Anthony Allison',color = 'green')) %>%
  rbind(data.frame(fullName = 'Ricky Smith',age = 82,playerName = 'Rick Smith',color = 'blue')) %>%
  rbind(data.frame(fullName = 'Smith Rickie',age = 60,playerName = 'Smith Rickie',color = 'black')) %>%
  rbind(data.frame(fullName = 'Johnny Williams',age = 60,playerName = NA,color = NA))

> output_df
           fullName age            playerName  color
1    Michael Gadson  53           Mike Gadson    red
2     Mike Gardnero  43         Mike Gardnero purple
3 Nicholas Richards  13 Nicholas Richards III yellow
4    Mikey Richards  53         Mike Richards  white
5   DeAndre Jamison  28     De Andre' Jamison orange
6   Anthony Allison  21       Anthony Allison  green
7       Ricky Smith  82            Rick Smith   blue
8      Smith Rickie  60          Smith Rickie  black
9   Johnny Williams  60                  <NA>   <NA>

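Some notes on the tricky cases and edge cases:

  • This is a left join, so output_df should have the same number of rows as the left data frame df1.
  • The left join should not mix up similar names: Michael Gadson -> Mike Gadson, not one of the other Mike names.
  • The left join should not be confused by reversed names (Ricky Smith -> Rick Smith, not Smith Rickie).
  • The left join should not be thrown off by the name suffix III, or by extra spaces or symbols (De Andre' vs. DeAndre).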
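
The suffix, spacing and symbol cases above could be handled by normalising both name columns before any distance is computed. The following is only a minimal sketch in base R: normalize_name is a hypothetical helper, the suffix list is an assumption, and the character filter assumes plain ASCII names (accented letters would be dropped).

```r
# Hypothetical helper: reduce a name to a comparable form.
# Assumptions: ASCII names only; the suffix list is illustrative, not exhaustive.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)                                # drop apostrophes, digits, symbols
  x <- gsub("\\b(jr|sr|ii|iii|iv)\\b", "", x, perl = TRUE)   # drop common name suffixes
  gsub("\\s+", "", x)                                        # remove all whitespace
}

normalize_name(c("De Andre' Jamison", "DeAndre Jamison", "Nicholas Richards III"))
# "deandrejamison" "deandrejamison" "nicholasrichards"
```

Word order is left alone, so a reversed name such as Smith Rickie still stays far from Ricky Smith after normalisation.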

Edit: I tried the following, which gives this output:

zed <- fuzzyjoin::stringdist_left_join(x=df1,y=df2,max_dist = 0.3,by=c('fullName'='playerName'),method = 'jaccard')

> zed
            fullName age            playerName  color
1     Michael Gadson  53           Mike Gadson    red
2      Mike Gardnero  43           Mike Gadson    red
3      Mike Gardnero  43         Mike Gardnero purple
4  Nicholas Richards  13 Nicholas Richards III yellow
5     Mikey Richards  53         Mike Richards  white
6    DeAndre Jamison  28     De Andre' Jamison orange
7    Anthony Allison  21       Anthony Allison  green
8      Richard Smith  82            Rich Smith   blue
9       Smith Rickie  60            Rich Smith   blue
10      Smith Rickie  60          Smith Rickie  black
11   Johnny Williams  60                  <NA>   <NA>

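It does a decent job, but it isn't perfect. Most notably, with method = 'jaccard' and max_dist = 0.3, Mike Gardnero and Smith Rickie are duplicated, because more than one row on the right side meets the similarity threshold. The output should not contain these duplicates; when several right-side rows qualify, probably only the one with the highest similarity should be kept.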
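
One way to avoid the duplicates is to keep, for each left row, only the closest right row, and to return NA when even the closest one is too far away. The sketch below is an assumption-laden illustration rather than a definitive answer. It uses base R's utils::adist (Levenshtein distance) scaled by the longer string's length instead of the Jaccard distance from fuzzyjoin, so the 0.3 threshold is not directly comparable. best_match_left_join and its arguments are hypothetical names.

```r
# Hypothetical helper: left join that keeps at most one (the closest) right row.
# Assumption: normalised Levenshtein distance is an acceptable similarity measure.
best_match_left_join <- function(left, right, left_key, right_key, max_dist = 0.3) {
  a <- tolower(left[[left_key]])
  b <- tolower(right[[right_key]])
  d <- utils::adist(a, b)                       # rows = left names, cols = right names
  d <- d / outer(nchar(a), nchar(b), pmax)      # scale each distance to [0, 1]
  best <- max.col(-d, ties.method = "first")    # column index of the smallest distance
  best[d[cbind(seq_along(best), best)] > max_dist] <- NA   # too far: no match
  out <- cbind(left, right[best, , drop = FALSE])
  rownames(out) <- NULL
  out                                           # same number of rows as `left`
}

# Small demonstration with a subset of the names from the question
left_demo  <- data.frame(fullName = c("Mike Gardnero", "Johnny Williams"), age = c(43, 60))
right_demo <- data.frame(playerName = c("Mike Gadson", "Mike Gardnero"),
                         color = c("red", "purple"))
demo <- best_match_left_join(left_demo, right_demo, "fullName", "playerName")
print(demo)
```

Because each left row selects a single index into the right table, the result always has exactly nrow(left) rows. The threshold and the distance measure would still need tuning against the full data, for example by normalising the names first.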
