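Problem description
Here is some sample data in which two tables need to be left-joined on people's full names, with df1 as the left table and df2 as the right: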
library(dplyr)  # provides the %>% pipe used below
df1 <- data.frame(fullName = 'Michael Gadson',age = 53) %>%
rbind(data.frame(fullName = 'Mike Gardnero',age = 43)) %>%
rbind(data.frame(fullName = 'Nicholas Richards',age = 13)) %>%
rbind(data.frame(fullName = 'Mikey Richards',age = 53)) %>%
rbind(data.frame(fullName = 'DeAndre Jamison',age = 28)) %>%
rbind(data.frame(fullName = 'Anthony Allison',age = 21)) %>%
rbind(data.frame(fullName = 'Ricky Smith',age = 82)) %>%
rbind(data.frame(fullName = 'Smith Rickie',age = 60)) %>%
rbind(data.frame(fullName = 'Johnny Williams',age = 60))
df2 <- data.frame(playerName = 'Mike Gadson',color = 'red') %>%
rbind(data.frame(playerName = 'Anthony Allison',color = 'green')) %>%
rbind(data.frame(playerName = 'Mike Gardnero',color = 'purple')) %>%
rbind(data.frame(playerName = "De Andre' Jamison",color = 'orange')) %>%
rbind(data.frame(playerName = 'Nicholas Richards III',color = 'yellow')) %>%
rbind(data.frame(playerName = 'John Kind',color = 'grey')) %>%
rbind(data.frame(playerName = 'Mike Richards',color = 'white')) %>%
rbind(data.frame(playerName = 'Rick Smith',color = 'blue')) %>%
rbind(data.frame(playerName = 'Smith Rickie',color = 'black')) %>%
rbind(data.frame(playerName = 'Anthony Albados',color = 'violet'))
output_df <- data.frame(fullName = 'Michael Gadson',age = 53,playerName = 'Mike Gadson',color = 'red') %>%
rbind(data.frame(fullName = 'Mike Gardnero',age = 43,playerName = 'Mike Gardnero',color = 'purple')) %>%
rbind(data.frame(fullName = 'Nicholas Richards',age = 13,playerName = 'Nicholas Richards III',color = 'yellow')) %>%
rbind(data.frame(fullName = 'Mikey Richards',age = 53,playerName = 'Mike Richards',color = 'white')) %>%
rbind(data.frame(fullName = 'DeAndre Jamison',age = 28,playerName = "De Andre' Jamison",color = 'orange')) %>%
rbind(data.frame(fullName = 'Anthony Allison',age = 21,playerName = 'Anthony Allison',color = 'green')) %>%
rbind(data.frame(fullName = 'Ricky Smith',age = 82,playerName = 'Rick Smith',color = 'blue')) %>%
rbind(data.frame(fullName = 'Smith Rickie',age = 60,playerName = 'Smith Rickie',color = 'black')) %>%
rbind(data.frame(fullName = 'Johnny Williams',age = 60,playerName = NA,color = NA))
> output_df
fullName age playerName color
1 Michael Gadson 53 Mike Gadson red
2 Mike Gardnero 43 Mike Gardnero purple
3 Nicholas Richards 13 Nicholas Richards III yellow
4 Mikey Richards 53 Mike Richards white
5 DeAndre Jamison 28 De Andre' Jamison orange
6 Anthony Allison 21 Anthony Allison green
7 Ricky Smith 82 Rick Smith blue
8 Smith Rickie 60 Smith Rickie black
9 Johnny Williams 60 <NA> <NA>
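Some notes on the tricky cases / edge cases:
- This is a left join, so output_df should have the same number of rows as the left data frame df1.
- The left join should not mix up similar names: Michael Gadson -> Mike Gadson, not one of the other Mike names.
- The left join should not be confused by reversed names (Ricky Smith -> Rick Smith, not Smith Rickie).
- The left join should not be thrown off by name suffixes such as III, or by extra spaces or symbols (De Andre' vs. DeAndre).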
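The suffix, spacing, and punctuation cases above could be reduced before any fuzzy matching by comparing on a normalized key. The following is only a rough sketch in base R: the helper name normalize_name and the short suffix list are assumptions made for illustration, and the character filter drops anything outside a-z, so accented names would need different handling.

```r
# Hypothetical helper (not part of any package): build a comparison key
# by lower-casing, dropping common suffixes and punctuation, and
# collapsing repeated whitespace.
normalize_name <- function(x) {
  x <- tolower(x)
  # Remove standalone generational suffixes (illustrative list only).
  x <- gsub("\\b(jr|sr|ii|iii|iv)\\b\\.?", "", x, perl = TRUE)
  # Keep only lowercase ASCII letters and spaces.
  x <- gsub("[^a-z ]", "", x)
  # Collapse runs of whitespace and trim both ends.
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

normalize_name("Nicholas Richards III")  # "nicholas richards"
normalize_name("De Andre' Jamison")      # "de andre jamison"
```

The normalized key would be used only for matching; the original fullName and playerName columns can be kept for the output.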
Edit: I tried the following, which gives this output:
zed <- fuzzyjoin::stringdist_left_join(x=df1,y=df2,max_dist = 0.3,by=c('fullName'='playerName'),method = 'jaccard')
> zed
fullName age playerName color
1 Michael Gadson 53 Mike Gadson red
2 Mike Gardnero 43 Mike Gadson red
3 Mike Gardnero 43 Mike Gardnero purple
4 Nicholas Richards 13 Nicholas Richards III yellow
5 Mikey Richards 53 Mike Richards white
6 DeAndre Jamison 28 De Andre' Jamison orange
7 Anthony Allison 21 Anthony Allison green
8 Richard Smith 82 Rich Smith blue
9 Smith Rickie 60 Rich Smith blue
10 Smith Rickie 60 Smith Rickie black
11 Johnny Williams 60 <NA> <NA>
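It does a decent job, but it isn't perfect. Most notably, with method jaccard and a max_dist of 0.3, Mike Gardnero and Smith Rickie are duplicated, because more than one row on the right-hand side meets the similarity criterion. The output should not contain these duplicates (perhaps by keeping only the right-hand value with the highest similarity).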
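One possible way to drop the duplicates, offered as a minimal sketch rather than a definitive answer: ask fuzzyjoin to record each pair's distance through its distance_col argument, then keep only the closest right-hand row for every left-hand row. The helper column names row_id and dist are made up for this illustration, and ties are broken arbitrarily by taking the first candidate.

```r
library(dplyr)
library(fuzzyjoin)

# Same sample data as in the question, written compactly.
df1 <- data.frame(
  fullName = c("Michael Gadson", "Mike Gardnero", "Nicholas Richards",
               "Mikey Richards", "DeAndre Jamison", "Anthony Allison",
               "Ricky Smith", "Smith Rickie", "Johnny Williams"),
  age = c(53, 43, 13, 53, 28, 21, 82, 60, 60)
)
df2 <- data.frame(
  playerName = c("Mike Gadson", "Anthony Allison", "Mike Gardnero",
                 "De Andre' Jamison", "Nicholas Richards III", "John Kind",
                 "Mike Richards", "Rick Smith", "Smith Rickie",
                 "Anthony Albados"),
  color = c("red", "green", "purple", "orange", "yellow", "grey",
            "white", "blue", "black", "violet")
)

best_match <- df1 %>%
  # row_id (assumed name) identifies each left-hand row, so repeated
  # names in df1 would still be treated as separate rows.
  mutate(row_id = row_number()) %>%
  stringdist_left_join(df2, by = c("fullName" = "playerName"),
                       method = "jaccard", max_dist = 0.3,
                       distance_col = "dist") %>%
  group_by(row_id) %>%
  # Keep the single closest candidate; unmatched rows have dist = NA
  # and are kept because they are the only row in their group.
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-row_id, -dist)
```

Because an exact match has distance 0, Mike Gardnero and Smith Rickie each keep only their identical name, and the result has one row per row of df1. This only removes the duplicates; it does not by itself guarantee that the closest name is the intended person.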