Problem description
I have two sample data frames, df1 and df2, shown below.
df1 lists selected tennis fixtures, with the player names (player1_name, player2_name) and the date of the match. Full names are used for the players here.
df2 lists all tennis match results (winner, loser) for each date. Here the players are given by first-name initial and surname.
The player names for fixtures and results were scraped from different websites, so in some cases the surnames may not match exactly.
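With this in mind, I would like to add a new column to df1 that says whether player 1 or player 2 won. Basically, I want to map player1_name and player2_name in df1 to winner and loser in df2 by a partial match, given the same date.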
df1
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
    player1_name = c("Laslo Djere", "Hugo Dellien", "Quentin Halys", "Steve Johnson",
        "Henri Laaksonen", "Thiago Monteiro", "Andrej Martin"),
    player2_name = c("Kevin Anderson", "Ricardas Berankis", "Marcos Giron", "Roberto Carballes",
        "Pablo Cuevas", "Nikoloz Basilashvili", "Joao Sousa")),
    row.names = c(NA, -7L), class = "data.frame")
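I have written a custom function that uses the RecordLinkage package to match a string to the closest string in a vector of strings. I could use this function to write some extremely inefficient code, but before going down that route I wanted to see whether this can be done more efficiently.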
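The custom function itself is not shown. As a rough illustration of what such a helper might look like, here is a minimal sketch that uses base R's adist (generalized Levenshtein distance) instead of RecordLinkage. The name closest_match and the misspelled example are invented for illustration and are not the original function.

```r
# Hypothetical helper (not the RecordLinkage-based function from the question):
# return the element of `candidates` closest to `x` by edit distance,
# ignoring case. On a tie, which.min() keeps the first candidate.
closest_match <- function(x, candidates) {
  d <- adist(x, candidates, ignore.case = TRUE)
  candidates[which.min(d)]
}

# A misspelled surname is mapped to the nearest surname in the vector.
closest_match("Basilashvilli", c("Basilashvili", "Berankis", "Balazs"))
```

Calling a helper like this once per name is the part that gets slow on larger data, which is presumably the inefficiency the question refers to.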
dput(df2)
# The date vector is truncated in the source; all 20 rows are assumed to share the date 18534.
structure(list(date = structure(rep(18534, 20), class = "Date"),
    winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin", "A Davidovich Fokina",
        "D Lajovic", "K Anderson", "R Berankis", "M Giron", "A Rublev", "N Djokovic",
        "R Carballes Baena", "A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas",
        "D Shapovalov", "G Dimitrov", "R Bautista Agut", "A Martin"),
    loser = c("A Popyrin", "V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot",
        "G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey", "M Ymer", "S Johnson",
        "Y Uchiyama", "H Laaksonen", "N Basilashvili", "J Munar", "G Simon", "G Barrere",
        "R Gasquet", "J Sousa")),
    row.names = c(NA, -20L), class = "data.frame")
Solution
I gave it a try using stringdist:
library(stringdist)

for (i in seq_len(nrow(df1))) {
  # Combine the names of player1 and player2 and find the closest match
  # among the winner/loser combinations in df2.
  # I like the cosine method as it returns a decimal as opposed to an integer.
  d <- stringdist(
    paste(df1$player1_name[i], df1$player2_name[i]),
    paste(df2$winner, df2$loser),
    method = "cosine"
  )
  # Add winner and loser columns to df1 from the row in df2 with the closest
  # match (i.e. lowest stringdist); which.min() takes the first row on a tie.
  best <- which.min(d)
  df1$winner[i] <- df2$winner[best]
  df1$loser[i] <- df2$loser[best]
}

# Second loop: change the names in the winner/loser columns to their
# closest match in the player1 and player2 columns.
for (i in seq_len(nrow(df1))) {
  n <- stringdist(
    df1$winner[i],
    c(df1$player1_name[i], df1$player2_name[i]),
    method = "cosine"
  )
  if (n[1] > n[2]) {
    df1$winner[i] <- df1$player2_name[i]
    df1$loser[i] <- df1$player1_name[i]
  }
  if (n[1] < n[2]) {
    df1$winner[i] <- df1$player1_name[i]
    df1$loser[i] <- df1$player2_name[i]
  }
}
> df1
        date    player1_name         player2_name            winner                loser
1 2020-09-29     Laslo Djere       Kevin Anderson    Kevin Anderson          Laslo Djere
2 2020-09-29    Hugo Dellien    Ricardas Berankis Ricardas Berankis         Hugo Dellien
3 2020-09-29   Quentin Halys         Marcos Giron      Marcos Giron        Quentin Halys
4 2020-09-29   Steve Johnson    Roberto Carballes Roberto Carballes        Steve Johnson
5 2020-09-29 Henri Laaksonen         Pablo Cuevas      Pablo Cuevas      Henri Laaksonen
6 2020-09-29 Thiago Monteiro Nikoloz Basilashvili   Thiago Monteiro Nikoloz Basilashvili
7 2020-09-29   Andrej Martin           Joao Sousa     Andrej Martin           Joao Sousa
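The loops above compare each fixture against every row of df2 and never use the date column. If the results span several dates, one possible variant is to restrict the candidates to the same date first and compare surnames only. The sketch below is untested against the full data and uses base R's adist rather than stringdist. The helper names surname_key and who_won are invented for illustration, and the three-row data frames are a cut-down copy of the sample data.

```r
# Hypothetical helper: drop the first token (first name or initial) and
# lowercase the rest, so "Kevin Anderson" and "K Anderson" both give "anderson".
surname_key <- function(x) tolower(sub("^\\S+\\s+", "", x))

# Hypothetical helper: for each fixture, look only at results on the same date
# and decide which assignment (player1 won / player2 won) has the smaller
# total edit distance. No distance threshold is applied, so a fixture with no
# real counterpart in `results` still gets a label; ties go to "player1".
who_won <- function(fixtures, results) {
  out <- rep(NA_character_, nrow(fixtures))
  for (i in seq_len(nrow(fixtures))) {
    same_day <- results[results$date == fixtures$date[i], , drop = FALSE]
    if (nrow(same_day) == 0) next
    p1 <- surname_key(fixtures$player1_name[i])
    p2 <- surname_key(fixtures$player2_name[i])
    w <- surname_key(same_day$winner)
    l <- surname_key(same_day$loser)
    d_p1 <- adist(p1, w)[1, ] + adist(p2, l)[1, ]  # cost if player1 won
    d_p2 <- adist(p2, w)[1, ] + adist(p1, l)[1, ]  # cost if player2 won
    out[i] <- if (min(d_p1) <= min(d_p2)) "player1" else "player2"
  }
  out
}

# Cut-down copy of the sample data (three of the seven fixtures).
d <- as.Date("2020-09-29")
fixtures <- data.frame(
  date = d,
  player1_name = c("Laslo Djere", "Steve Johnson", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Roberto Carballes", "Nikoloz Basilashvili")
)
results <- data.frame(
  date = d,
  winner = c("K Anderson", "R Carballes Baena", "T Monteiro"),
  loser = c("L Djere", "S Johnson", "N Basilashvili")
)
fixtures$result <- who_won(fixtures, results)
```

Comparing surnames only keeps the first-name initial from diluting the distance, but it relies on the first token always being a first name or initial, which may not hold for every scraped name.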