Compare two columns of two data frames based on partial string matching

Problem description

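I have two sample data frames, df1 and df2, shown below. df1 lists selected tennis match fixtures, with the player names (player1_name and player2_name) and the match date. Full names are used for the players here.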

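df2 lists all tennis match results for each date (winner and loser). Here, the first-name initial and the surname are used. The player names for the fixtures and the results were scraped from different websites, so in some cases the surnames may not match exactly. With this in mind, I would like to add a new column to df1 stating whether player 1 or player 2 won. Basically, I want to map player1_name and player2_name in df1 to winner and loser in df2 via a partial match, given the same date.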
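To make the goal concrete, here is a minimal base R sketch of one way to do it: shorten each full name to "initial + surname", look only at results from the same date, and check which orientation (player 1 as winner, or player 2 as winner) gives the smaller edit distance. The helpers `abbreviate_name` and `who_won` are hypothetical names, the toy data frames are cut down from the samples in this post, and the 2020-09-30 row is invented purely to show the date filter.

```r
# Hypothetical helper: "Laslo Djere" -> "L Djere"
abbreviate_name <- function(full_name) {
  sub("^(\\w)\\w*\\s+", "\\1 ", full_name)
}

# Hypothetical helper: returns "player1" or "player2" for a single fixture
who_won <- function(date, p1, p2, results) {
  same_day <- results[results$date == date, ]
  if (nrow(same_day) == 0) return(NA_character_)
  a1 <- abbreviate_name(p1)
  a2 <- abbreviate_name(p2)
  # Total edit distance under each possible orientation, per result row
  d_p1_wins <- as.vector(utils::adist(a1, same_day$winner)) +
    as.vector(utils::adist(a2, same_day$loser))
  d_p2_wins <- as.vector(utils::adist(a2, same_day$winner)) +
    as.vector(utils::adist(a1, same_day$loser))
  best <- which.min(pmin(d_p1_wins, d_p2_wins))
  if (d_p1_wins[best] <= d_p2_wins[best]) "player1" else "player2"
}

fixtures <- data.frame(
  date = as.Date(c("2020-09-29", "2020-09-29")),
  player1_name = c("Laslo Djere", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Nikoloz Basilashvili"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  # The first row is made up (different date) to show that it is ignored
  date = as.Date(c("2020-09-30", "2020-09-29", "2020-09-29")),
  winner = c("L Djere", "K Anderson", "T Monteiro"),
  loser = c("K Anderson", "L Djere", "N Basilashvili"),
  stringsAsFactors = FALSE
)

fixtures$result <- vapply(seq_len(nrow(fixtures)), function(i) {
  who_won(fixtures$date[i], fixtures$player1_name[i],
          fixtures$player2_name[i], results)
}, character(1))
```

Plain edit distance is only one possible similarity measure; any string distance could be swapped in at the `adist` calls.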

df1
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
    player1_name = c("Laslo Djere", "Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen", "Thiago Monteiro", "Andrej Martin"),
    player2_name = c("Kevin Anderson", "Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas", "Nikoloz Basilashvili", "Joao Sousa")),
    row.names = c(NA, -7L), class = "data.frame")

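I have created a custom function that uses the RecordLinkage package to match a string to its closest string in a vector of strings. I could write some very inefficient code with this function, but before going down that route I wanted to see whether this can be done more efficiently.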
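The custom function itself is not shown, so the following is only a sketch of what such a closest-match helper might look like. It uses base R's `adist` (Levenshtein distance) as a stand-in for RecordLinkage, and `closest_match` is a hypothetical name.

```r
# Hypothetical helper: for each element of x, return the candidate
# with the smallest edit distance (ties go to the first candidate)
closest_match <- function(x, candidates) {
  d <- utils::adist(x, candidates, ignore.case = TRUE)
  unname(candidates[apply(d, 1, which.min)])
}

# A misspelled surname still resolves to the intended player
closest_match("N Basilashvilli", c("N Basilashvili", "N Djokovic", "A Balazs"))
```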

dput(df2)
# Assumes every result falls on 2020-09-29 (18534), the same date as in df1
structure(list(date = structure(rep(18534, 20), class = "Date"),
    winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin", "A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis", "M Giron", "A Rublev", "N Djokovic", "R Carballes Baena", "A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov", "G Dimitrov", "R Bautista Agut", "A Martin"),
    loser = c("A Popyrin", "V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot", "G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey", "M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili", "J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa")),
    row.names = c(NA, -20L), class = "data.frame")

Solution

I gave this a try using stringdist:

library(stringdist)

# Create the result columns up front so that element-wise assignment works
df1$winner <- NA_character_
df1$loser <- NA_character_

for (i in seq_len(nrow(df1))) {

  # This first part combines the names of player1 and player2
  # and finds the closest match to the player combinations in df2
  d <- stringdist(
    paste(df1$player1_name[i], df1$player2_name[i]),
    paste(df2$winner, df2$loser),
    method = "cosine"
  )
  # I like using the cosine method as it returns a decimal as opposed to an integer

  # Then add winner and loser to df1 based on which row in df2 had the closest match
  # (i.e. lowest stringdist); which.min() keeps only the first row if there is a tie
  best <- which.min(d)
  df1$winner[i] <- df2$winner[best]
  df1$loser[i] <- df2$loser[best]
}

# A second loop that changes the names in the winner/loser columns
# to their closest match in the player1 and player2 columns
for (i in seq_len(nrow(df1))) {
  n <- stringdist(
    df1$winner[i],
    c(df1$player1_name[i], df1$player2_name[i]),
    method = "cosine"
  )
  if (n[1] > n[2]) {
    df1$winner[i] <- df1$player2_name[i]
    df1$loser[i] <- df1$player1_name[i]
  }
  if (n[1] < n[2]) {
    df1$winner[i] <- df1$player1_name[i]
    df1$loser[i] <- df1$player2_name[i]
  }
}
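
A note on why the first loop can paste the names in fixture order even though df2 stores them in winner/loser order: with the default q = 1, the cosine method in stringdist compares character-count profiles, so the order of the words does not change the distance. The base R sketch below illustrates that calculation; `cosine_char_distance` is a hypothetical name, and the function is meant to show the idea rather than reproduce the package's implementation.

```r
# Hypothetical helper: cosine distance between single-character count vectors
cosine_char_distance <- function(a, b) {
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  alphabet <- union(chars_a, chars_b)
  # Count how often each character of the shared alphabet occurs
  va <- as.numeric(table(factor(chars_a, levels = alphabet)))
  vb <- as.numeric(table(factor(chars_b, levels = alphabet)))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

fixture <- "Laslo Djere Kevin Anderson"
d_winner_first <- cosine_char_distance(fixture, "K Anderson L Djere")
d_loser_first  <- cosine_char_distance(fixture, "L Djere K Anderson")

# Swapping winner and loser leaves the distance unchanged
all.equal(d_winner_first, d_loser_first)
```

This order-insensitivity is also why the second loop is still needed to decide which of the two players was the winner.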

> df1
        date    player1_name         player2_name            winner                loser
1 2020-09-29     Laslo Djere       Kevin Anderson    Kevin Anderson          Laslo Djere
2 2020-09-29    Hugo Dellien    Ricardas Berankis Ricardas Berankis         Hugo Dellien
3 2020-09-29   Quentin Halys         Marcos Giron      Marcos Giron        Quentin Halys
4 2020-09-29   Steve Johnson    Roberto Carballes Roberto Carballes        Steve Johnson
5 2020-09-29 Henri Laaksonen         Pablo Cuevas      Pablo Cuevas      Henri Laaksonen
6 2020-09-29 Thiago Monteiro Nikoloz Basilashvili   Thiago Monteiro Nikoloz Basilashvili
7 2020-09-29   Andrej Martin           Joao Sousa     Andrej Martin           Joao Sousa
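
The question asked for a column saying whether player 1 or player 2 won. Once the winner column holds the full name, that flag can be derived with a single comparison. A small sketch, using two rows copied from the output above and a hypothetical column name `winning_player`:

```r
matched <- data.frame(
  player1_name = c("Laslo Djere", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Nikoloz Basilashvili"),
  winner = c("Kevin Anderson", "Thiago Monteiro"),
  stringsAsFactors = FALSE
)

# "player1" if the winner is player1_name, otherwise "player2"
matched$winning_player <- ifelse(
  matched$winner == matched$player1_name, "player1", "player2"
)
```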