How do I merge two data frames in R on a column where only part of the string matches?

Problem description

I have two data frames, data1 and data2, with the following contents:

dput(data1)

structure(list(ProfName = c("Hua (Christine) Xin","Dereck Barr-Pulliam","Lisa M. Blum","Russell  Williamson","William D. Stout","Michael F. Wade","Sheila A.  Johnston","Julie Huang","Alan Attaway","Alan Levitan","Benjamin P. Foster","Carolyn M.  Callahan"),Title = c(" PhD"," PhD"," LLM"," PhD"," PhD"," CPA"," MS"," PhD"," PhD"," PhD"," PhD"," PhD"),Profession = c("Assistant Professor","Assistant Professor","Instructor","Assistant Professor","Associate Professor and Director","Instructor","Instructor","Associate Professor","Professor","Professor","Professor","Brown-Forman Professor of Accountancy"
)),row.names = c(8L,18L,25L,36L,49L,50L,56L,69L,71L,82L,88L,89L),class = "data.frame")

It looks like this:

[image: data1]

dput(data2)

structure(list(ProfName = c("Blandford,K     ","Okafor,A     ","Johnston,S     ","Rolen,R     ","Attaway,A     ","Xin,H     ","Huang,Y     ","Stout,W     ","Williamson,R     ","Callahan,C     ","Foster,B     ","Blum,L     ","Levitan,A     ","Barr-Pulliam,D     ","Wade,M     ")),row.names = c(NA,-15L),class = "data.frame")

data2 looks like this:

[image: data2]

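I want to merge the two data frames, but the names are written differently in each. Only part of the string in the ProfName column matches between the two data frames. The rows should be merged on that match, and where a name has no corresponding information the fields should be left empty. If there is nothing in the Title and Profession columns, the ProfNameNew column should just carry the same name.

I tried merge, but it does not give the desired output: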

merge(data1,data2,by="ProfName",all.x=TRUE,all.y = TRUE)
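
To make the failure concrete, here is a minimal sketch using two names from each data frame (the objects a, b and m are made up for illustration). Because merge compares the ProfName strings exactly, no row of one frame pairs with a row of the other, and the full join simply stacks them:

```r
# Two rows from each data frame, copied from the dput output above
a <- data.frame(ProfName = c("Alan Attaway", "Julie Huang"),
                Title = c(" PhD", " PhD"))
b <- data.frame(ProfName = c("Attaway,A     ", "Huang,Y     "))

# Exact matching finds no common key, so nothing is paired up
m <- merge(a, b, by = "ProfName", all.x = TRUE, all.y = TRUE)
nrow(m)              # 4: every input row appears on its own
sum(is.na(m$Title))  # 2: the rows that came from b have no Title
```

So some shared key, such as the surname, has to be derived from both columns before merging.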

The output should look like this:

[image: desired output]

Solutions

Here is a simple solution:

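library(stringr)
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe

# data1 stores "First Last": take the trailing run of letters/hyphens as the surname
data1 %<>% mutate(lname = str_extract(ProfName,"[A-Za-z\\-]+$"))
# data2 stores "Last,Initial": take the leading run of letters/hyphens as the surname
data2 %<>% mutate(lname = str_extract(ProfName,"^[A-Za-z\\-]+"))

# all.y = TRUE keeps every row of data2, including names with no match in data1
df <- merge(data1,data2,all.y = TRUE,by = "lname")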

head(df)

#          lname           ProfName.x Title                            Profession          ProfName.y
# 1      Attaway         Alan Attaway   PhD                             Professor      Attaway,A     
# 2 Barr-Pulliam  Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam,D     
# 3    Blandford                 <NA>  <NA>                                  <NA>    Blandford,K     
# 4         Blum         Lisa M. Blum   LLM                            Instructor         Blum,L     
# 5     Callahan Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan,C     
# 6       Foster   Benjamin P. Foster   PhD                             Professor       Foster,B     
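
The merged result above has no ProfNameNew column yet. A minimal base-R sketch of one way to add it follows, assuming ProfNameNew should hold the full name from data1 when there is one and otherwise fall back to the data2 spelling (that reading of the question is an assumption). The small df below is a stand-in built from three rows of the output above, not the full merged data:

```r
# Stand-in for the merged result `df` (three rows from the output above)
df <- data.frame(
  lname      = c("Attaway", "Blandford", "Blum"),
  ProfName.x = c("Alan Attaway", NA, "Lisa M. Blum"),
  Title      = c(" PhD", NA, " LLM"),
  ProfName.y = c("Attaway,A     ", "Blandford,K     ", "Blum,L     "),
  stringsAsFactors = FALSE
)

# Use the full name where it exists, else the data2 name;
# trimws() drops the padding spaces carried over from data2
df$ProfNameNew <- ifelse(is.na(df$ProfName.x),
                         trimws(df$ProfName.y),
                         df$ProfName.x)
df$ProfNameNew
```

Rows such as Blandford, which have no match in data1, keep NA in Title and Profession and take their name from data2.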

Alternatively, this works:

> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$','\\2',ProfName))) %>% 
+   right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(,.)','\\1',ProfName))) %>% rename(new = ProfName)) %>% 
+   mutate(ProfName = coalesce(ProfName,new)) %>% 
+   select(-secName)
Joining, by = "secName"
               ProfName Title                            Profession                  new
1   Hua (Christine) Xin   PhD                   Assistant Professor          Xin,H     
2   Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam,D     
3          Lisa M. Blum   LLM                            Instructor         Blum,L     
4   Russell  Williamson   PhD                   Assistant Professor   Williamson,R     
5      William D. Stout   PhD      Associate Professor and Director        Stout,W     
6       Michael F. Wade   CPA                            Instructor         Wade,M     
7   Sheila A.  Johnston    MS                            Instructor     Johnston,S     
8           Julie Huang   PhD                   Associate Professor        Huang,Y     
9          Alan Attaway   PhD                             Professor      Attaway,A     
10         Alan Levitan   PhD                             Professor      Levitan,A     
11   Benjamin P. Foster   PhD                             Professor       Foster,B     
12 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan,C     
13    Blandford,K       <NA>                                  <NA>    Blandford,K     
14       Okafor,A       <NA>                                  <NA>       Okafor,A     
15        Rolen,R       <NA>                                  <NA>        Rolen,R     
> 
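
To see what the two gsub calls in that pipeline actually produce, here is a small sketch on three names taken from the data (the vectors full, short, key_full and key_short are illustrative names only). Both patterns reduce the name to the surname, which then serves as the join key:

```r
full  <- c("Hua (Christine) Xin", "Sheila A.  Johnston", "Dereck Barr-Pulliam")
short <- c("Xin,H     ", "Johnston,S     ", "Barr-Pulliam,D     ")

# "First ... Last": keep only what follows the last whitespace character
key_full  <- trimws(gsub('(.*)\\s(.*)$', '\\2', full))

# "Last,I": drop the comma and the initial, then trim the padding spaces
key_short <- trimws(gsub('(.*)(,.)', '\\1', short))

key_full
key_short
identical(key_full, key_short)
```

Note that a surname-only key would pair the wrong rows if two people shared a surname. That does not happen in this data, but it is worth checking before using the approach on a larger list.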

Data used:

> df
               ProfName Title                            Profession
8   Hua (Christine) Xin   PhD                   Assistant Professor
18  Dereck Barr-Pulliam   PhD                   Assistant Professor
25         Lisa M. Blum   LLM                            Instructor
36  Russell  Williamson   PhD                   Assistant Professor
49     William D. Stout   PhD      Associate Professor and Director
50      Michael F. Wade   CPA                            Instructor
56  Sheila A.  Johnston    MS                            Instructor
69          Julie Huang   PhD                   Associate Professor
71         Alan Attaway   PhD                             Professor
82         Alan Levitan   PhD                             Professor
88   Benjamin P. Foster   PhD                             Professor
89 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy
> df1
               ProfName
1     Blandford,K     
2        Okafor,A     
3      Johnston,S     
4         Rolen,R     
5       Attaway,A     
6           Xin,H     
7         Huang,Y     
8         Stout,W     
9    Williamson,R     
10     Callahan,C     
11       Foster,B     
12         Blum,L     
13      Levitan,A     
14 Barr-Pulliam,D     
15         Wade,M     
>