Problem description
I have two data frames, data1 and data2, with the following contents:
dput(data1)
structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam", "Lisa M. Blum", "Russell Williamson", "William D. Stout", "Michael F. Wade", "Sheila A. Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan", "Benjamin P. Foster", "Carolyn M. Callahan"),
Title = c(" PhD", " PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD", " PhD", " PhD", " PhD"),
Profession = c("Assistant Professor", "Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director", "Instructor", "Instructor", "Associate Professor", "Professor", "Professor", "Professor", "Brown-Forman Professor of Accountancy")),
row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L, 82L, 88L, 89L), class = "data.frame")
dput(data2)
structure(list(ProfName = c("Blandford,K ", "Okafor,A ", "Johnston,S ", "Rolen,R ", "Attaway,A ", "Xin,H ", "Huang,Y ", "Stout,W ", "Williamson,R ", "Callahan,C ", "Foster,B ", "Blum,L ", "Levitan,A ", "Barr-Pulliam,D ", "Wade,M ")),
row.names = c(NA, -15L), class = "data.frame")
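I want to merge the two data frames, but the names are written differently in each. Only part of the string in the ProfName column (the surname) matches between the two data frames. The data should be merged, and where there is no information for a name, the fields should be left empty. If there is no information in the Title and Profession columns, the ProfName and New columns should hold the same name.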
I tried merge, but it does not produce the desired output:
merge(data1, data2, by = "ProfName", all.x = TRUE, all.y = TRUE)
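The reason the plain merge() call pairs nothing up can be seen on a small sample. The sketch below uses a three-row toy subset of each data frame (an assumption made for brevity, not the full data) and shows that the two ProfName columns share no exact values, so an outer join simply stacks the rows.

```r
# Toy subset of the two data frames from the question (three rows each).
data1 <- data.frame(
  ProfName = c("Alan Attaway", "Lisa M. Blum", "Julie Huang"),
  Title = c("PhD", "LLM", "PhD"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  ProfName = c("Attaway,A ", "Blum,L ", "Okafor,A "),
  stringsAsFactors = FALSE
)

# merge() compares key values for exact equality; none are shared here.
shared <- intersect(data1$ProfName, data2$ProfName)

# With all = TRUE every row survives unmatched, so the result has 3 + 3 rows.
naive <- merge(data1, data2, by = "ProfName", all = TRUE)
n_rows <- nrow(naive)
```

Any fix therefore needs a derived key that both data frames can agree on, such as the surname.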
Solution
Here is a simple solution:
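library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)  # provides the %<>% assignment pipe
# Surname key: last run of letters/hyphens in data1, first run in data2
data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))
# all.y = TRUE keeps every row of data2, whether or not it has a match
df <- merge(data1, data2, all.y = TRUE, by = "lname")
head(df)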
#   lname         ProfName.x           Title  Profession                             ProfName.y
# 1 Attaway       Alan Attaway         PhD    Professor                              Attaway,A
# 2 Barr-Pulliam  Dereck Barr-Pulliam  PhD    Assistant Professor                    Barr-Pulliam,D
# 3 Blandford     <NA>                 <NA>   <NA>                                   Blandford,K
# 4 Blum          Lisa M. Blum         LLM    Instructor                             Blum,L
# 5 Callahan      Carolyn M. Callahan  PhD    Brown-Forman Professor of Accountancy  Callahan,C
# 6 Foster        Benjamin P. Foster   PhD    Professor                              Foster,B
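To see what the two patterns pick out, here is a minimal base R sketch of the same key extraction. It swaps str_extract() for regmatches() with regexpr() so that it runs without extra packages, and it uses three sample names from each data frame. The hyphen is placed last inside the bracket so that it is read literally.

```r
full_names  <- c("Hua (Christine) Xin", "Dereck Barr-Pulliam", "Lisa M. Blum")
short_names <- c("Xin,H ", "Barr-Pulliam,D ", "Okafor,A ")

# Last run of letters/hyphens at the end of the string: the surname.
key_full <- regmatches(full_names, regexpr("[A-Za-z-]+$", full_names))

# First run of letters/hyphens at the start: everything before the comma.
key_short <- regmatches(short_names, regexpr("^[A-Za-z-]+", short_names))
```

Because the key is the surname alone, initials are ignored: "Julie Huang" is paired with "Huang,Y" in the data above, and two people sharing a surname would be paired with each other's rows.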
This works:
> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$','\\2',ProfName))) %>%
+ right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(,.)','\\1',ProfName))) %>% rename(new = ProfName)) %>%
+ mutate(ProfName = coalesce(ProfName,new)) %>%
+ select(-secName)
Joining, by = "secName"
   ProfName             Title  Profession                             new
1  Hua (Christine) Xin  PhD    Assistant Professor                    Xin,H
2  Dereck Barr-Pulliam  PhD    Assistant Professor                    Barr-Pulliam,D
3  Lisa M. Blum         LLM    Instructor                             Blum,L
4  Russell Williamson   PhD    Assistant Professor                    Williamson,R
5  William D. Stout     PhD    Associate Professor and Director       Stout,W
6  Michael F. Wade      CPA    Instructor                             Wade,M
7  Sheila A. Johnston   MS     Instructor                             Johnston,S
8  Julie Huang          PhD    Associate Professor                    Huang,Y
9  Alan Attaway         PhD    Professor                              Attaway,A
10 Alan Levitan         PhD    Professor                              Levitan,A
11 Benjamin P. Foster   PhD    Professor                              Foster,B
12 Carolyn M. Callahan  PhD    Brown-Forman Professor of Accountancy  Callahan,C
13 Blandford,K          <NA>   <NA>                                   Blandford,K
14 Okafor,A             <NA>   <NA>                                   Okafor,A
15 Rolen,R              <NA>   <NA>                                   Rolen,R
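For readers without dplyr, the same idea can be sketched in base R: merge() with all.y = TRUE stands in for right_join(), and ifelse() on NA stands in for coalesce(). The data here is a small toy subset, and the simplified sub() patterns are an assumption chosen to give the same keys on these names, not the answer's exact expressions.

```r
df <- data.frame(
  ProfName = c("Alan Attaway", "Lisa M. Blum"),
  Title = c("PhD", "LLM"),
  Profession = c("Professor", "Instructor"),
  stringsAsFactors = FALSE
)
df1 <- data.frame(
  ProfName = c("Blandford,K ", "Attaway,A ", "Blum,L "),
  stringsAsFactors = FALSE
)

# Surname key: last word on the left, text before the comma on the right.
df$secName  <- trimws(sub(".*\\s", "", df$ProfName))
df1$secName <- trimws(sub(",.*$", "", df1$ProfName))
names(df1)[names(df1) == "ProfName"] <- "new"

# all.y = TRUE keeps every row of df1, matched or not.
out <- merge(df, df1, by = "secName", all.y = TRUE)

# Where no long name matched, fall back to the short name.
out$ProfName <- ifelse(is.na(out$ProfName), out$new, out$ProfName)
out$secName <- NULL
```

Unlike the join above, merge() sorts the result by the key, so the rows come out in surname order rather than in the order of either input.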
Data used:
> df
   ProfName             Title  Profession
8  Hua (Christine) Xin  PhD    Assistant Professor
18 Dereck Barr-Pulliam  PhD    Assistant Professor
25 Lisa M. Blum         LLM    Instructor
36 Russell Williamson   PhD    Assistant Professor
49 William D. Stout     PhD    Associate Professor and Director
50 Michael F. Wade      CPA    Instructor
56 Sheila A. Johnston   MS     Instructor
69 Julie Huang          PhD    Associate Professor
71 Alan Attaway         PhD    Professor
82 Alan Levitan         PhD    Professor
88 Benjamin P. Foster   PhD    Professor
89 Carolyn M. Callahan  PhD    Brown-Forman Professor of Accountancy
> df1
   ProfName
1  Blandford,K
2  Okafor,A
3  Johnston,S
4  Rolen,R
5  Attaway,A
6  Xin,H
7  Huang,Y
8  Stout,W
9  Williamson,R
10 Callahan,C
11 Foster,B
12 Blum,L
13 Levitan,A
14 Barr-Pulliam,D
15 Wade,M