Problem description
I have a data frame "dfa" of affiliations that contain city names. Some rows lack a country name, for example row 4 (Baghdad) and row 7 (Berlin):
dfa <- data.frame(affiliation = c(
  "DEPARTMENT OF PHARMACY,AMSTERDAM UNIVERSITY,AMSTERDAM,THE NETHERLANDS",
  "DEPARTMENT OF BIOCHEMISTRY,LADY HARDINGE MEDICAL COLLEGE,NEW DELHI,INDIA.",
  "DEPARTMENT OF PATHOLOGY,CHILDREN'S HOSPITAL,LOS ANGELES,UNITED STATES",
  "COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD.",
  "DEPARTMENT OF CLINICAL LABORATORY,BEIJING GENERAL HOSPITAL,BEIJING,CHINA.",
  "LABORATORY OF MOLECULAR BIOLOGY,ISTITUTO ORTOPEDICO,MILAN,ITALY.",
  "DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN",
  "INSTITUTE OF LABORATORY MEDICINE,UNIVERSITY HOSPITAL,MUNICH,GERMANY.",
  "DEPARTMENT OF CLINICAL PATHOLOGY,MAHIDOL UNIVERSITY,BANGKOK,THAILAND.",
  "DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN",
  "DEPARTMENT OF MOLECULAR BIOLOGY,MINISTRY OF HEALTH,TEHRAN,IRAN.",
  "LABORATORY OF CARDIOVASCULAR DISEASE,FUWAI HOSPITAL,CHINA."
))
I also have a second data frame "dfb" that lists cities and their countries, some of which appear in "dfa":
dfb <- data.frame(
  city = c("AGRI", "AMSTERDAM", "ATHENS", "AUCKLAND", "BUENOS AIRES", "BEIJING", "BAGHDAD", "BANGKOK", "BERLIN", "BUDAPEST"),
  country = c("TURKEY", "NETHERLANDS", "GREECE", "NEW ZEALAND", "ARGENTINA", "CHINA", "IRAQ", "THAILAND", "GERMANY", "HUNGARY")
)
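How can I add the city and its country as two new columns, only for cities that appear in both "dfa" and "dfb" (even when the country is missing from the affiliation, as for Baghdad and Berlin)?
Note: the goal is to match complete city names, not parts of them. Row 7 below is an example of an unwanted result: the city AGRI (TURKEY) is wrongly attached to the BERLIN row because that row contains the word "AGRICULTURE".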
affiliation city country
1 DEPARTMENT OF PHARMACY,THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY,INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY,UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY,CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY,ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE,BERLIN AGRI TURKEY
8 INSTITUTE OF LABORATORY MEDICINE,GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY,THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY,JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY,IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE,CHINA. BEIJING CHINA
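Solution
Combining str_extract with a join (or with another str_extract) is one way to get there. str_extract returns the first match it finds in each string, and paste0 collapses the city names into one long "or" pattern to search for.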
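library(dplyr)
library(stringr)
# Build one alternation pattern, wrapping every city in word boundaries
pattern <- paste0("\\b", dfb$city, "\\b", collapse = "|")
dfa %>%
  mutate(city = str_extract(affiliation, pattern)) %>%  # first whole-word city found
  left_join(dfb, by = "city")                           # look up the matching country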
Edit: added word boundaries in paste0 so that only whole city names match and partial matches are avoided.
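To make the effect of the word boundaries concrete, here is a minimal sketch that builds the pattern both ways and applies it to row 7 of dfa. The names cities, row7, loose and strict are illustrative only and do not appear in the answer above.

```r
library(stringr)

# Two of the cities from dfb, and row 7 of dfa
cities <- c("AGRI", "BERLIN")
row7 <- "DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN"

# Without word boundaries, AGRI matches inside AGRICULTURE
loose <- paste0(cities, collapse = "|")
loose_hit <- str_extract(row7, loose)

# With \\b on both sides of every city, only whole words can match
strict <- paste0("\\b", cities, "\\b", collapse = "|")
strict_hit <- str_extract(row7, strict)

print(c(loose = loose_hit, strict = strict_hit))
```

Placing "\\b" on both sides of each city before collapsing with "|" matters: if the closing boundary is supplied only through the collapse separator, the last city in the list ends up without one.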
affiliation city country
1 DEPARTMENT OF PHARMACY,AMSTERDAM UNIVERSITY,AMSTERDAM,THE NETHERLANDS AMSTERDAM NETHERLANDS
2 DEPARTMENT OF BIOCHEMISTRY,LADY HARDINGE MEDICAL COLLEGE,NEW DELHI,INDIA. <NA> <NA>
3 DEPARTMENT OF PATHOLOGY,CHILDREN'S HOSPITAL,LOS ANGELES,UNITED STATES <NA> <NA>
4 COLLEGE OF EDUCATION FOR PURE SCIENCE,UNIVERSITY OF BAGHDAD. BAGHDAD IRAQ
5 DEPARTMENT OF CLINICAL LABORATORY,BEIJING GENERAL HOSPITAL,BEIJING,CHINA. BEIJING CHINA
6 LABORATORY OF MOLECULAR BIOLOGY,ISTITUTO ORTOPEDICO,MILAN,ITALY. <NA> <NA>
7 DEPARTMENT OF AGRICULTURE,BERLIN INSTITUTE OF HEALTH,BERLIN BERLIN GERMANY
8 INSTITUTE OF LABORATORY MEDICINE,UNIVERSITY HOSPITAL,MUNICH,GERMANY. <NA> <NA>
9 DEPARTMENT OF CLINICAL PATHOLOGY,MAHIDOL UNIVERSITY,BANGKOK,THAILAND. BANGKOK THAILAND
10 DEPARTMENT OF BIOLOGY,WASEDA UNIVERSITY,TOKYO,JAPAN <NA> <NA>
11 DEPARTMENT OF MOLECULAR BIOLOGY,MINISTRY OF HEALTH,TEHRAN,IRAN. <NA> <NA>
12 LABORATORY OF CARDIOVASCULAR DISEASE,FUWAI HOSPITAL,CHINA. BEIJING CHINA
Another approach accounts for the possibility that an affiliation matches more than one city name.
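library(tidyverse)
dfa %>%
  # try every city against each affiliation; \\b keeps AGRI from matching AGRICULTURE
  mutate(city = map(affiliation, ~ str_extract(.x, paste0("\\b", dfb$city, "\\b")))) %>%
  unnest(cols = c(city)) %>%
  group_by(affiliation) %>%
  mutate(nmatches = sum(!is.na(city))) %>%
  # keep every matched city, or a single NA row when nothing matched
  filter((nmatches > 0 & !is.na(city)) | (nmatches == 0 & row_number() == 1)) %>%
  ungroup() %>%
  left_join(dfb, by = "city") %>%
  # does the affiliation also mention the matched city's country?
  mutate(country_match = str_detect(affiliation, country))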
# A tibble: 12 x 5
affiliation city nmatches country country_match
<chr> <chr> <int> <chr> <lgl>
1 DEPARTMENT OF PHARMACY,… AMSTE… 1 NETHER… TRUE
2 DEPARTMENT OF BIOCHEMIS… NA 0 NA NA
3 DEPARTMENT OF PATHOLOGY… NA 0 NA NA
4 COLLEGE OF EDUCATION FO… BAGHD… 1 IRAQ FALSE
5 DEPARTMENT OF CLINICAL … BEIJI… 1 CHINA TRUE
6 LABORATORY OF MOLECULAR… NA 0 NA NA
7 BERLIN INSTITUTE OF HEA… BERLIN 1 GERMANY FALSE
8 INSTITUTE OF LABORATORY… NA 0 NA NA
9 DEPARTMENT OF CLINICAL … BANGK… 1 THAILA… TRUE
10 DEPARTMENT OF BIOLOGY,… NA 0 NA NA
11 DEPARTMENT OF MOLECULAR… NA 0 NA NA
12 LABORATORY OF CARDIOVAS… BEIJI… 1 CHINA TRUE
You can then double-check the cases with nmatches == 1 and country_match == FALSE. When an affiliation has 2 or more nmatches, you can keep the row with country_match == TRUE.
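The review step described above could be sketched as follows. The data frame res is a small hypothetical result in the same shape as the tibble shown earlier (the affiliations "A", "B" and "C" are made up), and to_review and resolved are illustrative names.

```r
library(dplyr)

# Hypothetical result: the last affiliation matched two cities
res <- data.frame(
  affiliation   = c("A,BEIJING,CHINA", "B,UNIVERSITY OF BAGHDAD",
                    "C,BERLIN,ATHENS,GREECE", "C,BERLIN,ATHENS,GREECE"),
  city          = c("BEIJING", "BAGHDAD", "BERLIN", "ATHENS"),
  nmatches      = c(1L, 1L, 2L, 2L),
  country       = c("CHINA", "IRAQ", "GERMANY", "GREECE"),
  country_match = c(TRUE, FALSE, FALSE, TRUE)
)

# Single match whose country is absent from the affiliation: check by hand
to_review <- res %>% filter(nmatches == 1, !country_match)

# Several matches: keep only the city whose country is also mentioned
resolved <- res %>% filter(nmatches < 2 | country_match)

print(to_review)
print(resolved)
```

One caveat with this sketch: an affiliation that matches several cities but mentions none of their countries is dropped entirely from resolved, so those rows would need separate handling.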