问题描述
我正在尝试通过将凌乱的站点名称列表与批准的列表进行匹配来清理数据库。
例如,首选站点名称可能是“ Cotswold Water Park Pit 28”,但该站点已以“ Pit 28”,“ 28”,“ cwp Pit 28”和“ Cotswold 28”输入数据库。 '。
数据看起来像这样:
approved <- c("Cotswold Water Park Pit 28","Cotswold Water Park Pit 14","Robinswood Hill")
messy <- c("Pit 28","28","cwp Pit 28","Cotswold 28","14","Robinswood")
我正在寻找一种方法,以将messy
中每个元素中的单词/数字与approved
中每个元素中的单词/数字进行匹配。理想情况下,我会得到这样的结果:
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "cwp Pit 28" NA
[3,] "cwp Pit 28" "14" NA
[4,] "Cotswold 28" NA NA
approved
元素构成列名,并且messy
中包含匹配单词/数字的任何元素都出现在该列的单元格中。我知道会有一些错误的比赛。很好,我可以稍后手动过滤它们,并且可以从模式匹配中排除“森林”和“山坡”等常见词。
通过使用messy
拆分regex
中的每个元素,我已经可以使用上述示例数据获得所需的结果,但是随后我要处理列表中的单词/数字列表网站名称,而我一直不得不使用嵌套循环或sapply
将它们与批准的元素匹配,因为grep
,grepl
和str_detect
之类的功能仅允许一种模式。由于数据库很大,因此将其应用于整个过程已经花费了很长时间。我真正想要的是一个功能:
match(any word in approved[1],any word in messy[1])
要么给我一个TRUE FALSE
的输出,要么提取messy[1]
(如果匹配)会很棒!
解决方法
也许您正在寻找adist
:
x <- adist(messy,approved,fixed=FALSE,ignore.case = TRUE)
y <- t(adist(approved,messy,ignore.case = TRUE))
i <- x == apply(x,1,min)
y[!i] <- NA
colnames(y) <- approved
i <- apply(y == apply(y,min,na.rm=TRUE),2,function(i) messy[i & !is.na(i)])
do.call(cbind,lapply(i,function(x) x[seq_len(max(lengths(i)))]))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "14" "Robinswood"
#[2,] "28" NA NA
#[3,] "CWP Pit 28" NA NA
#[4,] "Cotswold 28" NA NA
,
基本R选项为:
result <- sapply(approved,function(x) grep(gsub('\\s+','|',x),value = TRUE))
result
#$`Cotswold Water Park Pit 28`
#[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
#$`Cotswold Water Park Pit 14`
#[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
#$`Robinswood Hill`
#[1] "Robinswood"
这里的逻辑是,我们在|
中的每个空白处插入管道(approved
)符号,如果有匹配的单词,则在messy
中返回该单词。
要获得与所示格式相同的输出,可以执行以下操作:
sapply(result,`[`,1:max(lengths(result)))
# Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
#[1,] "Pit 28" "Pit 28" "Robinswood"
#[2,] "28" "CWP Pit 28" NA
#[3,] "CWP Pit 28" "Cotswold 28" NA
#[4,] "Cotswold 28" "14" NA
,
tidyverse / tidytext解决方案
首先将它们转换为数据帧
require(tidyverse)
require(tidytext)
## create dataframe for approved
approved <- c("Cotswold Water Park Pit 28","Cotswold Water Park Pit 14","Robinswood Hill")
## create dataframe for messy
messy <- c("Pit 28","28","CWP Pit 28","Cotswold 28","14","Robinswood")
然后使用tidytext将它们拆分为1个单词= 1行,我希望在行数发生变化时添加ID ...
## split into words
approved_df <-
tibble(approved = approved) %>%
rownames_to_column('approved_id') %>%
unnest_tokens(words,'words',drop = FALSE)
approved_df %>% head
# A tibble: 6 x 3
# approved_id approved words
# <chr> <chr> <chr>
# 1 1 Cotswold Water Park Pit 28 cotswold
# 2 1 Cotswold Water Park Pit 28 water
# 3 1 Cotswold Water Park Pit 28 park
# 4 1 Cotswold Water Park Pit 28 pit
# 5 1 Cotswold Water Park Pit 28 28
# 6 2 Cotswold Water Park Pit 14 cotswold
messy_df <-
tibble(messy = messy) %>%
rownames_to_column('messy_id') %>%
unnest_tokens(words,drop = FALSE)
messy_df %>% head
# # A tibble: 6 x 3
# messy_id messy words
# <chr> <chr> <chr>
# 1 1 Pit 28 pit
# 2 1 Pit 28 28
# 3 2 28 28
# 4 3 CWP Pit 28 cwp
# 5 3 CWP Pit 28 pit
# 6 3 CWP Pit 28 28
最后,在单词级别将两个数据框连接起来,计算重叠的单词数,然后为每个“混乱”字符串分配一个“批准的”
## join the data sets and rank by the number of words in the overlap
messy_df %>% left_join(approved_df) %>%
group_by(messy,messy_id,approved_id) %>%
summarise(n_row = n()) %>%
ungroup %>%
group_by(messy,messy_id) %>%
mutate(approved_rank = rank(desc(n_row))) %>%
ungroup %>%
filter(approved_rank == 1) %>%
arrange(messy_id)
# Joining,by = "words"
# # A tibble: 6 x 6
# messy messy_id approved approved_id n_row approved_rank
# <chr> <chr> <chr> <chr> <int> <dbl>
# 1 Pit 28 1 Cotswold Water Park Pit 28 1 2 1
# 2 28 2 Cotswold Water Park Pit 28 1 1 1
# 3 CWP Pit 28 3 Cotswold Water Park Pit 28 1 2 1
# 4 Cotswold 28 4 Cotswold Water Park Pit 28 1 2 1
# 5 14 5 Cotswold Water Park Pit 14 2 1 1
# 6 Robinswood 6 Robinswood Hill 3 1 1
,
这是一个高度灵活的regex_join解决方案
library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ","|",messy.df$messy )
#regex join
ans <- regex_full_join( approved.df,messy.df,by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans),messy~approved,value.var = "messy")[,-1]
# Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
# 1: 14 <NA> <NA>
# 2: <NA> 28 <NA>
# 3: CWP Pit 28 CWP Pit 28 <NA>
# 4: Cotswold 28 Cotswold 28 <NA>
# 5: Pit 28 Pit 28 <NA>
# 6: <NA> <NA> Robinswood
,
这里是使用stringi
的一种可能性(比stringr
更快,并且通常比基数R regex操作更快。此解决方案返回一个列表,当长度可变时,该列表应比矩阵更有效
library(stringi)
messy_ors <- stri_replace_all(messy," ","|")
lapply(approved,function(x) messy[stri_detect(x,regex = messy_ors)])
$`Cotswold Water Park Pit 28`
[1] "Pit 28" "28" "CWP Pit 28" "Cotswold 28"
$`Cotswold Water Park Pit 14`
[1] "Pit 28" "CWP Pit 28" "Cotswold 28" "14"
$`Robinswood Hill`
[1] "Robinswood"
如果您确实需要矩阵,则可以使用以下内容转换输出:
n <- max(lengths(out))
sapply(out,function(x) x[1:n])
,
我不确定我的以下尝试是否符合您的目的
res <- within(
expand.grid(messy,approved),matched <- do.call(
function(...) lengths(mapply(intersect,...)) > 0,unname(expand.grid(strsplit(messy," "),strsplit(approved," ")))
)
)
给予
Var1 Var2 matched
1 Pit 28 Cotswold Water Park Pit 28 TRUE
2 28 Cotswold Water Park Pit 28 TRUE
3 CWP Pit 28 Cotswold Water Park Pit 28 TRUE
4 Cotswold 28 Cotswold Water Park Pit 28 TRUE
5 14 Cotswold Water Park Pit 28 FALSE
6 Robinswood Cotswold Water Park Pit 28 FALSE
7 Pit 28 Cotswold Water Park Pit 14 TRUE
8 28 Cotswold Water Park Pit 14 FALSE
9 CWP Pit 28 Cotswold Water Park Pit 14 TRUE
10 Cotswold 28 Cotswold Water Park Pit 14 TRUE
11 14 Cotswold Water Park Pit 14 TRUE
12 Robinswood Cotswold Water Park Pit 14 FALSE
13 Pit 28 Robinswood Hill FALSE
14 28 Robinswood Hill FALSE
15 CWP Pit 28 Robinswood Hill FALSE
16 Cotswold 28 Robinswood Hill FALSE
17 14 Robinswood Hill FALSE
18 Robinswood Robinswood Hill TRUE
如果您想在帖子中显示输出,则可以在res
上玩一些技巧,例如,
res2 <- do.call(
cbind,lapply(
u <- with(subset(res,matched),split(Var1,Var2)),function(x) `length<-`(as.vector(x),max(lengths(u)))
)
)
这样
> res2
Cotswold Water Park Pit 28 Cotswold Water Park Pit 14 Robinswood Hill
[1,] "Pit 28" "Pit 28" "Robinswood"
[2,] "28" "CWP Pit 28" NA
[3,] "CWP Pit 28" "Cotswold 28" NA
[4,] "Cotswold 28" "14" NA