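Problem description
I am running into problems with encoding and partial matching.
I have two data frames, A and B. A is read in with UTF-8 encoding and B with latin1. I am not sure, but this may already be part of the problem. It is the only way I know to import them correctly.
Edit: I should clarify. This is only sample data. Both data frames contain many more rows and additional columns.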
A                           B
ID  Name         Expense    Employee           Category
1   Mike Adall   3          Lothar Fiend       B2
2   Brian Adams  4          Rohan Sudarsh      A2
3   Adrián       1          Adrián Silva       A1
4   Floyd Oid    1          Semi Ajayi         A1
5   Semi Ajayi   4          Micheal Adall      A1
6   Jomu Aké     3          Jomü Ria Aké       B1
                            Brian Adams        B2
                            Floyd Öid Matheus  B1
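I have been trying to take B$Employee and partially match it against A$Name, in order to create a new data frame C that includes B$Category. This is the output I want.
Edit: In C, I also want to include all the other columns from A and B, except Employee.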
C
ID Name Expense Category
1 Mike Adall 3 A1
2 Brian Adams 4 B2
3 Adrián 1 A1
4 Floyd Oid 1 B1
5 Semi Ajayi 4 A1
6 Jomu Aké 3 B1
So far I have managed to match about 80% using the fuzzyjoin package.
C <- A %>% fuzzy_inner_join(B,by = c(Name = "Employee"))
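The main problem seems to be the unusual Latin characters such as Ö, ß and so on, which sometimes also appear at the end of a name, as in "Aké". The results seem to vary from name to name.
How can I partially match all the names?
Solution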
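This approach returns only one match per name (column match), because which.min and max.col each return a single index, even when several candidates are tied at the minimum distance.
It is important to check ties manually. Ties can be inspected in column minMatchSeveral of data.frame res, or with the second script below.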
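To make the point about ties concrete, here is a minimal base-R sketch. The distance vector d is made up for illustration and is not taken from the sample data.

```r
# Hypothetical distances from one name to three candidates; the first two are tied.
d <- c(5, 5, 7)

first_min <- which.min(d)        # index of the first minimum only
all_min   <- which(d == min(d))  # indices of every candidate tied at the minimum

# max.col() also returns one column per row. Its default ties.method = "random"
# picks among ties at random, so "first" is used here to keep the result reproducible.
one_col <- max.col(-matrix(d, nrow = 1), ties.method = "first")

print(first_min)
print(all_min)
print(one_col)
```

Comparing all_min with first_min is essentially what the lenMin and minMatchSeveral columns do for each name.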
library(stringdist)
{
firstvector  <- A$Name
secondvector <- B$Employee
threshold    <- 14  # max 14 characters of divergence
n <- length(firstvector)
mindist <- rep(NA_real_, n)
lenMin  <- integer(n)
match <- minMatchSeveral <- sortedmatches <- rep(NA_character_, n)
for (i in seq_len(n)) {
  # distance from firstvector[i] to every candidate; several methods available
  matchdist <- stringdist::stringdist(firstvector[i], secondvector, method = "lcs")
  matchdist[which(matchdist > threshold)] <- NA  # drop candidates beyond the threshold
  # all candidates within the threshold, closest first
  sortedmatches[i] <- paste(secondvector[order(matchdist, na.last = NA)], collapse = ",")
  if (all(is.na(matchdist))) next  # no candidate within the threshold
  best <- which(matchdist == min(matchdist, na.rm = TRUE))  # every index tied at the minimum
  mindist[i] <- matchdist[best[1]]
  lenMin[i]  <- length(best)
  match[i]   <- secondvector[best[1]]  # same choice as which.min(): the first minimum
  # list all tied candidates so they can be checked by hand
  if (lenMin[i] > 1) minMatchSeveral[i] <- paste(secondvector[best], collapse = ",")
}
res <- data.frame(firstvector = firstvector, match = match, divergence = mindist, lenMin = lenMin, minMatchSeveral = minMatchSeveral, sortedmatches = sortedmatches, stringsAsFactors = FALSE)
}
res
firstvector match divergence lenMin minMatchSeveral sortedmatches
1 Mike Adall Micheal Adall 5 2 Micheal Adall,Micheol Adall Micheal Adall,Micheol Adall,Brian Adams,Semi Ajayi
2 Brian Adams Brian Adams 0 1 <NA> Brian Adams,Rohan Sudarsh,Micheal Adall,Adrián Silva,Semi Ajayi,Micheol Adall
3 Adrian Adrián Silva 8 1 <NA> Adrián Silva,Lothar Fiend,Jomü Ria Aké
4 Floyd Oid Floyd Öid Matheus 10 1 <NA> Floyd Öid Matheus,Lothar Fiend
5 Semi Ajayi Semi Ajayi 0 1 <NA> Semi Ajayi,Jomü Ria Aké
6 Jomu Aké Jomü Ria Aké 6 1 <NA> Jomü Ria Aké,Semi Ajayi
A$match <- match
# For large tables, consider data.table's merge method
C <- merge(A, B, by.x = "match", by.y = "Employee", all.x = TRUE)
C[, 2:ncol(C)]
ID Name Expense Category
1 3 Adrián 1 A1
2 2 Brian Adams 4 B2
3 4 Floyd Oid 1 B1
4 6 Jomu Aké 3 B1
5 1 Mike Adall 3 A1
6 5 Semi Ajayi 4 A1
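From ?"stringdist-metrics":
The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.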
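As a small worked instance of that definition, the sketch below computes the lcs distance for one pair from the sample data. It assumes the stringdist package is installed; the variable names are only illustrative.

```r
library(stringdist)

a <- "Mike Adall"
b <- "Micheal Adall"

# Only "k" in a and "c", "h", "a", "l" in b are left unpaired: 1 + 4 = 5.
d_lcs <- stringdist(a, b, method = "lcs")

# Identical strings have no unpaired characters.
d_same <- stringdist("Brian Adams", "Brian Adams", method = "lcs")

print(c(d_lcs, d_same))
```

The value for this pair agrees with the divergence reported for "Mike Adall" in the res table.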
In addition, you may want to look at stringi::stri_trans_general.
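A minimal sketch of what that could look like, assuming the stringi package is installed. Stripping accents from both A$Name and B$Employee before computing distances is a suggestion here, not something the steps above require.

```r
library(stringi)

# Names taken from the sample data, written with \u escapes so the
# example does not depend on the encoding of the script file.
employees <- c("Jom\u00fc Ria Ak\u00e9", "Floyd \u00d6id Matheus")

# "Latin-ASCII" is an ICU transliterator that replaces accented Latin letters
# with their closest ASCII equivalents.
ascii <- stri_trans_general(employees, "Latin-ASCII")
print(ascii)
```

With the accents removed, characters such as ü and é no longer count as mismatches against their plain counterparts.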
Edit: another way to visualize the ties
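{
# negated lcs distances: one row per A$Name, one column per B$Employee
mm <- -t(sapply(A$Name, stringdist::stringdist, B$Employee, method = "lcs"))
# best (least negative) value in each row
idx <- mm[cbind(seq_len(nrow(mm)), max.col(mm))]
# every row of B tied at that best value
ties <- lapply(seq_len(nrow(mm)), function(x) which(mm[x, ] %in% idx[x]))
# collapse the tied rows of B into one string per column
matched <- lapply(ties, function(x) paste(B[x, ]))
my <- do.call("rbind", matched)
colnames(my) <- c("closestMatch", "Category")
cbind(A, my)
}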
ID Name Expense closestMatch Category
1 1 Mike Adall 3 c("Micheal Adall","Micheol Adall") c("A1","A1")
2 2 Brian Adams 4 Brian Adams B2
3 3 Adrian 1 Adrián Silva A1
4 4 Floyd Oid 1 Floyd Öid Matheus B1
5 5 Semi Ajayi 4 Semi Ajayi A1
6 6 Jomu Aké 3 Jomü Ria Aké B1
Data
{
A<-read.table(text="ID Name Expense
1 \"Mike Adall\" 3
2 \"Brian Adams\" 4
3 \"Adrian\" 1
4 \"Floyd Oid\" 1
5 \"Semi Ajayi\" 4
6 \"Jomu Aké\" 3 ",header=T,stringsAsFactors = F)
B<-read.table(text="Employee Category
\"Lothar Fiend\" B2
\"Rohan Sudarsh\" A2
\"Adrián Silva\" A1
\"Semi Ajayi\" A1
\"Micheal Adall\" A1
\"Micheol Adall\" A1 # testing ties
\"Jomü Ria Aké\" B1
\"Brian Adams\" B2
\"Floyd Öid Matheus\" B1",header=T,stringsAsFactors = F)
}
In base R, you can use agrep and adist as follows:
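# approximate (agrep) matches of each name in B$Employee; integer(0) where none is found
d <- sapply(A$Name, agrep, B$Employee)
# names without an agrep match fall back to the candidate with the smallest edit distance
e <- names(Filter(Negate(length), d))
d[e] <- max.col(-adist(e, B$Employee))
cbind(A, B[unlist(d), ])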
ID Name Expense Employee Category
5 1 Mike Adall 3 Micheal Adall A1
7 2 Brian Adams 4 Brian Adams B2
3 3 Adrián 1 Adrián Silva A1
8 4 Floyd Oid 1 Floyd Öid Matheus B1
4 5 Semi Ajayi 4 Semi Ajayi A1
6 6 Jomu Aké 3 Jomü Ria Aké B1
Edit:
With the stringdist package, you can do:
cbind(A,B[max.col(-t(sapply(A$Name,stringdist::stringdist,B$Employee,"lcs"))),])
ID Name Expense Employee Category
5 1 Mike Adall 3 Micheal Adall A1
7 2 Brian Adams 4 Brian Adams B2
3 3 Adrián 1 Adrián Silva A1
8 4 Floyd Oid 1 Floyd Öid Matheus B1
4 5 Semi Ajayi 4 Semi Ajayi A1
6 6 Jomu Aké 3 Jomü Ria Aké B1