Problem description
I have prepared the following test code:
####TESTING HERE
test = tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
library(stringdist)
output <- list()
for (row in 1:nrow(test))
{
codephon = phonetic(test[row,],method = c("soundex"),useBytes = FALSE)
output[[row]] <- codephon
}
#building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten=str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): konnte Funktion "str_split_fixed" nicht finden
soundexmatrix0 = gsub('[()c"]','',soundexspalten)
#> Error in gsub("[()c\"]","",soundexspalten): Objekt 'soundexspalten' nicht gefunden
soundexmatrix1 = gsub("0000","",soundexmatrix0)
#> Error in gsub("0000","",soundexmatrix0): Objekt 'soundexmatrix0' nicht gefunden
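Created on 2021-06-03 by the reprex package (v2.0.0)
Now I would like to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates can be seen at a glance.
Any ideas on how to do this? Thanks for your help!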
Solution
To check for duplicates within each row (see Update), this should achieve your goal in a cleaner way:
# Feel free to load the packages you're using.
# library(stringdist)
# library(tibble)
test <- tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
# Get phonetic codes cleanly.
result <- as.matrix(apply(
  X = test, MARGIN = 2, FUN = stringdist::phonetic,
  method = c("soundex"), useBytes = FALSE
))
# Find all blank codes ("0000").
blanks <- result == "0000"
# # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
# all_duplicates <- !blanks & duplicated(result, MARGIN = 0)
# Find duplicates, as compared within EACH ROW; ignore blank codes.
row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))
# Replace blank codes ("0000") with blanks (""); and replace duplicates (found
# within rows) with "DUPLICATE".
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"
# View result.
result
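which should yield the following matrix:
     Name1  Name2       Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
Update
Per the poster's request, I have changed the code to compare duplicates only within each row, rather than across the entire matrix. Now, a dataset like this test
test <- tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre",
  "",               "01234 56789",    "Heiko Knaup"
# | ^^              | ^^^^^^^^^^^^^   | ^^^^^^^^^^^^^                   |
# | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
)
will give a result in which the two third-row entries coded as "0000" become blanks ("") and the third row's "Heiko Knaup" code is kept as is, because it repeats only elsewhere in the matrix, not within its own row.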
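The difference between the whole-matrix comparison and the row-wise comparison can be checked in isolation. The following is only a minimal sketch in base R: the matrix `codes` is hypothetical and hard-coded (it is assumed to be what the Soundex step would return for a three-row dataset, so that stringdist is not needed to run it), and it is not meant as a replacement for the code above.

```r
# Hypothetical Soundex codes, hard-coded so the sketch runs without stringdist.
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246",
    "0000", "0000", "H225"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

# Blank codes are never flagged as duplicates.
blanks <- codes == "0000"

# Whole-matrix comparison: every element is compared against all earlier
# elements (scanned column by column).
all_duplicates <- !blanks & duplicated(codes, MARGIN = 0)

# Row-wise comparison: apply() returns one column per row, hence the t().
row_duplicates <- !blanks & t(apply(codes, 1, duplicated))

all_duplicates[3, "Name3"]  # TRUE:  "H225" already occurs in row 1
row_duplicates[3, "Name3"]  # FALSE: "H225" occurs only once in row 3

# Same replacement step as in the answer, using the row-wise mask.
result <- codes
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"
result
```

With the row-wise mask, only the second column of rows 1 and 2 is replaced by "DUPLICATE"; the "H225" in row 3 survives because nothing else in that row matches it.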