Problem description

I have prepared the following test code:

#### TESTING HERE
test = tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)

library(stringdist)
output <- list()
for (row in 1:nrow(test)) {
  codephon = phonetic(test[row, ], method = c("soundex"), useBytes = FALSE)
  output[[row]] <- codephon
}

# building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten = str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): could not find function "str_split_fixed"
soundexmatrix0 = gsub('[()c"]', '', soundexspalten)
#> Error in gsub("[()c\"]", "", soundexspalten): object 'soundexspalten' not found
soundexmatrix1 = gsub("0000", "", soundexmatrix0)
#> Error in gsub("0000", "", soundexmatrix0): object 'soundexmatrix0' not found

Created on 2021-06-03 by the reprex package (v2.0.0)

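Now I would like to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates can be seen at a glance.

Any ideas how to do this? Thanks for your help!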

Solution

To check for duplicates within each row (see Update), this should accomplish your goal in a cleaner way:


# Feel free to load the packages you're using.
# library(stringdist)
# library(tibble)

test <- tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)

# Get phonetic codes cleanly.
result <- as.matrix(apply(X = test, MARGIN = 2, FUN = stringdist::phonetic,
                          method = c("soundex"), useBytes = FALSE))

# Find all blank codes ("0000").
blanks <- result == "0000"

# # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
# all_duplicates <- !blanks & duplicated(result, MARGIN = 0)

# Find duplicates, as compared within EACH ROW; ignore blank codes.
row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))

# Replace blank codes ("0000") with blanks (""); and replace duplicates (found
# within rows) with "DUPLICATE".
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"

# View result.
result

where result should be the following matrix:

     Name1  Name2       Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"

Update

Per the poster's request, I have changed the code to compare duplicates only within each row, rather than across the entire matrix. Now, a dataset test like

test <- tibble::tribble(
    ~Name1,           ~Name2,           ~Name3,
    "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
    "Ferdinand Bass", "Ferdinand Base", "Michael Herre",
    "",               "01234 56789",    "Heiko Knaup"
# | ^^              | ^^^^^^^^^^^^^   | ^^^^^^^^^^^^^                   |
# | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
)

will give a result like

     Name1  Name2       Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
[3,] ""     ""          "H225"
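
To make the difference between the two comparison scopes concrete, here is a minimal base-R sketch. It assumes a hard-coded matrix of Soundex codes in place of the output of stringdist::phonetic(), and the helper mark_duplicates() with its scope, blank, and label arguments is a hypothetical name introduced only for illustration, not part of the original answer. The whole-matrix branch uses duplicated() on the flattened matrix as a stand-in for the commented-out all_duplicates line.

```r
# Hard-coded Soundex codes (an assumption standing in for stringdist::phonetic()
# output); the third row mirrors the extra row of the updated `test` data.
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246",
    "0000", "0000", "H225"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

# Hypothetical helper: flag duplicates either within each row or across the
# whole matrix, always ignoring the blank code.
mark_duplicates <- function(codes, scope = c("row", "matrix"),
                            blank = "0000", label = "DUPLICATE") {
  scope <- match.arg(scope)
  blanks <- codes == blank
  dups <- if (scope == "row") {
    # Compare each code only with the codes to its left in the same row.
    t(apply(codes, 1, duplicated))
  } else {
    # Compare each code with every earlier code in column-major order.
    matrix(duplicated(as.vector(codes)), nrow = nrow(codes))
  }
  out <- codes
  out[blanks] <- ""
  out[!blanks & dups] <- label
  out
}

by_row    <- mark_duplicates(codes, scope = "row")
by_matrix <- mark_duplicates(codes, scope = "matrix")

# The two results differ only in the last cell of row 3: "H225" already occurs
# in row 1, so it counts as a duplicate across the matrix but not within its row.
by_row
by_matrix
```

Both variants keep the dimensions of the input matrix, which is what the question asks for. Note that the row-wise branch relies on t(apply(...)) returning a matrix, which holds for inputs with more than one column such as this one.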
