Removing rows with similar but not identical strings in R

Problem description

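I have a large number of Word files imported into R as text (each report in one cell), and each file has an ID.

I then used the distinct function from dplyr to remove the duplicates.

However, some reports are essentially the same but differ slightly (e.g. a few extra or missing words, extra spaces), so dplyr does not treat them as duplicates. Is there an efficient way to remove "highly similar" items in R?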

This creates a sample dataset (a very simplified version of the original data I am working with):

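d = structure(list(ID = 1:8, text = c(
"The properties of plastics depend on the chemical composition of the subunits,the arrangement of these subunits,and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"The properties of plastics depend on the chemical composition of the subunits,the arrangement of these subunits,and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))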

This is the dplyr code that removes exact duplicates. However, you will notice that items 3, 7 and 8 are almost identical:

library(dplyr)

d %>% 
  distinct(text, .keep_all = TRUE) %>% 
  View()

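It looks like dplyr has a like function, but I could not find how to apply it properly here (it also seems to work only for short strings such as words); see the question "dplyr filter() with SQL-like %wildcard%".

There is also a package, tidystringdist, that can calculate how similar two strings are, but I could not find a way to apply it here to remove the items that are similar but not identical. https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html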

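Any suggestions or guidance on this?

Update:

It looks like the stringdist package can solve this, as suggested by the users below.

This question on the RStudio community site deals with a similar problem, although the desired output is a bit different. I applied their code to my data and was able to identify the similar items. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2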

library(tidystringdist)
library(tidyverse)

# First remove any exact duplicates (no View() here, so d keeps the data):
d <- d %>% 
  distinct(text, .keep_all = TRUE)

# This identifies the similar ones and places them in a vector called match:
match <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match

# Create a negated version of %in%:
`%!in%` <- Negate(`%in%`)

# This removes everything in `match` from `d`:
d2 <- d %>% 
  filter(text %!in% match) %>% 
  arrange(text)

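With the code above, d2 contains no duplicates or near-duplicates at all, but I would like to keep one copy of each.

Any ideas on how to keep one copy (e.g. only the first occurrence)?

Solution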

library(stringdist)

# Keep the first occurrence of each exact text, as a character vector
dd <- d[["text"]][!duplicated(d[["text"]])]
dd
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits,the arrangement of these subunits,and the processing method."
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[5] "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."

unname( sapply(dd,stringdist,dd,method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0

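The distances scale with the length of the strings, so a fixed cutoff is not equally strict for short and long strings. For this case, though, an upper limit of 20 looks sufficient. A proper solution would use the ratio of the distance to the nchar of that vector element.

This is not offered as a final solution, but rather as steps 1 and 2 of a 4-step process.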
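One way the remaining steps might look is sketched below: divide each distance by the length of the longer string, group texts whose ratio falls under a cutoff, and keep the first row of each group. The 0.1 cutoff, the single-linkage grouping via hclust and cutree, and the names ratio, groups and d_unique are assumptions made for illustration, not part of the answer above. The data frame holds the five distinct texts from the question with their original IDs.

```r
library(stringdist)

# The five distinct texts from the question, with their original IDs.
d <- data.frame(
  ID = c(1L, 2L, 3L, 7L, 8L),
  text = c(
    "The properties of plastics depend on the chemical composition of the subunits,the arrangement of these subunits,and the processing method.",
    "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
    "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
    "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
    "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
  ),
  stringsAsFactors = FALSE
)

# Pairwise Damerau-Levenshtein distances between all texts.
dist_mat <- stringdistmatrix(d$text, d$text, method = "dl")

# Divide each distance by the length of the longer string of the pair,
# so the cutoff is a proportion instead of a raw edit count.
len <- nchar(d$text)
ratio <- dist_mat / outer(len, len, pmax)

# Group the texts with single-linkage clustering on the ratios.
# The cutoff of 0.1 is an assumed value; tune it for real data.
groups <- cutree(hclust(as.dist(ratio), method = "single"), h = 0.1)

# Keep only the first row of each group of near-duplicates.
d_unique <- d[!duplicated(groups), ]
d_unique$ID
```

With this cutoff, rows 7 and 8 land in the same group as row 3, so only IDs 1, 2 and 3 remain. Single linkage chains matches together (if A is close to B and B is close to C, all three share a group), which may or may not be what you want for longer documents.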

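I believe you are looking for this package: fuzzyjoin.

It offers many fuzzy distance functions, but essentially two entries are "similar" if the fuzzy distance between them is small.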
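As a rough sketch of how fuzzyjoin could be applied here, the data can be joined to itself with stringdist_inner_join, and any row that matches an earlier row can then be dropped. The cutoff of 20 is borrowed from the distance matrix in the other answer; the names pairs, later_ids and d_unique are made up for this example, and the data frame holds the five distinct texts from the question.

```r
library(fuzzyjoin)

# The five distinct texts from the question, with their original IDs.
d <- data.frame(
  ID = c(1L, 2L, 3L, 7L, 8L),
  text = c(
    "The properties of plastics depend on the chemical composition of the subunits,the arrangement of these subunits,and the processing method.",
    "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
    "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
    "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
    "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined,it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
  ),
  stringsAsFactors = FALSE
)

# Self-join: pair each text with every text within 20 edits of it.
# Columns shared by both sides are suffixed with .x and .y.
pairs <- stringdist_inner_join(
  d, d,
  by = "text",
  method = "dl",
  max_dist = 20,          # assumed cutoff, taken from the matrix above
  distance_col = "dist"
)

# A row counts as a near-duplicate if it matches a row with a smaller ID.
later_ids <- unique(pairs$ID.y[pairs$ID.x < pairs$ID.y])

# Drop the later copies and keep the first occurrence of each text.
d_unique <- d[!d$ID %in% later_ids, ]
d_unique$ID
```

Because the comparison uses ID order, the copy that survives is whichever one has the smallest ID. A raw cutoff such as 20 suits long reports; for short strings a proportional cutoff is likely the safer choice.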
