Grouping anonymized data by string similarity

Problem description

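I am working with anonymized data in which destinations may be misspelled. I only observe the anonymized keys of the destination and the origin, but I know that the origin is correct.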

origin<-c("Norway","Norway","Sweden","Sweden")
destination_typed<-c("Germany","Gerrmany","Spain","Spaiin")
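# stringsAsFactors=FALSE keeps destination as character, so it can be overwritten with new keys later (matters on R < 4.0)
df<-data.frame(origin=origin,destination=destination_typed,stringsAsFactors=FALSE)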
df

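I also have data on the string similarity of the destinations. Again, I only observe the anonymized keys of the countries and a score (how similar they are). So I do not know which spelling is correct, i.e. I am equally happy with Spain and Spaiin as long as they end up in the same group (dest_key_for_spain).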

library(dplyr)
df_names<-expand.grid(destination_typed=destination_typed,destination_alternatives=destination_typed,stringsAsFactors = FALSE) %>% 
  arrange(destination_typed) %>% 
  mutate(similarity_score=stringdist::stringsim(destination_typed,destination_alternatives))
df_names

What I want is to group the anonymized destinations together (e.g. when the similarity score is > 0.5), i.e.:

df_wanted<-data.frame(origin=c("Norway","Sweden"),destination=c("dest_key_for_germany","dest_key_for_spain"))
df_wanted

Update: since my data is in fact anonymized, it actually looks like this:

# using anonymized data:
df$destination[df$destination=="Germany"]<-"###123A"
df$destination[df$destination=="Gerrmany"]<-"#KL237#"
df_names$destination_typed[df_names$destination_typed=="Germany"]<-"###123A"
df_names$destination_typed[df_names$destination_typed=="Gerrmany"]<-"#KL237#"
df_names$destination_alternatives[df_names$destination_alternatives=="Germany"]<-"###123A"
df_names$destination_alternatives[df_names$destination_alternatives=="Gerrmany"]<-"#KL237#"
df$destination[df$destination=="Spain"]<-"##957KA"
df$destination[df$destination=="Spaiin"]<-"KLU##ab"
df_names$destination_typed[df_names$destination_typed=="Spain"]<-"##957KA"
df_names$destination_typed[df_names$destination_typed=="Spaiin"]<-"KLU##ab"
df_names$destination_alternatives[df_names$destination_alternatives=="Spain"]<-"##957KA"
df_names$destination_alternatives[df_names$destination_alternatives=="Spaiin"]<-"KLU##ab"

df
df_names
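
One possible way to get the grouping described above is to treat every destination key as a node, connect two keys whenever their similarity score exceeds the threshold, and take each connected component as one group. The following is only a minimal sketch under some assumptions: it uses the igraph package (not mentioned in the question), the column name dest_group and the labels dest_key_1, dest_key_2, ... are made up for illustration, and it rebuilds the readable example data so that it runs on its own. Since only equality of keys matters, the same steps should apply unchanged to the anonymized keys.

```r
library(dplyr)

# Rebuild the example inputs from the question (readable spellings;
# anonymized keys would work the same way, they are just opaque labels).
origin <- c("Norway", "Norway", "Sweden", "Sweden")
destination_typed <- c("Germany", "Gerrmany", "Spain", "Spaiin")
df <- data.frame(origin = origin, destination = destination_typed,
                 stringsAsFactors = FALSE)
df_names <- expand.grid(destination_typed = destination_typed,
                        destination_alternatives = destination_typed,
                        stringsAsFactors = FALSE) %>%
  mutate(similarity_score = stringdist::stringsim(destination_typed,
                                                  destination_alternatives))

# Keep only the pairs above the threshold; these become the graph edges.
edges <- df_names %>%
  filter(similarity_score > 0.5) %>%
  select(destination_typed, destination_alternatives)

# Assumption: igraph is available. Each connected component of the
# undirected graph is treated as one group of spellings.
g <- igraph::graph_from_data_frame(edges, directed = FALSE)
membership <- igraph::components(g)$membership

# Lookup table from key to group label (dest_group is a hypothetical name).
lookup <- data.frame(destination = names(membership),
                     dest_group = paste0("dest_key_", unname(membership)),
                     stringsAsFactors = FALSE)

# Attach the group label to the observations and collapse duplicates.
df_grouped <- df %>%
  left_join(lookup, by = "destination") %>%
  distinct(origin, dest_group)
df_grouped
```

Two caveats on this design. Connected components are transitive, so if A is similar to B and B to C, all three land in one group even when A and C score below the threshold; with noisy scores a stricter threshold or a clustering method may be preferable. Also, a key only receives a group if it appears in at least one retained pair. Here every key is paired with itself (score 1), but if the real similarity table omits self-pairs, keys without a close match would come out as NA after the join.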
