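Grouping two data frames and using stringdist

Problem description

I want to use stringdist_join to fuzzy-match US counties by decade. Because county names change over time, I want to match against the correct county names for each decade.

If I write: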

stringdist_join(mispelled,correct,by=c('decade','county'))

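then stringdist_join also fuzzy-matches on decade, for example matching 1960 to 1970. What I actually want is to treat the decade variable as correct and fuzzy-match only on county.

I can see that I need to group both the misspelled and the correct data frames by decade and then run the join separately within each decade, but I don't know how to do that. A tidyverse solution would be preferred.

Thanks!

Solution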

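Ultimately, I think what you are looking for is for max_dist to accept a vector of distances, so that you could call stringdist_inner_join(..., max_dist = c(0, 2)). Unfortunately, although this was requested in 2017 (https://github.com/dgrtwo/fuzzyjoin/issues/36 and https://github.com/dgrtwo/fuzzyjoin/issues/21), it does not appear to have been implemented.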
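One possible way to approximate per-column tolerances, not part of the original answer, is fuzzyjoin's more general fuzzy_inner_join, whose match_fun argument is documented as accepting a list with one function per column in by. The sketch below assumes that behaviour. The misspelled and correct data frames are hypothetical stand-ins for the asker's data, and the distance threshold of 2 is an arbitrary choice.

```r
library(fuzzyjoin)
library(stringdist)

# Hypothetical stand-ins for the asker's data frames
misspelled <- data.frame(
  decade = c(1960, 1960, 1970),
  county = c("Jeferson", "Maddison", "Jeferson"),
  stringsAsFactors = FALSE
)
correct <- data.frame(
  decade = c(1960, 1960, 1970, 1970),
  county = c("Jefferson", "Madison", "Jefferson", "Monroe"),
  stringsAsFactors = FALSE
)

# One match function per column in `by`:
# exact equality for decade, string distance of at most 2 for county
matched <- fuzzy_inner_join(
  misspelled, correct,
  by = c("decade", "county"),
  match_fun = list(
    `==`,
    function(a, b) stringdist(a, b) <= 2
  )
)
matched
```

Because both data frames share the column names, the result should carry decade.x, county.x, decade.y and county.y, with decade.x equal to decade.y in every row.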

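If you can afford a larger intermediate join result, a workaround is to allow the fuzzy match on decade and then filter out the rows where decade was not matched exactly.

Lacking your data, I'll use ggplot2::diamonds to demonstrate. Here I want the normal stringdist behaviour on cut and an exact match on clarity.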

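library(fuzzyjoin)

# "VeryGood" is listed twice so that cut has 6 values, matching clarity and type
d <- data.frame(cut = c("Idea","Premiums","Premioom","VeryGood","VeryGood","Faiir"),
                clarity = rep(c("SI1","SI2"),3),type = 1:6)
data("diamonds",package = "ggplot2")
diamonds <- diamonds[1:10,]

# Fuzzy join on both columns, so clarity is matched inexactly as well
joined <- stringdist_inner_join(diamonds,d,by = c("cut","clarity"))
joined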
# # A tibble: 8 x 13
#   carat cut.x     color clarity.x depth table price     x     y     z cut.y    clarity.y  type
#   <dbl> <ord>     <ord> <ord>     <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>    <chr>     <int>
# 1 0.23  Ideal     E     SI2        61.5    55   326  3.95  3.98  2.43 Idea     SI1           1
# 2 0.21  Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premiums SI2           2
# 3 0.21  Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premioom SI1           3
# 4 0.290 Premium   I     VS2        62.4    58   334  4.2   4.23  2.63 Premiums SI2           2
# 5 0.26  Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI2           4
# 6 0.26  Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI1           5
# 7 0.22  Fair      E     VS2        65.1    61   337  3.87  3.78  2.49 Faiir    SI2           6
# 8 0.23  Very Good H     VS1        59.4    61   338  4     4.05  2.39 VeryGood SI1           5

subset(joined,clarity.x == clarity.y)
# # A tibble: 2 x 13
#   carat cut.x     color clarity.x depth table price     x     y     z cut.y    clarity.y  type
#   <dbl> <ord>     <ord> <ord>     <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>    <chr>     <int>
# 1  0.21 Premium   E     SI1        59.8    61   326  3.89  3.84  2.31 Premioom SI1           3
# 2  0.26 Very Good H     SI1        61.9    55   337  4.07  4.11  2.53 VeryGood SI1           5
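
The per-decade grouping described in the question can also be sketched directly, which avoids the larger intermediate result by never comparing rows from different decades. This is a minimal sketch, not the answer's method: the misspelled and correct data frames are hypothetical stand-ins for the asker's data, and max_dist = 2 is an arbitrary choice.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical stand-ins for the asker's data frames
misspelled <- data.frame(
  decade = c(1960, 1960, 1970),
  county = c("Jeferson", "Maddison", "Jeferson"),
  stringsAsFactors = FALSE
)
correct <- data.frame(
  decade = c(1960, 1960, 1970, 1970),
  county = c("Jefferson", "Madison", "Jefferson", "Monroe"),
  stringsAsFactors = FALSE
)

# Only decades present in both tables can produce matches
decades <- intersect(misspelled$decade, correct$decade)

# Run the fuzzy join on county within one decade at a time, then stack the pieces
matched <- lapply(decades, function(d) {
  stringdist_inner_join(
    filter(misspelled, decade == d),
    select(filter(correct, decade == d), -decade),  # decade is kept from the left side
    by = "county",
    max_dist = 2
  )
}) %>%
  bind_rows()
matched
```

Here county.x holds the misspelled name and county.y the name it was matched to within the same decade.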
