How can I return a list of string pairs from a large matrix whose members satisfy a maximum string-distance criterion?

Question

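I am trying to present human-entered words in a way that makes it easier to spot groups of them that refer to the same thing. Essentially, a spell checker. I have built a large matrix (the real one is roughly 250 x 250). The code for that matrix is the same as in the reproducible example below. (I have filled the example with randomly generated words; the real values are more meaningful but confidential.)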

strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","hollow","quince","pack","twig","knot")

library(stringdist)

matrix <- stringdistmatrix(strings,strings,useNames = TRUE)

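Now I want to create a new table with two variables. The first column must contain the pairs of elements of "strings" whose string distance is smaller than some number, for example (stringdist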

I have a feeling this will need some kind of apply function, but I'm not sure.

Cheers.

Solution

The following tidyverse-based solution should do the trick.

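Note that the last line is only there to make the result easier to read. I don't think it is necessary for your purposes. If you do want to keep it, I'd suggest folding it into the initial creation of "pair".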

library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)

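matrix %>%
  as_tibble() %>%
  # as_tibble() drops row names, so restore them as a column
  # (the matrix is square, so the column names match the row names)
  mutate(X = colnames(.),.before = 1) %>%
  # one row per (X, name) cell, with the distance in `value`
  pivot_longer(-X) %>%
  # keep distances 1 to 7; 0 is excluded, which removes self-pairs
  filter(value %in% 1:7) %>%
  # sort each pair so that (a, b) and (b, a) become identical
  transmute(pair = map2(X,name,~ sort(c(.x,.y))),stringDist = value) %>%
  distinct(pair,stringDist) %>%
  # display only: collapse each pair into a single "a_b" string
  mutate(pair = map_chr(pair,~ str_c(.,collapse = '_')))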

# A tibble: 451 x 2
#    pair                   stringDist
#    <chr>                       <dbl>
#  1 domineering_sweltering          6
#  2 curl_root                       4
#  3 curl_gratis                     6
#  4 curl_secretary                  7
#  5 cumbersome_curl                 7
#  6 curl_oval                       3
#  7 curl_mighty                     6
#  8 curl_thaw                       4
#  9 curl_troubled                   6
# 10 curl_furniture                  7
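
As a rough sketch of the suggestion above (building the "a_b" label when "pair" is first created, rather than in a final mutate), something like the following might work. It is an illustration only: the four-word vector and the name dist_mat are chosen here for the example and are not part of the original question, and map2_chr() replaces the map2() plus map_chr() combination.

```r
library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)

# Assumption: a small subset of the question's words, for illustration only
strings <- c("curl", "card", "oval", "root")
dist_mat <- stringdistmatrix(strings, strings, useNames = TRUE)

pairs <- dist_mat %>%
  as_tibble() %>%
  mutate(X = colnames(dist_mat), .before = 1) %>%
  pivot_longer(-X) %>%
  filter(value %in% 1:7) %>%
  # sort the two words and paste them together in one step,
  # so (a, b) and (b, a) produce the same label
  transmute(
    pair = map2_chr(X, name, ~ str_c(sort(c(.x, .y)), collapse = "_")),
    stringDist = value
  ) %>%
  distinct(pair, stringDist)

pairs
```

Because the label is already a plain character column, distinct() compares strings rather than list elements, and no final clean-up step is needed.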

This also works:

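library(reshape2)  # assumed source of melt() for matrices
library(dplyr)     # %>% and filter()

# Zero out the lower triangle so each pair is kept only once
matrix[lower.tri(matrix)] <- 0
matrix_melt <- melt(matrix)
matrix_melt %>%
    filter(value %in% 1:7)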
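
For comparison, the same upper-triangle idea can be sketched in base R without any packages. This is only an illustration under some assumptions: utils::adist() (generalized Levenshtein distance) stands in for stringdistmatrix(), so the distances may differ from the stringdist defaults; the names dist_mat and max_dist, the four-word vector, and the threshold of 3 are all hypothetical.

```r
# Assumption: a small subset of the question's words, for illustration only
strings <- c("curl", "card", "oval", "root")

# Assumption: adist() is used here in place of stringdistmatrix()
dist_mat <- adist(strings)
dimnames(dist_mat) <- list(strings, strings)

max_dist <- 3  # hypothetical threshold

# upper.tri() excludes the diagonal and the lower triangle,
# so each unordered pair is considered exactly once
keep <- upper.tri(dist_mat) & dist_mat <= max_dist
idx <- which(keep, arr.ind = TRUE)

pairs <- data.frame(
  first = strings[idx[, "row"]],
  second = strings[idx[, "col"]],
  stringDist = dist_mat[idx],
  stringsAsFactors = FALSE
)

pairs
```

Unlike the melt() version, this leaves the distance matrix unmodified and keeps the two words of each pair in separate columns, which may be easier to join back onto other data.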