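Problem description
I am trying to present manually entered words in a way that makes groups of them easier to recognise as referring to the same thing. Essentially, it is a spell checker. I have built a large matrix (the real one is roughly 250 x 250). The code that builds it is the same as in the reproducible example below. (I filled the example with words from a random word generator; the real values are more meaningful but confidential.)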
library(stringdist)  # provides stringdistmatrix()
strings <- c("domineering","curl","axiomatic","root","gratis","secretary","lopsided","cumbersome","oval","mighty","thaw","troubled","furniture","round","soak","callous","melted","wealthy","sweltering","verdant","fence","eyes","ugliest","card","quickest","harm","brake","alarm","report","glue","hollow","quince","pack","twig","knot")
# Square matrix of pairwise string distances, labelled with the strings themselves
matrix <- stringdistmatrix(strings,strings,useNames = TRUE)
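Now I want to create a new table with two variables. The first column must contain the pairs of elements of "strings" whose string distance is below some number, for example (stringdist
Cheers.
Solution
The following tidyverse-based solution should do the trick.
Note that the last line is only there to make the result easier to read. I don't think it is necessary for your purposes. If you do want to keep it, I suggest folding it into the initial creation of "pair".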
library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)
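matrix %>%
  as_tibble() %>%
  mutate(X = colnames(.), .before = 1) %>%   # the matrix is square, so column names double as row labels
  pivot_longer(-X) %>%                       # one row per (X, name) cell of the matrix
  filter(value %in% 1:7) %>%                 # keep distances from 1 to 7; 0 (a word against itself) is dropped
  transmute(pair = map2(X, name, ~ sort(c(.x, .y))), stringDist = value) %>%
  distinct(pair, stringDist) %>%             # sorting makes (a, b) and (b, a) identical, so each pair is kept once
  mutate(pair = map_chr(pair, ~ str_c(., collapse = '_')))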
# A tibble: 451 x 2
#    pair                   stringDist
#    <chr>                       <dbl>
#  1 domineering_sweltering          6
#  2 curl_root                       4
#  3 curl_gratis                     6
#  4 curl_secretary                  7
#  5 cumbersome_curl                 7
#  6 curl_oval                       3
#  7 curl_mighty                     6
#  8 curl_thaw                       4
#  9 curl_troubled                   6
# 10 curl_furniture                  7
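The note about folding the last line into the initial creation of "pair" could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses a four-word subset of the example strings, names the matrix `dist_matrix` (my own name, chosen to avoid masking base R's `matrix()` function), and swaps `map2()` for `map2_chr()` so that the sorted pair is collapsed into a single string straight away.

```r
library(stringdist)
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)
library(stringr)

# Small subset of the example words, for illustration only
strings <- c("curl", "root", "oval", "thaw")

# dist_matrix is a hypothetical name; the question calls this object `matrix`
dist_matrix <- stringdistmatrix(strings, strings, useNames = TRUE)

pairs <- dist_matrix %>%
  as_tibble() %>%
  mutate(X = colnames(.), .before = 1) %>%
  pivot_longer(-X) %>%
  filter(value %in% 1:7) %>%
  # Sort each pair and collapse it to "a_b" in one step,
  # so no list column and no final mutate() are needed
  transmute(
    pair = map2_chr(X, name, ~ str_c(sort(c(.x, .y)), collapse = "_")),
    stringDist = value
  ) %>%
  distinct(pair, stringDist)

print(pairs)
```

Because "pair" is a plain character column from the start, `distinct()` compares strings rather than list elements, which may also be a little easier to inspect while debugging.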
This also works:
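library(reshape2)  # assumption: melt() is taken from reshape2, the answer does not say
library(dplyr)
# Zero out the lower triangle so each pair shows up only once
matrix[lower.tri(matrix)] <- 0
# Long format: one row per cell, with columns Var1, Var2 and value
matrix_melt <- melt(matrix)
matrix_melt %>%
  filter(value %in% 1:7)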
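The same idea of looking at only one triangle of the matrix can also be sketched without reshaping packages, using `which(..., arr.ind = TRUE)` from base R. This is not taken from either answer. The names `dist_matrix` and `max_dist` are hypothetical, and the word list is again a small subset of the example strings.

```r
library(stringdist)

# Small subset of the example words, for illustration only
strings <- c("curl", "root", "oval", "thaw")
dist_matrix <- stringdistmatrix(strings, strings, useNames = TRUE)

# Hypothetical threshold; the answers above keep distances from 1 to 7
max_dist <- 7

# Row/column indices of the upper-triangle cells within the threshold.
# upper.tri() excludes the diagonal, so a word is never paired with itself.
# Unlike the 1:7 filter, exact duplicates in the input would appear with distance 0.
idx <- which(upper.tri(dist_matrix) & dist_matrix <= max_dist, arr.ind = TRUE)

pairs <- data.frame(
  pair = paste(rownames(dist_matrix)[idx[, "row"]],
               colnames(dist_matrix)[idx[, "col"]],
               sep = "_"),
  stringDist = dist_matrix[idx],  # a two-column index matrix picks out individual cells
  stringsAsFactors = FALSE
)

print(pairs)
```

Here the two words in each pair keep the order they have in `strings` rather than being sorted alphabetically, so the labels can differ from those produced by the tidyverse version.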