Clustering sequences in a network by edit distance - in R

Problem description

I have a data frame my_df containing 10,000 distinct sequences of varying length (between 13 and 18), each made up of the digits 0-3.

A sample of my data (60 rows):

library(stringdist)
library(igraph)
library(reshape2)

my_df <-
structure(list(alfa_ch = c("2000000232003211","2000000331021","20000003310320011","20000003323331021","20000003331001","20000003332021","200000100331021","20000013011001","20000013301021","2000001333331011","20000023231031","200000233302001","20000023331011","20000023331012","20000023332021","200000233331021","20000030231011","200000303323331021","200000313301021","20000032031021","2000003220021","2000003221011","2000003231031","20000032311001","200000330330021","2000003311211","2000003331001","2000003331012","20000033321012","200000333231011","20000033323331021","20000033331021","2000010320011","20000103323331021","200001113011001","20000113011001","20000120330021","20000123033011","2000012331131","2000013011001","2000013301021","200001330231011","2000013323001","20000133231311","20000133301001","200001333331011","20000200331021","20000200331131","20000203221011","2000020333133011","20000212221111","20000213301021","2000021331011","200002223231011")),row.names = c(1L,3L,5L,6L,7L,8L,9L,10L,12L,13L,14L,16L,17L,18L,19L,20L,21L,23L,24L,27L,29L,31L,32L,33L,34L,35L,38L,41L,42L,43L,46L,47L,48L,49L,58L,59L,60L,62L,63L,64L,66L,68L,71L,72L,73L,74L,75L,77L,78L,79L,80L,81L,82L,83L,84L,85L,89L,90L,91L,95L),class = "data.frame")

My goal is to cluster these sequences by edit distance, so I computed a pairwise distance matrix:

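# Pairwise Levenshtein distances between all sequences
dist_mtx <- as.matrix(stringdistmatrix(my_df$alfa_ch, my_df$alfa_ch, method = "lv"))
# Drop pairs that are more than 3 edits apart
dist_mtx[dist_mtx > 3] <- NA
# Drop zero distances (self-pairs); the original tested `new_test_2`, which is never defined, assumed here to be this matrix
dist_mtx[dist_mtx == 0] <- NA
# Label rows and columns with the sequences themselves
colnames(dist_mtx) <- my_df$alfa_ch
rownames(dist_mtx) <- my_df$alfa_ch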

Then I created an edge list, where the value is the edit distance between the two sequences in each pair:

edge_list <- unique(melt(dist_mtx, na.rm = TRUE, varnames = c("seq1", "seq2"), as.is = TRUE))
edge_list <- edge_list[!is.na(edge_list$value), ]

Then I created the igraph object:

igraph_obj <- igraph::graph_from_data_frame(edge_list, directed = FALSE, vertices = data.frame(name = my_df$alfa_ch))
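
A graph built from thresholded distances links sequences transitively: if A is within 3 edits of B, and B is within 3 edits of C, all three end up in the same connected component even when A and C are further apart. The following is a minimal sketch of that effect on three made-up sequences (they are not from the data above, and the names `seqs`, `g`, `comp`, and `max_dist_in_comp` are purely illustrative):

```r
library(stringdist)
library(igraph)

# Hypothetical toy sequences: the first and second are 3 edits apart,
# the second and third are 3 edits apart, the first and third are 6 apart.
seqs <- c("000000", "000111", "111111")

# Full pairwise Levenshtein distance matrix, labelled by sequence.
dist_mtx <- as.matrix(stringdistmatrix(seqs, method = "lv"))
dimnames(dist_mtx) <- list(seqs, seqs)

# Keep only pairs within 3 edits, listing each undirected pair once.
keep <- which(dist_mtx <= 3 & upper.tri(dist_mtx), arr.ind = TRUE)
edge_list <- data.frame(seq1  = seqs[keep[, "row"]],
                        seq2  = seqs[keep[, "col"]],
                        value = dist_mtx[keep])

g <- graph_from_data_frame(edge_list, directed = FALSE,
                           vertices = data.frame(name = seqs))

# Connected-component membership, named by vertex.
comp <- components(g)$membership

# Largest pairwise distance found inside each component.
max_dist_in_comp <- tapply(seq_along(seqs), comp[seqs],
                           function(idx) max(dist_mtx[idx, idx]))
print(max_dist_in_comp)
```

All three sequences share one component whose end points are 6 edits apart, so grouping by connectivity alone does not bound the distance between members of a group.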

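I then tried several ways of clustering these sequences with the Louvain method, but I still get clusters whose members are more than 3 edits apart. I understand this is probably because of connected components. So my questions are:

  1. Is there a way to cluster the sequences so that, within each cluster, the edit distance between members stays within the limit?
  2. Is there a way to identify cluster centers (hubs), I tried hubness.score(), and then assign vertices to those centers while taking edit distance into account?

This is my first post; I would appreciate any help.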

Solution
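
One possible direction, offered as a sketch rather than a confirmed answer: complete-linkage hierarchical clustering (a different technique from Louvain) merges two clusters at the largest pairwise distance between them, so cutting the tree at a height of 3 keeps every within-cluster distance at or below 3. For the second question, a medoid (the member with the smallest total distance to the rest of its cluster) can stand in for a cluster center; this is a substitute for hub scores, not the same thing. The limit of 3 is assumed from the filter in the question, the sequences are a few copied from the sample data, and the names `seqs`, `cluster_id`, `cluster_diameter`, and `cluster_medoid` are illustrative.

```r
library(stringdist)

# A few sequences copied from the sample data in the question.
seqs <- c("2000000331021", "20000003331001", "20000003332021",
          "200000100331021", "20000013301021", "20000023331011",
          "20000023331012", "2000003331001", "2000003331012",
          "2000013301021")

# Pairwise Levenshtein distances; with a single argument,
# stringdistmatrix() returns a "dist" object that hclust() accepts.
d <- stringdistmatrix(seqs, method = "lv")

# Complete linkage: the merge height is the largest distance between
# any two members, so cutting at h = 3 bounds within-cluster distance by 3.
hc <- hclust(d, method = "complete")
cluster_id <- cutree(hc, h = 3)

# Square matrix form, labelled by sequence, for per-cluster summaries.
dist_mtx <- as.matrix(d)
dimnames(dist_mtx) <- list(seqs, seqs)

# Largest within-cluster distance for each cluster (0 for singletons).
cluster_diameter <- tapply(seq_along(seqs), cluster_id,
                           function(idx) max(dist_mtx[idx, idx]))

# Medoid of each cluster: the member with the smallest summed distance
# to the other members of the same cluster.
cluster_medoid <- tapply(seq_along(seqs), cluster_id, function(idx) {
  sub <- dist_mtx[idx, idx, drop = FALSE]
  seqs[idx][which.min(rowSums(sub))]
})

print(data.frame(sequence = seqs, cluster = cluster_id))
print(cluster_diameter)
print(cluster_medoid)
```

Complete linkage is greedy, so it satisfies the distance limit without necessarily producing the fewest possible clusters. For the full set of 10,000 sequences the distance object holds roughly 50 million pairs, which may call for a machine with enough memory or for clustering in batches.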
