Hierarchical clustering on kmeans centers in R

Problem description

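I have a huge data set (200,000 rows * 40 columns), where each row represents an observation and each column is a variable. I would like to run hierarchical clustering on this data. Unfortunately, because of the large number of rows, I cannot do this on my computer: it would require computing the distance matrix for all pairs of observations, that is, a (200,000 * 200,000) matrix.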
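To put a number on that memory cost, here is a rough back-of-the-envelope sketch. It assumes the distances are stored as 8-byte doubles and that only the lower triangle is kept, as R's dist() does; the variable names are purely illustrative.

```r
# Rough memory estimate for a dist object over n observations.
# Assumption: one 8-byte double per unordered pair (lower triangle only).
n <- 200000
n_pairs <- n * (n - 1) / 2        # number of pairwise distances
gigabytes <- n_pairs * 8 / 1e9    # bytes converted to GB (10^9 bytes)
gigabytes                         # about 160 GB
```

A full square (200,000 * 200,000) matrix would need roughly twice that.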

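The answer to this question suggests first using kmeans to compute a number of centers, and then performing hierarchical clustering on the coordinates of those centers using the library FactoMineR.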

The problem: whenever I apply the same method, I get an error message!

#example
library(FactoMineR)

# Data
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1), ncol = 2))
colnames(MyData) <- c("x", "y")

# Reduce the data to 1000 kmeans centers, then cluster the centers hierarchically
kClust_MyData <- kmeans(MyData, 1000, iter.max = 20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(Hclust_MyData, choice = "tree")

Solution

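The function hclust.vector from the package fastcluster does not require a distance matrix as input. Instead, it computes the distances itself in a more memory-efficient way. From the fastcluster manual: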

The call
hclust.vector(X, method='single', metric=[...])
is equivalent to
hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast.
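As a minimal sketch of how this could be combined with the kmeans step from the question: the data size, the number of centers (200), the Ward method, the cut into 2 groups, and all variable names below are assumptions chosen for illustration, not part of the original question or answer.

```r
library(fastcluster)

set.seed(42)  # only to make the illustration reproducible

# Synthetic two-blob data, smaller than in the question (assumption)
X <- rbind(matrix(rnorm(20000, sd = 0.3), ncol = 2),
           matrix(rnorm(20000, mean = 1, sd = 0.3), ncol = 2))
colnames(X) <- c("x", "y")

# Step 1: compress the observations into kmeans centers
km <- kmeans(X, centers = 200, iter.max = 50)

# Step 2: hierarchical clustering directly on the center coordinates;
# no distance matrix is passed in. `members` gives each center's cluster size.
hc <- hclust.vector(km$centers, method = "ward", members = km$size)

# Step 3: cut the tree and map the center labels back to the original rows
center_groups <- cutree(hc, k = 2)
row_groups <- center_groups[km$cluster]
table(row_groups)
```

The result is a regular hclust object, so plot(hc) and cutree() work as usual. Note that hclust.vector also accepts the raw data matrix, so the kmeans step may be unnecessary if run time is acceptable.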