Correlation distance metric and sum of squared errors

Problem description

I couldn't find a way in scikit-learn to use the correlation distance metric with K-Means, which my gene expression dataset requires.

While searching the internet, though, I found a great library, biopython, which can run K-Means with the correlation distance metric.

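However, unlike with scikit-learn, I can't get the inertia (the sum of squared errors), so I can't use the "elbow method" to choose the best number K of clusters. The only option is an "error" value, which is "the within-cluster sum of distances", not squared: https://biopython.org/docs/1.75/api/Bio.Cluster.html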

How can I do both: use the correlation distance metric and get the SSE?

Solution

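The sum of squared errors is easier to implement yourself than the correlation distance metric, so I suggest using biopython together with the following helper function. Given the data (assumed to be a numpy array) and biopython's clusterid output, it should compute the sum of squared errors for you.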
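For reference, the quantity the helper is meant to compute can be written as follows. The symbols are introduced here only for illustration and do not come from biopython: C_k is the set of rows assigned to cluster k, x_i is row i of the data, mu_k is the mean of the rows in C_k, and K is the number of clusters.

```latex
\mathrm{SSE} = \sum_{k=0}^{K-1} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
```

The deviations here are Euclidean, which is also how scikit-learn defines inertia. The correlation distance only determines how the items are assigned to clusters.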

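import numpy as np

def SSE(data, clusterid):
    """
    Computes the sum of squared errors (SSE) of a clustering.

    Arguments:
        data: nrows x ncolumns array containing the data values.
        clusterid: array containing the number of the cluster to which each item was assigned by biopython.
    """
    data = np.asarray(data, dtype=float)
    clusterid = np.asarray(clusterid)

    number_of_classes = int(clusterid.max()) + 1  # cluster numbers start at 0

    sse = 0.0
    for i in range(number_of_classes):
        cluster = data[clusterid == i]
        if len(cluster) == 0:
            continue  # skip cluster numbers that have no members
        # Squared deviations of each row from the cluster centroid (column-wise mean)
        sse += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return sse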
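
To connect this back to the elbow method from the question, here is a minimal sketch of how the SSE could be collected for several values of K. It assumes biopython's `Bio.Cluster.kcluster` with `dist="c"` (Pearson correlation) does the clustering. The names `cluster_sse`, `correlation_kmeans` and `sse_curve`, as well as the choice `npass=10`, are my own for illustration and are not part of biopython. The block carries its own small SSE helper so that it runs on its own.

```python
import numpy as np


def cluster_sse(data, clusterid):
    """Sum of squared deviations of each row from the mean row of its cluster."""
    data = np.asarray(data, dtype=float)
    clusterid = np.asarray(clusterid)
    total = 0.0
    for label in np.unique(clusterid):
        members = data[clusterid == label]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


def correlation_kmeans(data, k, npass=10):
    """K-Means with Pearson correlation distance (hypothetical wrapper)."""
    # Imported lazily so the SSE helper stays usable without biopython.
    from Bio.Cluster import kcluster

    # kcluster returns the cluster assignments, the within-cluster sum of
    # distances, and how often the best solution was found.
    clusterid, error, nfound = kcluster(data, nclusters=k, dist="c", npass=npass)
    return np.asarray(clusterid)


def sse_curve(data, k_values, cluster_fn=correlation_kmeans):
    """Map each k to the SSE of the clustering that cluster_fn returns for it."""
    return {k: cluster_sse(data, cluster_fn(data, k)) for k in k_values}
```

With a real dataset you would call something like `sse_curve(data, range(2, 11))`, plot the values against K, and look for the bend. Passing the clustering step in as `cluster_fn` is only a convenience for swapping in another clustering routine.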

biopython k-means metrics python scikit-learn