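How do I save cluster assignments while calculating bias and prevent them from being overwritten in the next iteration?

Problem description

I am implementing an algorithm that calculates the bias of each cluster and then splits the cluster with the highest bias into new clusters. Ultimately, I want to find the clusters with the highest bias, which means the classifier makes either more or fewer errors on those instances than on the rest.

This is the algorithm: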

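1. Start with the whole dataset as one cluster.
2. Split it into two clusters with KMeans.
3. Calculate the macro F1 score of each cluster.
4. Calculate the bias of these two clusters. The bias is: F1-score_cluster_k - F1-score of all clusters excluding cluster k.
5. If max(bias_cluster_i, bias_cluster_j) >= bias_previous_cluster: add cluster_i and cluster_j to the list and remove the previous cluster.
6. Continue clustering with the cluster from cluster_list that has the highest standard deviation of the error metric.
7. Split that cluster into 2 clusters with KMeans and continue from step 3.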
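
To make the bias in step 4 concrete, here is a minimal sketch in plain Python, so it runs without any libraries. The helper names `f1_for_class`, `macro_f1` and `cluster_bias` and the toy labels are made up for illustration. The macro average is taken over the classes that occur in the true labels of each subset, which is an assumption rather than the only possible choice.

```python
def f1_for_class(true, pred, cls):
    """F1 score of one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    fp = sum(t != cls and p == cls for t, p in zip(true, pred))
    fn = sum(t == cls and p != cls for t, p in zip(true, pred))
    if tp == 0:
        # Without true positives, precision or recall is 0 (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(true, pred):
    """Unweighted mean of the per-class F1 scores over the classes in `true`."""
    classes = sorted(set(true))
    return sum(f1_for_class(true, pred, c) for c in classes) / len(classes)


def cluster_bias(true, pred, clusters, k):
    """Macro F1 inside cluster k minus macro F1 of all other instances."""
    inside = [(t, p) for t, p, c in zip(true, pred, clusters) if c == k]
    outside = [(t, p) for t, p, c in zip(true, pred, clusters) if c != k]
    return macro_f1(*zip(*inside)) - macro_f1(*zip(*outside))


# Toy data: cluster 0 is classified perfectly, cluster 1 contains two errors
true_class      = [0, 1, 0, 1, 0, 1, 0, 1]
predicted_class = [0, 1, 0, 1, 1, 1, 0, 0]
cluster         = [0, 0, 0, 0, 1, 1, 1, 1]

bias_0 = cluster_bias(true_class, predicted_class, cluster, 0)
bias_1 = cluster_bias(true_class, predicted_class, cluster, 1)
print(bias_0, bias_1)  # 0.5 -0.5
```

With only two clusters the two biases are always each other's negation, so the maximum in step 5 simply picks the cluster on which the classifier does better.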

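For this algorithm to work, I need to save the cluster assignments and F-scores from the previous iteration so that I can compare them in the current iteration (step 5).

One solution I came up with is to save the cluster assignments as a new column in a Pandas DataFrame and then compare this column with the new cluster assignments, but is there a better way to prevent these cluster assignments from being overwritten?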
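
The column idea could look roughly like the sketch below. It uses a plain dictionary instead of a DataFrame so that it stays self-contained: each iteration writes its labels under a new key (the equivalent of a new column such as `cluster_iter_1`), so earlier assignments are never overwritten. The key names, the toy labels and the helpers `save_assignment` and `changed_rows` are hypothetical.

```python
# One entry per iteration, mirroring one DataFrame column per iteration
assignments = {}


def save_assignment(history, iteration, labels):
    """Store a copy of the labels under a new key so later edits cannot alter it."""
    history[f"cluster_iter_{iteration}"] = list(labels)


def changed_rows(history, earlier, later):
    """Row positions whose cluster label differs between two saved iterations."""
    old = history[f"cluster_iter_{earlier}"]
    new = history[f"cluster_iter_{later}"]
    return [row for row, (a, b) in enumerate(zip(old, new)) if a != b]


labels = [0, 0, 0, 1, 1, 1]            # iteration 1: two clusters
save_assignment(assignments, 1, labels)

labels[3] = 2                          # iteration 2: part of cluster 1 becomes cluster 2
labels[4] = 2
save_assignment(assignments, 2, labels)

print(assignments["cluster_iter_1"])   # [0, 0, 0, 1, 1, 1], still intact
print(changed_rows(assignments, 1, 2)) # [3, 4]
```

The important detail is the copy made in `save_assignment`: storing a reference to the same list (or the same column) would let the next iteration change the saved values as well.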

This is my code:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

data = load_wine()
df_data = pd.DataFrame(data.data,columns=data.feature_names)
df_target = pd.DataFrame(data = data.target)

# Merging the datasets into one dataframe
all_data = df_data.merge(df_target,left_index=True,right_index=True)
all_data.rename( columns={0 :'target_class'},inplace=True )
all_data.head()

# Dividing X and y into train and test data (small train data to gain more errors)
X_train,X_test,y_train,y_test = train_test_split(df_data,df_target,test_size=0.60,random_state=2)

# Training a RandomForest Classifier 
model = RandomForestClassifier()
model.fit(X_train,y_train.values.ravel())

# Obtaining predictions
y_hat = model.predict(X_test)

# Collecting the predictions and true labels, indexed like X_test so the rows stay aligned
predictions_col = pd.DataFrame(index=X_test.index)
predictions_col['predicted_class'] = y_hat
predictions_col['true_class'] = y_test.values.ravel()

# Calculating the errors with the absolute value 
predictions_col['errors'] = abs(predictions_col['predicted_class'] - predictions_col['true_class'])

# It doesn't matter whether the misclassification is between class 0 and 2 or between 0 and 1,it has the same error value. 
predictions_col['errors'] = predictions_col['errors'].replace(2.0,1.0)

# Adding predictions to test data
df_out = pd.merge(X_test,predictions_col,left_index = True,right_index = True)

# Scaling the features (df_matrix is assumed to be the feature columns of df_out)
df_matrix = df_out[data.feature_names]
scaled_matrix = StandardScaler().fit_transform(df_matrix)

# Calculating the errors of the instances in the clusters.
def F_score(results, class_number):
    # Boolean masks for "belongs to the class" and "was predicted as the class"
    is_true = results["true_class"] == class_number
    is_pred = results["predicted_class"] == class_number

    true_pos = (is_true & is_pred).sum()
    false_pos = (~is_true & is_pred).sum()
    false_neg = (is_true & ~is_pred).sum()

    # Without true positives, precision or recall is 0 (or undefined), so the F-score is 0
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)

    return 2 * precision * recall / (precision + recall)

# Calculating the macro average F-score over the classes present in the data
def mean_f_score(results):
    classes = results['true_class'].unique()
    class_list = [F_score(results, c) for c in classes]

    return sum(class_list) / len(classes)

def calculate_bias(clustered_data,cluster_number):
    cluster_x = clustered_data.loc[clustered_data["assigned_cluster"] == cluster_number]
    remaining_clusters = clustered_data.loc[clustered_data["assigned_cluster"] != cluster_number]
    
    # Bias definition (F-score of the remaining clusters minus F-score of cluster k):
    return mean_f_score(remaining_clusters) - mean_f_score(cluster_x)

MAX_ITER = 10
cluster_comparison = []
clus_model_kwargs = {'n_init': 10, 'random_state': 2}  # assumed values
bias_prev_iteration = float('-inf')  # assumed start value, so the first split is always accepted

# start with all instances in one cluster
for i in range(1, MAX_ITER):
    kmeans_algo = KMeans(n_clusters=2, **clus_model_kwargs).fit(scaled_matrix)

    # calculate_bias needs the true and predicted classes next to the assigned cluster
    clustered_data = df_out[['true_class', 'predicted_class']].copy()
    clustered_data['assigned_cluster'] = kmeans_algo.predict(scaled_matrix)

    # Calculating bias per cluster
    negative_bias_0 = calculate_bias(clustered_data, 0)
    negative_bias_1 = calculate_bias(clustered_data, 1)

    # Step 5 (assumed completion): keep a copy of the assignments if the split is accepted
    if max(negative_bias_0, negative_bias_1) >= bias_prev_iteration:
        cluster_comparison.append(clustered_data['assigned_cluster'].copy())
        bias_prev_iteration = max(negative_bias_0, negative_bias_1)
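
One possible way to handle step 5 without overwriting anything is to stop relabelling the whole dataset and instead keep a list of cluster records, each holding the row indices of one cluster and the bias it had when it was created. A split then replaces the parent record only when it is accepted. The sketch below assumes this design. `Cluster`, `try_split` and the hard-coded bias values are hypothetical stand-ins; in the real code the two halves would come from KMeans and the biases from `calculate_bias`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    rows: tuple   # row indices belonging to this cluster
    bias: float   # bias computed when the cluster was created


def try_split(cluster_list, parent, child_a, child_b):
    """Replace parent by its children if the split does not lower the maximum bias."""
    if max(child_a.bias, child_b.bias) >= parent.bias:
        remaining = [c for c in cluster_list if c is not parent]
        return remaining + [child_a, child_b], True
    return cluster_list, False


# Start: all six rows in one cluster (its bias of 0.0 is an assumed start value)
root = Cluster(rows=(0, 1, 2, 3, 4, 5), bias=0.0)
cluster_list = [root]

# Accepted split: the better child has a higher bias than the parent
left = Cluster(rows=(0, 1, 2), bias=0.3)
right = Cluster(rows=(3, 4, 5), bias=-0.3)
cluster_list, accepted_first = try_split(cluster_list, root, left, right)

# Rejected split: both children have a lower bias than their parent
left_a = Cluster(rows=(0, 1), bias=0.1)
left_b = Cluster(rows=(2,), bias=0.2)
cluster_list, accepted_second = try_split(cluster_list, left, left_a, left_b)

print(accepted_first, accepted_second)   # True False
print([c.rows for c in cluster_list])    # [(0, 1, 2), (3, 4, 5)]
```

Because the records are frozen and a rejected split returns the list unchanged, the previous assignments and scores remain available for comparison in every iteration.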

cluster-analysis fairness-indicators k-means python scikit-learn