Why does PCA repeat some components in its output?

Problem description

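I am working with the CTU-13 dataset; an overview of its distribution is available here. I am using scenario 11 of CTU-13 (S11.csv), which you can access here.

Given the synthetic nature of the dataset, I need to find out which features matter most for the feature engineering stage.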

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Load the dataset
df = pd.read_csv('/content/drive/My Drive/s11.csv')
# Keep only the rows whose label contains 'normal' or 'Bot'
df = df.loc[df['Label'].str.contains('normal|Bot', na=False)].copy()
# Binary labeling: rows containing 'normal' -> 0, the remaining (Bot) rows -> 1
df['Label'] = (~df['Label'].str.contains('normal')).astype(int)

# Data cleaning
# Drop columns with more than 70% missing values
df = df.loc[:, df.isnull().mean() <= 0.7].copy()

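# Encode string (object) columns as integer codes starting at 1;
# missing values end up as 0 because pd.factorize marks them with -1
for col in df.columns:
    if df[col].dtype == object:
        df[col] = pd.factorize(df[col])[0] + 1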

# Replace the remaining missing values with the mean of each column
df = df.fillna(df.mean())

# Apply PCA to every column except the last one (Label)
arr = df.to_numpy()
arr = arr[:, :-1]
pca = PCA(n_components=10)
x_pca = pca.fit_transform(arr)
explain = pca.explained_variance_ratio_

# Find the most influential feature on each of the 10 components
n_pcs = pca.components_.shape[0]
# Index of the feature with the largest absolute loading on EACH component
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]

# Feature names, excluding the last column (Label), which was dropped before PCA
initial_feature_names = list(df.columns[:-1])

# Map the indices back to column names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

print('important column by order: ')
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# Build the dataframe
top_components = pd.DataFrame(dic.items())
print(top_components)

Question: why does the PCA output repeat some of the names, for example DstLoad for both PC4 and PC5?

important column by order: 
     0         1
0  PC0  TotBytes
1  PC1  SrcBytes
2  PC2      Load
3  PC3       Seq
4  PC4   DstLoad
5  PC5   DstLoad
6  PC6     Sport
7  PC7      Load
8  PC8      Rate
9  PC9      Rate
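
Repeated names like the ones in this table do not necessarily mean that PCA returned duplicate components. Each component is a weighted combination of all features, and taking the argmax of the absolute loadings only reports the single heaviest feature, so one feature can be the heaviest on several different components. The sketch below reproduces that effect on synthetic data. The `directions` and `variances` arrays are made-up values chosen for illustration and have nothing to do with CTU-13.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical orthonormal directions: feature 0 carries the largest absolute
# weight in BOTH of the first two rows (about 0.71 versus 0.5).
directions = np.array([
    [1 / np.sqrt(2),  0.5,  0.5],
    [1 / np.sqrt(2), -0.5, -0.5],
    [0.0, 1 / np.sqrt(2), -1 / np.sqrt(2)],
])
variances = np.array([5.0, 2.0, 0.5])  # made-up variance along each direction

rng = np.random.default_rng(0)
scores = rng.standard_normal((5000, 3)) * np.sqrt(variances)
X = scores @ directions  # synthetic data with known principal directions

pca = PCA(n_components=3).fit(X)
loadings = np.abs(pca.components_)        # one row per component
most_important = loadings.argmax(axis=1)  # heaviest feature on each component

print(np.round(loadings, 2))
print(most_important[:2])  # index of the heaviest feature on PC0 and PC1
```

The first two components are different, mutually orthogonal directions, yet the argmax picks feature 0 for both of them.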

Any help debugging this would be greatly appreciated! I may have missed something in my implementation.
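
One thing that may be worth checking, offered as a possibility rather than a diagnosis: the code above passes unscaled columns to PCA. PCA is sensitive to the scale of its inputs, so columns with very large values (byte counters, for instance) can dominate the leading components. The sketch below uses a made-up three-column frame (`demo`, with invented values) to compare PCA on raw and standardized data, and prints the full loadings table instead of only the argmax.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up stand-in for flow features; the scales differ by orders of magnitude
demo = pd.DataFrame({
    'TotBytes': rng.normal(50_000, 20_000, 1000),
    'Rate': rng.normal(10, 3, 1000),
    'Seq': rng.normal(500, 100, 1000),
})

# PCA on the raw values: the large-scale column absorbs almost all variance
raw_pca = PCA(n_components=3).fit(demo.to_numpy())

# PCA after standardizing every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(demo)
scaled_pca = PCA(n_components=3).fit(scaled)

# Full loadings table: one row per component, one column per feature
loadings = pd.DataFrame(
    scaled_pca.components_,
    columns=demo.columns,
    index=[f'PC{i}' for i in range(3)],
)
print(raw_pca.explained_variance_ratio_)
print(scaled_pca.explained_variance_ratio_)
print(loadings.round(2))
```

Reading the whole loadings table shows how every feature contributes to each component, which a single name per component cannot convey.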
