Inconsistency between the contamination setting and the number of predicted outliers in sklearn's Isolation Forest

Problem description

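Inspired by this notebook, I am experimenting with the IsolationForest algorithm in scikit-learn==0.22.2.post1 for anomaly detection on the SF version of the KDDCUP99 dataset, which includes 4 attributes. The data is fetched directly from sklearn and, after preprocessing (label encoding the categorical feature), passed to the IF algorithm with the default setup.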

The full code is as follows:

from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score,roc_curve,roc_auc_score,f1_score,precision_recall_curve,auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
import datetime

%matplotlib inline


def byte_decoder(val):
    # decodes byte literals to strings
    
    return val.decode('utf-8')

#Load Dataset KDDCUP99 from sklearn
target = 'target'
sf = datasets.fetch_kddcup99(subset='SF',percent10=False) # you can use percent10=True for convenience sake
dfSF=pd.DataFrame(sf.data,columns=["duration","service","src_bytes","dst_bytes"])
assert len(dfSF)>0,"SF dataset not loaded."

dfSF[target]=sf.target
anomaly_rateSF = 1.0 - len(dfSF.loc[dfSF[target]==b'normal.'])/len(dfSF)

"SF Anomaly Rate is:"+"{:.1%}".format(anomaly_rateSF)
#'SF Anomaly Rate is: 0.5%'

#Data Processing 
toDecodeSF = ['service']
# apply label encoding to fields of type string
# convert all abnormal target types to a single anomaly class

dfSF['binary_target'] = [1 if x==b'normal.' else -1 for x in dfSF[target]]
    
leSF = preprocessing.LabelEncoder()

for f in toDecodeSF:
    # decode the byte strings first, then label-encode the decoded values
    dfSF[f + " (encoded)"] = leSF.fit_transform(dfSF[f].map(byte_decoder))

for f in toDecodeSF:
  dfSF.drop(f,axis=1,inplace=True)

dfSF.drop(target,axis=1,inplace=True) # axis=1 so the column (not a row label) is dropped

#check rate of Anomaly for setting contamination parameter in IF
dfSF["binary_target"].value_counts() / np.sum(dfSF["binary_target"].value_counts())



#data split
X_train_sf,X_test_sf,y_train_sf,y_test_sf = train_test_split(dfSF.drop('binary_target',axis=1),dfSF['binary_target'],test_size=0.33,random_state=11,stratify=dfSF['binary_target'])

#print(y_test_sf.value_counts())
#1       230899
#-1      1114
#Name: binary_target,dtype: int64

#training IF and predict the outliers/anomalies on test set with 10% contamination:
clfIF = IsolationForest(max_samples="auto",contamination = 0.1,n_estimators=100,n_jobs=-1)

clfIF.fit(X_train_sf,y_train_sf)
y_pred_test = clfIF.predict(X_test_sf)

#print(X_test_sf.shape)
#(232013,4)

#print(np.unique(y_pred_test,return_counts=True))
#(array([-1,1]),array([ 23248,208765])) # instead of labeling 10% of 232013,which is 23201 data outliers/anomalies,It is 23248 !!

Treating the KDDCUP99 dataset as a binary problem, we can extract the true positives and the other counts as follows:

tn,fp,fn,tp = confusion_matrix(y_test_sf,y_pred_test).ravel()
print("TN: ",tn,"FP: ",fp,"FN: ",fn,"TP: ",tp)
#TN:  1089 FP:  25 FN:  22159 TP:  208740
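For reference, a small sketch of how confusion_matrix orders these four counts. By default it sorts the labels, so with -1/1 labels the value -1 (anomaly) is treated as the negative class and 1 (normal) as the positive class, which matches the numbers above (TN + FP = 1114). The toy arrays below are made up for illustration; passing labels=[1,-1] is one way to make the anomaly class the positive one.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration): 1 = normal, -1 = anomaly
y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, 1])

# Default: labels are sorted to [-1, 1], so -1 is "negative" and 1 is "positive"
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("normal as positive: ", tn, fp, fn, tp)

# Explicit label order [1, -1]: now the anomaly class -1 is "positive"
tn_a, fp_a, fn_a, tp_a = confusion_matrix(y_true, y_pred, labels=[1, -1]).ravel()
print("anomaly as positive:", tn_a, fp_a, fn_a, tp_a)
```

With the default ordering, TN + FP counts the true anomalies and FN + TP counts the true normal points; with labels=[1,-1] the roles are swapped.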

Questions:

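Question 1: Why does IF label more outliers/anomalies on the test set than the 10% contamination that was set? It labels 23248 instead of 23201!
Question 2: Normally TN + FP should be the inliers/normal points (230899), and FN + TP should equal the 1114 we counted after the data split. I think it is the other way around in my implementation, but I can't figure it out or debug it.
Question 3: Based on the KDDCUP99 dataset documentation, its user guide, and my own calculation in this implementation, the anomaly rate is 0.5%, which means that if I set contamination=0.005, it should give me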

I am probably missing something here; any help would be greatly appreciated.

Solution

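The contamination parameter only controls the threshold on the decision function, that is, the score below which a data point is considered an outlier. It has no impact on the model itself. It may make sense to use some statistical analysis to get a rough estimate of the contamination.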
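A minimal sketch of this behavior, assuming synthetic Gaussian data in place of the KDDCUP99 SF subset. According to the scikit-learn documentation, decision_function = score_samples - offset_, and when contamination is not "auto" the offset is chosen so that the expected fraction of outliers is obtained on the training data. The fraction flagged on a separate test set therefore only lands near the setting, which is consistent with seeing 23248 (about 10.02%) rather than exactly 10%.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption; not the KDDCUP99 SF subset)
rng = np.random.RandomState(0)
X_train = rng.normal(size=(2000, 4))
X_test = rng.normal(size=(1000, 4))

clf = IsolationForest(contamination=0.1, n_estimators=100, random_state=0)
clf.fit(X_train)

# Raw anomaly scores: lower means more anomalous
train_scores = clf.score_samples(X_train)

# The cut-off is the 10th percentile of the TRAINING scores
expected_offset = np.percentile(train_scores, 100 * 0.1)
offset_matches = bool(np.isclose(clf.offset_, expected_offset))

# predict() returns -1 wherever decision_function (score - offset_) is negative
y_pred = clf.predict(X_test)
same_labels = np.array_equal(y_pred == -1, clf.decision_function(X_test) < 0)

train_fraction = np.mean(clf.predict(X_train) == -1)
test_fraction = np.mean(y_pred == -1)
print("offset matches percentile:", offset_matches)
print("flagged on train:", train_fraction, "flagged on test:", test_fraction)
```

The training fraction sits at the requested 10%, while the test fraction depends on how the test scores happen to fall around the same fixed cut-off.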

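If you expect a certain number of outliers in your dataset, you can use the raw scores to find a threshold that gives you that number, and then retroactively set the contamination parameter when applying the model to new data.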
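One way this could look, as a sketch rather than the only approach: score the data with score_samples, take the n lowest scores as outliers, and read off the implied threshold and contamination. The data is synthetic, and n_outliers is a hypothetical target chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption; not the KDDCUP99 SF subset)
rng = np.random.RandomState(0)
X_train = rng.normal(size=(2000, 4))
X_new = rng.normal(size=(1000, 4))

# Fit once; the trees do not depend on the contamination value
clf = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

n_outliers = 50  # hypothetical number of outliers we expect in X_new

# Raw scores: lower means more anomalous
scores = clf.score_samples(X_new)

# Flag exactly the n_outliers lowest-scoring points
outlier_idx = np.argsort(scores)[:n_outliers]
labels = np.ones(len(X_new), dtype=int)
labels[outlier_idx] = -1

# The score of the least anomalous flagged point acts as the threshold
threshold = scores[outlier_idx].max()
implied_contamination = n_outliers / len(X_new)
print("threshold:", threshold, "implied contamination:", implied_contamination)
```

The implied contamination (here 0.05) can then be passed as the contamination argument when the model is refit for use on new data.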

anomaly-detection isolation-forest machine-learning python scikit-learn