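Detecting fake text

Problem description

I have a set of log files that I consider good, and I want to train a model that learns that these are good log files.

Then I want to test it on new log files that it has never seen before, which should therefore be detected as fake.

How should I go about this?

I tried using IsolationForest together with CountVectorizer: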

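from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)

# Learn the vocabulary from the known-good text only.
X = cv.fit_transform(["hello how are you", "i am fine"])
# Encode the unseen text with the same vocabulary. Use transform here, not
# fit_transform, which would refit the vectorizer and change what each column means.
X_test = cv.transform(["this is a strange sentence", "another sentence here"])

print(X.toarray())
print(X_test.toarray())

clf = IsolationForest(random_state=0).fit(X)

print(clf.predict(X_test))

# Reported in the original post (which called fit_transform on the test text): array([1, 1])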

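However, IsolationForest classifies the test text as an inlier, presumably because the counts are also counts of valid words. I don't know how to make IsolationForest flag this text as "strange" on the grounds that it has never been seen before.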
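
One way to inspect what the model actually receives is to encode the unseen sentences with the vocabulary learned from the good text. In this minimal sketch (the variable names are illustrative and not from the original post), every word in the test sentences lies outside that vocabulary, so `transform` drops them and the resulting rows are all zeros. Nothing in the feature matrix records that unknown words were present, which may be part of why the detector has little to go on.

```python
from sklearn.feature_extraction.text import CountVectorizer

good_text = ["hello how are you", "i am fine"]
unseen_text = ["this is a strange sentence", "another sentence here"]

# Fit the vocabulary on the known-good text only.
cv = CountVectorizer(binary=True).fit(good_text)

# The default tokenizer keeps tokens of two or more characters,
# so "i" and "a" never enter the vocabulary.
vocabulary = sorted(cv.vocabulary_)
print(vocabulary)

# Words outside the fitted vocabulary are silently ignored by transform.
X_unseen = cv.transform(unseen_text).toarray()
print(X_unseen)
```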

Solution
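
No confirmed answer accompanies this question. As one possible direction, and not the asker's method, the sketch below swaps IsolationForest for a much simpler score: the fraction of tokens in each new text that never appeared in the known-good text. The helper `oov_ratio` and the cut-off `THRESHOLD` are hypothetical names and values chosen for illustration; a real threshold would need tuning on actual log files.

```python
from sklearn.feature_extraction.text import CountVectorizer


def oov_ratio(texts, vectorizer):
    """Return, for each text, the fraction of tokens missing from the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()  # same tokenization the vectorizer uses
    vocab = vectorizer.vocabulary_
    ratios = []
    for text in texts:
        tokens = analyzer(text)
        if not tokens:
            # Assumption: a text with no usable tokens counts as fully unknown.
            ratios.append(1.0)
            continue
        unseen = sum(1 for token in tokens if token not in vocab)
        ratios.append(unseen / len(tokens))
    return ratios


good_logs = ["hello how are you", "i am fine"]
new_logs = ["this is a strange sentence", "hello how are you", "another sentence here"]

cv = CountVectorizer(binary=True).fit(good_logs)

THRESHOLD = 0.5  # hypothetical cut-off, not a recommended value
scores = oov_ratio(new_logs, cv)

# Follow the scikit-learn convention: -1 for outliers, 1 for inliers.
labels = [-1 if score > THRESHOLD else 1 for score in scores]
print(scores)
print(labels)
```

This only measures vocabulary novelty. A log line made entirely of known words in an unusual order would still pass, so it is a starting point rather than a complete detector.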

anomaly-detection data-science