ML: How do I make sure the labels are in the right position? Classification/regression results that look too good to be true

Problem description

I am working with market data of 1100 rows and 84 columns.

https://archive.ics.uci.edu/ml/machine-learning-databases/00554/

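I am relatively new to data cleaning, so I suspect I have misplaced the labels. The idea is to take the log return of the index level at a lag of n+1 and label it y = {0, 1}, where 0 is a down day and 1 is an up day.

I call dropna twice: once at the start to remove the initial NaN values, and once after the log-return calculation, because the first row becomes NaN. How do I make sure the labels end up on the right rows?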
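One way to check the alignment is to build the label on a tiny frame where the answer can be verified by hand. The sketch below assumes that the target is the next day's direction, that the price sits in a column named `Close`, and that a helper called `make_labels` does the work; all three are illustrative assumptions, not details taken from the dataset above.

```python
import numpy as np
import pandas as pd

def make_labels(df, price_col="Close"):
    """Hypothetical helper: add a log return and a next-day up/down label."""
    out = df.copy()
    # Log return from the previous row to this row; the first row is NaN.
    out["Return"] = np.log(out[price_col] / out[price_col].shift(1))
    # Next row's return, pulled back one row so it lines up with today's features.
    next_ret = out["Return"].shift(-1)
    # Filter rows before building the label so NaN is never turned into a class.
    keep = out["Return"].notna() & next_ret.notna()
    out = out.loc[keep].copy()
    out["Class"] = (next_ret.loc[keep] > 0).astype(int)
    return out

prices = pd.DataFrame(
    {"Close": [100.0, 101.0, 99.0, 102.0, 103.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
labeled = make_labels(prices)
print(labeled)
```

For any surviving row, `Class` should be 1 exactly when the following row's `Close` is higher. Note that a comparison such as `NaN > 0` evaluates to False, so labelling before dropping NaN rows silently turns missing returns into class 0.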

# Compare classifiers on each index
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split

for dataset_name, dataset in [('S&P500', data_spx1), ('RUSSELL', data_russ1), ('NASDAQ', data_ndaq1), ('NYSE', data_nyse1), ('DJIA', data_dow1)]:

    # Features are every column except the label
    X = dataset.loc[:, dataset.columns != 'Class']
    y = dataset['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    for name, model in class_models:
        mod = model
        mod.fit(X_train, y_train)
        accuracy = mod.score(X_test, y_test)
        n = 5
        kfold = KFold(n_splits=n, shuffle=True, random_state=None)
        # cross_val_score needs y as well as X
        cv_result = cross_val_score(mod, X, y, cv=kfold, scoring='accuracy')
        print(dataset_name, name, 'test_accuracy =', accuracy)
        print('CV random: ', cv_result)
        print('CV avg.: ', np.sum(cv_result) / n)
    print("")

Average test-set accuracy: NaiveBayes 0.80, SVM 0.55, KNN 0.56, DecisionTree 0.9, RandomForest 0.98

S&P500 NaiveBayes test_accuracy = 0.8293413173652695
CV avg.:  0.8220538924574798
S&P500 SVM test_accuracy = 0.5778443113772455
CV avg.:  0.5570233911041086
S&P500 KNN test_accuracy = 0.5718562874251497
CV avg.:  0.5678463216579808
S&P500 DTree test_accuracy = 0.8622754491017964
CV avg.:  0.8831979962024805
S&P500 RForest test_accuracy = 0.9311377245508982
CV avg.:  0.9272209429160101

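Shuffled CV returns averages close to the test results. Are these results normal for the dataframe I am using, or do I have duplicated data points? I tried dropping the 'Return' column that the class is based on, and the results were almost the same.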
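Two cheap checks for that question, sketched on synthetic data because the real frames are not shown here: count exact duplicate rows, and re-score the model with folds that respect time order instead of shuffling. `TimeSeriesSplit` is scikit-learn's ordered splitter; the frame `df`, the feature names, and the choice of classifier are placeholders rather than the setup above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Placeholder data: 200 rows of noise features with a random up/down label.
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
df["Class"] = rng.integers(0, 2, size=200)

# 1) Exact duplicate rows would let shuffled folds see test rows during training.
n_dupes = int(df.duplicated().sum())
print("duplicate rows:", n_dupes)

# 2) Ordered folds: every training fold ends before its test fold starts.
X = df.drop(columns="Class")
y = df["Class"]
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=tscv, scoring="accuracy")
print("ordered CV accuracy:", scores.mean())
```

If accuracy drops sharply under ordered folds, the shuffled scores were probably helped by neighbouring days landing on both sides of the split. If it stays near the shuffled value, a feature that encodes the label directly is the more likely explanation.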

Here are the F1 scores:

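# Random forest - F1 score
# data_dow1
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# The split line was truncated in the post; X, y and test_size = 0.3 are
# reconstructed from the comparison loop and the 334 test rows reported below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
rf = RandomForestClassifier(random_state=None)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix: \n', cm)
print('Classification report: \n', classification_report(y_test, y_pred))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()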

Output:

Confusion matrix: 
 [[145   1]
 [  0 188]]
Classification report: 
               precision    recall  f1-score   support

         0.0       1.00      0.99      1.00       146
         1.0       0.99      1.00      1.00       188

    accuracy                           1.00       334
   macro avg       1.00      1.00      1.00       334
weighted avg       1.00      1.00      1.00       334
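A confusion matrix this clean often means that some column carries the answer. A rough way to look for it is to score each feature on its own. The sketch below uses synthetic data and a hypothetical column named `leaky` that is simply the same-day return; that is an assumption for illustration, not a claim about which columns the dataset contains.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
ret = rng.normal(size=300)            # synthetic daily returns
df = pd.DataFrame({
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
    "leaky": ret,                     # hypothetical: same-day return left in X
})
y = (ret > 0).astype(int)             # label derived from that same return

# Score a one-split tree on each column by itself (default scoring is accuracy).
single_scores = {}
for col in df.columns:
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    single_scores[col] = cross_val_score(stump, df[[col]], y, cv=5).mean()

for col, acc in sorted(single_scores.items(), key=lambda kv: -kv[1]):
    print(f"{col:8s} {acc:.3f}")
```

Any column that scores far above the rest by itself is worth tracing back to how it was computed, in particular whether it uses the same day's close that the label is built from.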

classification data-cleaning machine-learning python regression