ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input

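Problem description

I am trying to reproduce this GitHub project, which uses Topological Data Analysis (TDA), on my machine.

My steps:

Get the best parameters from the cross-validation output
Load my feature-selected dataset
Extract topological features from the dataset for prediction
Create a Random Forest classifier model based on the best parameters
Compute the probabilities for the test data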

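Background:

Feature selection

To decide which attributes belong to which group, we created a correlation matrix. From it, we saw two large groups within which player attributes are strongly correlated with each other. We therefore decided to split the attributes into two groups, one summarizing a player's attacking characteristics and the other the defensive ones. Finally, since goalkeepers' statistics are completely different from those of other players, we decided to consider only their overall rating. Below are the 24 features used for each player:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"

From this set of features, the next step is to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.

Finally, for each team in a given match, we compute the mean and the standard deviation of attack and defense over the team's players, as well as the best attack and the best defense.

In this way, a match is described by 14 features (for each of the two teams: GK overall rating, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match into a space according to the characteristics of the two teams.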
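The aggregation described above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the helper names `team_features` and `match_features`, the toy ratings, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def team_features(players, gk_rating):
    """Summarise one team as 7 numbers.

    players:   list of (attack, defense) pairs, one per outfield player, where
               each value is already the mean of that player's attack or
               defense attributes.
    gk_rating: the goalkeeper's overall_rating.
    """
    attack = [a for a, _ in players]
    defense = [d for _, d in players]
    return {
        "best_attack": max(attack),
        "best_defense": max(defense),
        "avg_attack": mean(attack),
        "avg_defense": mean(defense),
        "std_attack": pstdev(attack),    # assumption: population std
        "std_defense": pstdev(defense),  # assumption: population std
        "gk": gk_rating,
    }


def match_features(home, home_gk, away, away_gk):
    """Concatenate the home and away summaries: 7 + 7 = 14 features."""
    row = {f"home_{k}": v for k, v in team_features(home, home_gk).items()}
    row.update({f"away_{k}": v for k, v in team_features(away, away_gk).items()})
    return row


# Toy ratings, made up for illustration: three outfield players per team.
home = [(70.0, 60.0), (80.0, 50.0), (75.0, 65.0)]
away = [(65.0, 72.0), (68.0, 70.0), (62.0, 74.0)]
row = match_features(home, 82.0, away, 79.0)
print(len(row))  # 14
```

The 14 keys correspond in spirit to the first 14 columns of the loaded dataset (home_best_attack ... gk_away_player_1), although the exact column order used by the project is not reproduced here.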

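Feature extraction

The goal of TDA is to capture the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.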
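To make the idea of a "neighborhood" concrete, the sketch below picks the k matches closest to a query match in feature space and returns them as a small point cloud. It is only an illustration: the function `k_nearest_neighbourhood` and the toy points are hypothetical, and the project's own SubSpaceExtraction step (with its k_min, k_max and dist_percentage parameters) may select neighbors differently.

```python
import math


def k_nearest_neighbourhood(points, query, k):
    """Return the k points closest to `query` by Euclidean distance."""
    ranked = sorted(points, key=lambda p: math.dist(p, query))
    return ranked[:k]


# Toy 2-D "matches"; real matches would have 14 coordinates.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
cloud = k_nearest_neighbourhood(points, query=(0.5, 0.0), k=3)
print(cloud)
```

A persistence diagram would then be computed on each such cloud, and the topological features summarise that diagram.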

Methods:

def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params,top_feat_params,top_model_feat_params,*_ = cv_output

    return top_feat_params,top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:,:-1]
    y_train = x_y[:,-1]

    return x_train_with_topo,y_train


def extract_x_test_features(x_train,y_train,players_df,pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players,from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:,:14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo,y_train,players_df.values,y_test,pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck','wasserstein','landscape','betti','heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features,axis=1)
    return new_features

def extract_features_for_prediction(x_train,y_train,x_test,y_test,pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0,len(x_test),shift)):
        #
        print(range(0,shift) )
        if i+shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train,x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train,y_test[i: i + shift].reshape((-1,))])
        diagrams_batch,_ = pipeline.fit_transform_resample(batch,batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train,batch[-shift:]])
        all_y_train = np.concatenate([all_y_train,batch_y[-shift:]])
    final_x_test = np.concatenate([x_test,np.concatenate(top_features,axis=0)],axis=1)
    return final_x_test

def get_probabilities(model,x_test,team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing,for each match in the test set,the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred,columns=['away_team_prob','draw_prob','home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True),prob_match_df],axis=1)
    return prob_match_df

Working code:

best_pipeline_params,best_model_feat_params = get_best_params()

# best_pipeline_params -> {'k_min': 50,'k_max': 175,'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000,'max_depth': 10,'random_state': 52,'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
#                              SubSpaceExtraction(dist_percentage=0.1,k_max=175,k_min=50)),
#                             ('create_diagrams',VietorisRipsPersistence(n_jobs=-1))])

x_train,y_train = load_dataset()

# x_train.shape -> (2565,19)
# y_train.shape -> (2565,)

x_test = extract_x_test_features(x_train,y_train,new_players_df_stats,pipeline)

# x_test.shape -> (380,24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train,y_train)
matches_probabilities = get_probabilities(rf_model,x_test,team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities,'premier league')

But I get the error:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.

The loaded dataset (X_train):

Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64

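Note that the first 14 columns are the features describing the match, and the remaining 5 features (excluding the label) are the topological features that have already been extracted.

The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which are supposed to compute the topological features and stack them onto the dataset.

Since X_train already has topological features, it adds 5 more, so I end up with 24 features.

I am not sure, though. I am just trying to wrap my head around this project... and how the prediction is made here.

How do I fix the mismatch using the code above?

Notes:

1 - x_train and y_test are not DataFrames but numpy.ndarray

2 - This problem is fully reproducible if you clone or download the project from the following link:

Github Link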

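Solution

The answer is actually already given in the question.

You mention # x_test.shape -> (380,24) and # x_train.shape -> (2565,19) in your question. The shapes clearly do not match: your training data has 19 features while your test data has 24, and the two must contain the same number of features. So when you pass x_test to the model on this line, get_probabilities(rf_model,x_test,team_ids), you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input".

Therefore, your test data must have the same 19 features as your training data.

Your x_train has 19 features while X_test has 24. Why is that?

To fix it, display both data sets (x_train and X_test) and try to work out why they have different features. In the end, both must have the same shape and the same features. Otherwise, you will get this error.

It may be that there is a problem with the dataset you imported.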
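A quick way to start that comparison is to split each column count into the 14 match-description columns and whatever is left over. The sketch below uses only the shapes reported in the question. The helper `split_width` is hypothetical, and treating the first 14 test columns as match features is an assumption, based on extract_features_for_prediction stacking x_test rows under x_train[:,:14] (which only works if both have 14 columns).

```python
def split_width(shape, n_base):
    """Return (n_base, n_extra) for a 2-D array shape (n_rows, n_cols)."""
    n_rows, n_cols = shape
    if n_cols < n_base:
        raise ValueError(f"expected at least {n_base} columns, got {n_cols}")
    return n_base, n_cols - n_base


N_BASE = 14     # match-description columns (see the column listing)
N_METRICS = 5   # bottleneck, wasserstein, landscape, betti, heat

train_shape = (2565, 19)  # x_train.shape reported in the question
test_shape = (380, 24)    # x_test.shape reported in the question

_, train_extra = split_width(train_shape, N_BASE)
_, test_extra = split_width(test_shape, N_BASE)

print("train extra columns:", train_extra)  # 5
print("test extra columns:", test_extra)    # 10
print("columns per metric in test:", test_extra / N_METRICS)  # 2.0
```

Ten extra columns from five metrics would be consistent with each Amplitude transform returning two values per sample (for instance one per homology dimension) instead of one, but that is a guess. Printing the shape of each amplitude.fit_transform(diagrams) result inside extract_topological_features would confirm or refute it.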

Here is how to use RandomizedSearchCV to find the best parameters for your model:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline2 = Pipeline([
    ('scaler',StandardScaler()),
    ('clf',RandomForestClassifier(n_estimators=62,max_depth=16)),
])

# cycle through your pickle file parameter combinations here:
# ('auto' was removed from max_features in scikit-learn 1.3, so 'log2' is used instead)
param_grid = {'n_estimators': list(range(30,100)),'max_depth': list(range(5,26)),'max_features': ['sqrt','log2']}

# Only the classifier step is searched; X_train, y_train, X_test and y_test must already be defined
random_rf_class = RandomizedSearchCV(
    estimator=pipeline2['clf'],param_distributions=param_grid,n_iter=10,
    scoring='accuracy',n_jobs=2,cv=10,refit=True,return_train_score=True)

random_rf_class.fit(X_train,y_train)

predictions = random_rf_class.predict(X_test)

print("Model accuracy {}%".format(accuracy_score(y_test,predictions)*100))

# Print the values sampled for both hyperparameters
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])

print(random_rf_class.best_params_)
print(random_rf_class.best_score_)

Returning a slice with the first 19 features here:

def extract_features_for_prediction(x_train,y_train,x_test,y_test,pipeline):
    (...)
    return final_x_test[:,:19]

removes the error, and the test runs.
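One caveat with this slice, sketched below under an assumption: if the 24 test columns are the 14 match features followed by 10 topological columns laid out metric by metric (two values per metric), then [:,:19] keeps the columns of only the first two and a half metrics. Those would not line up with the five *_metric columns the model was trained on. The column names here are hypothetical labels used purely for illustration.

```python
base = [f"base_{i}" for i in range(14)]
metrics = ["bottleneck", "wasserstein", "landscape", "betti", "heat"]

# Assumed layout: the metrics are concatenated along axis=1 in this order,
# each contributing two columns (suffixes 0 and 1).
topo = [f"{m}_{d}" for m in metrics for d in (0, 1)]

test_columns = base + topo   # 24 names
kept = test_columns[:19]     # what final_x_test[:, :19] would keep

print(kept[14:])
# ['bottleneck_0', 'bottleneck_1', 'wasserstein_0', 'wasserstein_1', 'landscape_0']
```

If that layout holds, the error disappears only because the widths now match, while the model may be reading different quantities from the ones it saw during training. Checking the shape of each Amplitude output is a cheap way to find out.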

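However, I still do not get the gist of it.

I will award the bounty to anyone who explains to me the idea behind the test set in the context of this project, as used in the project notebook, which can be found here: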

Project Notebook
