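Problem description
I am trying to reproduce this GitHub project, which uses topological data analysis (TDA), on my machine.
My steps:
Background:
- Feature selection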
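To decide which attributes belong to which group, we built a correlation matrix. From it, we saw that there are two big groups in which the player attributes are strongly correlated with each other. We therefore decided to split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones. Finally, since goalkeepers' statistics are completely different from those of the other players, we decided to consider only their overall rating. Below are the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"
Goalkeeper: "overall_rating"
From this set of features, the next step is to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defense attributes.
Finally, for each team in a given match, we compute the mean and the standard deviation of attack and defense from these per-player statistics, as well as the best attack and the best defense.
In this way, a match is described by 14 features (GK overall value, best attack, std attack, mean attack, best defense, std defense, mean defense, for each of the two teams), which map the match into the feature space according to the characteristics of the two teams.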
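The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's own code: the function names are hypothetical, the per-player attack and defense means are assumed to be computed already, and the population standard deviation is an assumption because the text does not say which variant is used. The output keys follow the column names of the loaded dataset shown further down.

```python
from statistics import mean, pstdev


def team_features(players):
    """Summarise one team from a list of (attack, defense) scores, one pair per outfield player."""
    attacks = [attack for attack, _ in players]
    defenses = [defense for _, defense in players]
    return {
        "best_attack": max(attacks),
        "best_defense": max(defenses),
        "avg_attack": mean(attacks),
        "avg_defense": mean(defenses),
        "std_attack": pstdev(attacks),    # assumption: population standard deviation
        "std_defense": pstdev(defenses),
    }


def match_features(home_players, home_gk, away_players, away_gk):
    """Build the 14-feature description of a match: 7 values for each of the two teams."""
    row = {}
    for side, players, gk in (("home", home_players, home_gk),
                              ("away", away_players, away_gk)):
        for name, value in team_features(players).items():
            row[f"{side}_{name}"] = value
        row[f"gk_{side}_player_1"] = gk   # goalkeeper: overall rating only
    return row
```

Calling match_features with the outfield players and goalkeeper ratings of both teams returns a dictionary with exactly 14 entries, which is where the 14 non-topological columns of the dataset come from.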
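- Feature extraction
The purpose of TDA is to capture the structure of the space underlying the data. In our project, we assume that the neighbourhood of a data point hides meaningful information that is correlated with the outcome of the match. We therefore explored the data space looking for this kind of correlation.
Methods: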
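import numpy as np
import pandas as pd
from pandas import read_pickle
from tqdm import tqdm
# Assumed import paths; the project may import these names from elsewhere.
from gtda.diagrams import Amplitude
from openml.datasets import get_dataset

def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]  # every column except the label
    y_train = x_y[:, -1]  # the last column is the label
    return x_train_with_topo, y_train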
def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This also requires the train set.

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]  # keep only the 14 match features
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST', y_test.shape)
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)  # one block of columns per metric
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10  # number of test rows processed per batch
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, shift))
        if i + shift > len(x_test):
            shift = len(x_test) - i  # the last batch may be shorter
        # Row-wise concatenation: x_test must have the same number of columns as x_train
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    # Column-wise concatenation: the topological features are appended to the test rows
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test
def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df
Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# best_pipeline_params -> {'k_min': 50,'k_max': 175,'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000,'max_depth': 10,'random_state': 52,'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
#                              SubSpaceExtraction(dist_percentage=0.1,k_max=175,k_min=50)),
#                             ('create_diagrams',VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565,19)
# y_train.shape -> (2565,)
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380,24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')
But I get the error:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
The loaded dataset (X_train):
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64
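Note that the first 14 columns are the features describing the match, and the remaining 5 (excluding the label) are the topological features that have already been extracted.
The problem seems to arise when the code reaches extract_x_test_features() and extract_features_for_prediction(), which are supposed to compute the topological features and stack them with the training dataset.
Since X_train already has the topological features, it adds 5 more, so I end up with 24 features.
I'm not sure, though. I'm just trying to wrap my head around this project... and around how the predictions are made here.
How do I fix the mismatch using the code above?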
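One way to reason about where the 24 columns could come from is to count columns through the two concatenations in extract_features_for_prediction(). The row-wise concatenation of the training rows with the test rows only works if both have the same number of columns, so the test rows would need 14 columns, and the remaining 10 would have to come from the topological step. The sketch below only simulates that arithmetic with placeholder arrays. The helper names are hypothetical, and the idea that each of the 5 metrics contributes more than one column (for example one per homology dimension) is an assumption that the question does not confirm.

```python
import numpy as np


def fake_amplitudes(n_samples, n_dims):
    """Stand-in for one amplitude transformer that returns n_dims columns per sample."""
    return np.zeros((n_samples, n_dims))


def stacked_width(n_base, n_metrics, n_dims, n_samples=380):
    """Number of columns after appending n_metrics blocks of n_dims columns to n_base columns."""
    base = np.zeros((n_samples, n_base))
    topo = np.concatenate(
        [fake_amplitudes(n_samples, n_dims) for _ in range(n_metrics)], axis=1
    )
    return np.concatenate([base, topo], axis=1).shape[1]


width_one_column_per_metric = stacked_width(n_base=14, n_metrics=5, n_dims=1)
width_two_columns_per_metric = stacked_width(n_base=14, n_metrics=5, n_dims=2)
print(width_one_column_per_metric, width_two_columns_per_metric)
```

Under these assumptions, one column per metric gives the 19 columns of the training set and two columns per metric gives 24. Printing the shape returned by a single amplitude.fit_transform(diagrams) call in the real code would show which case applies.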
Notes:
1 - x_train and y_test are not DataFrames but numpy.ndarray
2 - This problem is fully reproducible if you clone or download the project from the following link:
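Solution
The answer is actually already given in the question.
In your question you mention # x_test.shape -> (380,24) and # x_train.shape -> (2565,19). The shape of your test data clearly does not match that of your training data: the training data has 19 features while the test data has 24, and both must contain the same number of features. That is why you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input" when you pass x_test to the model on this line: get_probabilities(rf_model,x_test,team_ids).
So your test data must have the same number of features as your training data.
Now, why are there 19 features in your x_train and 24 in your X_test?
To solve it, display both data frames (x_train and X_test) and try to work out why they have different features. In the end, both must have the same number of columns holding the same features; otherwise you will get this error.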
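The comparison suggested above can be sketched as a small helper. Since x_train and x_test are NumPy arrays here, the column names would have to be collected from the underlying DataFrames before calling .values (for instance with list(df.columns)). The function name and the column names in the usage example are hypothetical.

```python
def compare_columns(train_cols, test_cols):
    """Report which column names differ between the training and the test set."""
    train_set, test_set = set(train_cols), set(test_cols)
    return {
        "only_in_train": [c for c in train_cols if c not in test_set],
        "only_in_test": [c for c in test_cols if c not in train_set],
        "same_order": list(train_cols) == list(test_cols),
    }


# Hypothetical column names, for illustration only.
report = compare_columns(
    ["home_avg_attack", "heat_metric"],
    ["home_avg_attack", "heat_metric_0", "heat_metric_1"],
)
print(report)
```

An empty only_in_train and only_in_test together with same_order being True is the state the model needs before predict_proba is called.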
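It may be that something is wrong with the dataset you imported.
Also, here is how to use RandomizedSearchCV to find the best parameters for your model: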
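from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=62, max_depth=16)),
])
# cycle through your pickle file parameter combinations here:
# ('auto' is replaced by 'log2' because recent scikit-learn releases no longer accept 'auto')
param_grid = {'n_estimators': list(range(30, 100)), 'max_depth': list(range(5, 26)), 'max_features': ['sqrt', 'log2']}
# Only the 'clf' step is searched, so the scaler is not applied here
random_rf_class = RandomizedSearchCV(
    estimator=pipeline2['clf'], param_distributions=param_grid, n_iter=10, scoring='accuracy',
    n_jobs=2, cv=10, refit=True, return_train_score=True)
random_rf_class.fit(X_train, y_train)
predictions = random_rf_class.predict(X_test)
print("Model accuracy {}%".format(accuracy_score(y_test, predictions) * 100))
# Print the values tried for both hyperparameters, then the best combination
print(random_rf_class.cv_results_['param_max_depth'])
print(random_rf_class.cv_results_['param_max_features'])
print(random_rf_class.best_params_)
print(random_rf_class.best_score_)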
Returning a slice with 19 features here:
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    (...)
    return final_x_test[:, :19]
gets rid of the error and lets the tests run.
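A caveat about this slice: [:, :19] simply keeps the first 19 columns, whatever they contain. The sketch below shows which topological columns would survive if each metric produced two columns, and one possible alternative that reduces each metric to a single column with a Euclidean norm. Everything here is hypothetical: the two-columns-per-metric layout is an assumption, the column labels are invented, and the norm is not confirmed to be how the training features were computed.

```python
import numpy as np

METRICS = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
N_BASE = 14  # match features
N_DIMS = 2   # ASSUMPTION: columns produced per metric

# Invented labels, listed in the order the concatenations would produce the columns.
topo_names = [f"{metric}_dim{dim}" for metric in METRICS for dim in range(N_DIMS)]
all_names = [f"base_{i}" for i in range(N_BASE)] + topo_names

# Topological columns that survive final_x_test[:, :19]
kept_by_slice = all_names[N_BASE:19]


def collapse_per_metric(x, n_base=N_BASE, n_metrics=len(METRICS), n_dims=N_DIMS):
    """Reduce each metric's n_dims columns to one column via the Euclidean norm."""
    base, topo = x[:, :n_base], x[:, n_base:]
    topo = topo.reshape(len(x), n_metrics, n_dims)
    return np.concatenate([base, np.linalg.norm(topo, axis=2)], axis=1)


x = np.zeros((2, N_BASE + len(topo_names)))
x[:, N_BASE:N_BASE + 2] = [3.0, 4.0]  # bottleneck_dim0, bottleneck_dim1
collapsed = collapse_per_metric(x)
print(kept_by_slice)
print(collapsed.shape)
```

Under that assumed layout the slice would keep both bottleneck columns, both wasserstein columns and one landscape column, and drop betti and heat entirely, so the model would run but on columns that do not line up with the five metrics it was trained on.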
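However, I still don't get the gist of it.
I will award a bounty to anyone who explains to me the idea behind the test set in the context of this project, in the project notebook, which can be found here: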