What is the purpose of encoding if it cannot be used in prediction?

Problem description

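This is a follow-up to this question.

I thought the reason we apply OneHotEncoding is to convert string data into a numpy array?

If so, the predict statement val_predictions = soccer_model.predict(val_X) should work with the encoded data.

Here is my code so far: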

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Set option to display all the rows and columns in the dataset. If there are more rows,adjust number accordingly.
pd.set_option('display.max_rows',5000)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)

# pandas needs to know which columns hold dates at import time, so they are listed here
# and passed to read_csv via parse_dates.
date_col = ['Date']
df = pd.read_csv(
    r'C:\Users\harsh\Documents\My Dream\Desktop\Machine Learning\Attempt1\Historical Data\Concat_Cleaned.csv',parse_dates=date_col,skiprows=0,low_memory=False)

# Converting/defining the columns
# Before you define column types,you need to fill all NaN with a value. We will be reconverting them later
df = df.fillna(101)
# Defining column types
convert_dict = {'League_Division': str,'HomeTeam': str,'AwayTeam': str,'Full_Time_Home_Goals': int,'Full_Time_Away_Goals': int,'Full_Time_Result': str,'Half_Time_Home_Goals': int,'Half_Time_Away_Goals': int,'Half_Time_Result': str,'Attendance': int,'Referee': str,'Home_Team_Shots': int,'Away_Team_Shots': int,'Home_Team_Shots_on_Target': int,'Away_Team_Shots_on_Target': int,'Home_Team_Hit_Woodwork': int,'Away_Team_Hit_Woodwork': int,'Home_Team_Corners': int,'Away_Team_Corners': int,'Home_Team_Fouls': int,'Away_Team_Fouls': int,'Home_Offsides': int,'Away_Offsides': int,'Home_Team_Yellow_Cards': int,'Away_Team_Yellow_Cards': int,'Home_Team_Red_Cards': int,'Away_Team_Red_Cards': int,'Home_Team_Bookings_Points': float,'Away_Team_Bookings_Points': float,}

df = df.astype(convert_dict)

# Revert the placeholder fill so the dataframe has its original NaNs back, now with the defined data types
df = df.replace('101', np.nan, regex=True)
df = df.replace(101, np.nan)

# Clean dataset by dropping null rows
data = df.dropna(axis=0)

# Column that you want to predict = y
y = data.Full_Time_Home_Goals

# Columns that are fed into the model to make predictions (independent variables); cannot include column y
features = ['HomeTeam','AwayTeam','Full_Time_Away_Goals','Full_Time_Result']
# Create X
X = data[features]

# Split into validation and training data
train_X,val_X,train_y,val_y = train_test_split(X,y,random_state=1)

# Specify Model
soccer_model = DecisionTreeRegressor(random_state=1)

# Define and fit a OneHotEncoder to turn the categorical (string) columns into a numeric array
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_X)

transformed_train_X = enc.transform(train_X)

# Fit Model
soccer_model.fit(transformed_train_X,train_y)

#  Make validation predictions and calculate mean absolute error
val_predictions = soccer_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions,val_y)
print("Validation MAE when not specifying max_leaf_nodes : {:,.0f}".format(val_mae))

The line that fails is:

val_predictions = soccer_model.predict(val_X)

The error I get is:

ValueError: could not convert string to float: 'Wolves'

You can find my sample dataset here.
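
The ValueError reported above is consistent with raw strings reaching the model unencoded. The following is a minimal sketch of that situation, not the asker's actual data: the team names and goal counts are hypothetical toy values. The tree is fitted on the encoder's numeric output, but predict is then handed the original string columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the HomeTeam / AwayTeam columns.
train_X = pd.DataFrame({
    "HomeTeam": ["Wolves", "Arsenal", "Chelsea", "Arsenal"],
    "AwayTeam": ["Arsenal", "Chelsea", "Wolves", "Wolves"],
})
train_y = [1, 2, 0, 3]
val_X = pd.DataFrame({"HomeTeam": ["Wolves"], "AwayTeam": ["Chelsea"]})

# The encoder is a separate object from the model: it turns each distinct
# string into its own 0/1 column.
enc = OneHotEncoder(handle_unknown="ignore")
encoded_train_X = enc.fit_transform(train_X)

# The model only ever sees the encoded matrix during fit().
model = DecisionTreeRegressor(random_state=1).fit(encoded_train_X, train_y)

# 3 distinct home teams + 3 distinct away teams = 6 numeric columns.
n_encoded_columns = encoded_train_X.shape[1]

# val_X still holds 2 columns of raw strings, so predict() cannot use it.
try:
    model.predict(val_X)
    raw_predict_failed = False
except ValueError as exc:
    raw_predict_failed = True
    error_message = str(exc)
```

In this sketch the model has no knowledge of the encoder, so anything passed to predict has to go through the same fitted encoder first.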

Solution
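
One likely fix, sketched here on hypothetical toy data rather than the asker's dataset, is to pass the validation features through the same fitted encoder before calling predict. The second half of the sketch shows an alternative that assumes scikit-learn's make_pipeline is acceptable in this project: bundling the encoder and the model so that the encoding step cannot be skipped. The team names, goal counts, and variable names other than those in the question are illustrative assumptions.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; "Leeds" does not appear in the training set.
train_X = pd.DataFrame({
    "HomeTeam": ["Wolves", "Arsenal", "Chelsea", "Arsenal"],
    "AwayTeam": ["Arsenal", "Chelsea", "Wolves", "Wolves"],
})
train_y = [1, 2, 0, 3]
val_X = pd.DataFrame({
    "HomeTeam": ["Wolves", "Leeds"],
    "AwayTeam": ["Chelsea", "Arsenal"],
})

# Option 1: encode the validation data with the encoder fitted on train_X.
enc = OneHotEncoder(handle_unknown="ignore")
soccer_model = DecisionTreeRegressor(random_state=1)
soccer_model.fit(enc.fit_transform(train_X), train_y)

# transform() only; refitting on val_X would change the column layout.
transformed_val_X = enc.transform(val_X)
val_predictions = soccer_model.predict(transformed_val_X)

# Option 2: let a pipeline apply the encoder on both fit() and predict().
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeRegressor(random_state=1),
)
pipeline.fit(train_X, train_y)
pipeline_predictions = pipeline.predict(val_X)
```

With handle_unknown='ignore', a team that was not seen during fitting (here "Leeds") is encoded as all zeros for that column group instead of raising an error. In the question's code, the equivalent of Option 1 would be calling soccer_model.predict(enc.transform(val_X)).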
