Problem description

I'm getting the error

[LightGBM] [Fatal] Check failed: (train_data->num_features()) > (0)

for a dataset X of shape (40, 7). I'm trying to run gradient boosting with a custom loss function, and the error is raised on the line
gbm.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=custom_asymmetric_valid,
    verbose=False,
)
The full code is below:
import lightgbm
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
train = pd.read_csv("Data_Train.csv")
X, y = train.iloc[:, 1:-1], train.iloc[:, -1]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)
print(np.shape(X_train), np.shape(X_valid))
test = pd.read_csv("Data_Test.csv")
X_test, y_test = test.iloc[:, 1:-1], test.iloc[:, -1]
# Defining custom loss function
def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess

def custom_asymmetric_valid(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, (residual ** 2) * 10.0, residual ** 2)
    return "custom_asymmetric_eval", np.mean(loss), False
# default lightgbm model with sklearn api
gbm = lightgbm.LGBMRegressor(random_state=33)
# updating objective function to custom
# default is "regression"
# also adding metrics to check different scores
gbm.set_params(**{'objective': custom_asymmetric_train}, metrics=["mse", 'mae'])
# fitting model
gbm.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=custom_asymmetric_valid,
    verbose=False,
)
y_pred = gbm.predict(X_valid)
# create dataset for lightgbm
lgb_train = lightgbm.Dataset(X_train, y_train, free_raw_data=False)
lgb_eval = lightgbm.Dataset(X_valid, y_valid, reference=lgb_train, free_raw_data=False)
params = {'objective': 'regression', 'verbose': 0}
gbm = lightgbm.train(
    params,
    lgb_train,
    num_boost_round=10,
    init_model=gbm,
    fobj=custom_asymmetric_train,
    feval=custom_asymmetric_valid,
    valid_sets=lgb_eval,
)
y_pred = gbm.predict(X_valid)
Solution

Your original example can't be fully reproduced (since the contents of "Data_Train.csv" were not shared), but I can reproduce this error with LightGBM 3.1.1 (installed with pip install lightgbm).
import lightgbm as lgb
import numpy as np
import pandas as pd
np.random.seed(708)
def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess
# create a training dataset of shape (40, 7)
X = pd.DataFrame({
    f"feat_{i}": np.random.random((40,))
    for i in range(7)
})
y = np.random.random((40,))
gbm = lgb.LGBMRegressor()
gbm.set_params(**{'objective': custom_asymmetric_train}, metrics=["mse", 'mae'])
gbm.fit(X,y)
LightGBMError: Check failed: (train_data->num_features()) > (0)
LightGBM has several parameters intended to prevent overfitting. Two of them are relevant here:

- min_data_in_leaf (default = 20)
- min_sum_hessian_in_leaf (default = 0.001)

By default, while constructing the Dataset object, LightGBM filters out any features that cannot be split under these conditions (see feature_pre_filter).
LightGBM's parameter defaults are chosen to give good performance on moderately sized datasets. A dataset of shape (40, 7) is very small, which increases the risk that every feature is unsplittable.

To accommodate such a small dataset, you can override those defaults and set them to 0 or smaller values. The code below trains successfully, with no error.
import lightgbm as lgb
import numpy as np
import pandas as pd

np.random.seed(708)

def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess

# same (40, 7) training data as above
X = pd.DataFrame({f"feat_{i}": np.random.random((40,)) for i in range(7)})
y = np.random.random((40,))

gbm = lgb.LGBMRegressor(
    min_sum_hessian_in_leaf=0,
    min_data_in_leaf=0,
)
gbm.set_params(**{'objective': custom_asymmetric_train}, metrics=["mse", 'mae'])
gbm.fit(X, y)