Am I using sklearn's SGDRegressor with the Nystroem method correctly for a large dataset?

Problem description

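In my case, I want to train my model using SVR with an RBF kernel, but my training set is too large: it contains about 16 million samples, each with 1200 dimensions. I read the sklearn documentation, which suggests using SGDRegressor with the Nystroem method instead for large datasets, so I split the data into 32 batches and feed them to the model via partial_fit. My question is about the Nystroem method: is it correct to fit it on a single batch and then transform the remaining 31 batches with that same fit? Here is my code: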

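import time
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDRegressor
from sklearn.utils import shuffle

# trainset and create_data are defined elsewhere (create_data is my own function)
batch_size = 500000
n_batches = trainset.shape[0] // batch_size
feature_map_nystroem = Nystroem(gamma=.031, n_components=1000, random_state=1)
# note: max_iter is ignored by partial_fit(), which runs a single pass per call
svm = SGDRegressor(loss='epsilon_insensitive', max_iter=1, alpha=0.001, epsilon=0.5, shuffle=False, warm_start=True)
X_batch, Y_batch = create_data(trainset[:batch_size])  # my own function
feature_map_nystroem.fit(X_batch)  # using 1st batch to fit Nystroem method

num_epochs = 10
for epoch in range(num_epochs):
    t = time.time()
    print('epoch: ', epoch)
    trainset = shuffle(trainset)  # affects rows
    trainset_batches = np.array_split(trainset, n_batches)
    for it, batch in enumerate(trainset_batches, start=1):
        print('batch', it)
        X_train_batch, Y_train_batch = create_data(batch)  # my own function
        # reuse the already-fitted feature map; it is never refitted here
        X_train_batch = feature_map_nystroem.transform(X_train_batch)
        svm.partial_fit(X_train_batch, Y_train_batch)
    print(time.time() - t)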

After 1 epoch my RMSE was 0.8767. I then kept training for 30 epochs, but the RMSE was 0.8779. Is my code correct, especially the fit of the Nystroem method? Thanks a lot!

Solution

No working solution to this problem has been found yet.
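For reference, the pattern being asked about (fit Nystroem once, then reuse the fitted map to transform every batch before partial_fit) can be sketched on a small scale. This is only an illustration, not a confirmed fix for the stalled RMSE. The synthetic data, the sizes, and the choice to fit the feature map on a random subsample of the whole set rather than on the first batch are all assumptions made for the example. The point it shows is that transform() uses the components chosen during fit(), so the same rows map to the same features whichever batch they appear in.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real training set: 2000 samples, 20 features.
X = rng.normal(size=(2000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)

# Assumption: fit the feature map on a random subsample of the whole set,
# so the basis points are not tied to the ordering of the first batch.
subsample = rng.choice(X.shape[0], size=500, replace=False)
feature_map = Nystroem(gamma=0.031, n_components=100, random_state=1)
feature_map.fit(X[subsample])

model = SGDRegressor(loss="epsilon_insensitive", alpha=0.001, epsilon=0.5)

n_batches = 4
for epoch in range(3):
    # Reshuffle the row order each epoch, then walk through it in batches.
    order = rng.permutation(X.shape[0])
    for idx in np.array_split(order, n_batches):
        # The same fitted map is reused for every batch; it is never refitted.
        Z = feature_map.transform(X[idx])
        model.partial_fit(Z, y[idx])

# Training RMSE in the approximate kernel feature space.
Z_all = feature_map.transform(X)
rmse = float(np.sqrt(np.mean((model.predict(Z_all) - y) ** 2)))
print("train RMSE:", rmse)
```

If the real data is stored in some meaningful order, drawing the rows for the Nystroem fit from across the whole set, as above, may give a more representative basis than taking the first 500000 rows. Whether that changes the RMSE in this case is untested.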

machine-learning scikit-learn sgd svm