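How can I improve a trained model's predictions on a different dataset?

Problem description

I have two datasets. Each contains chemical molecules and their solubility values (numeric). The target (dependent) variable is solubility. The inputs to the machine learning model are features extracted from the molecular structures (also known as fingerprints).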

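Dataset A: 193 observations (rows), used for training and validation

Dataset B: 151 observations (rows), used only for prediction

Problem type: regression

Model: Gaussian process regressor

The features (columns) of the two datasets are similar.

This is the procedure I tried: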

1) Train and validate the model on dataset A with LOOCV (R2 = 0.75)
2) Fit the model on the whole training set, i.e. all of dataset A (R2 > 0.9)
3) Use the trained model to predict dataset B (R2 < 0.1)
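
For concreteness, the three steps above could be sketched as follows. This is only a minimal illustration assuming scikit-learn; the function name `evaluate_transfer`, the RBF plus white-noise kernel, and `normalize_y=True` are assumptions made for the example, not details of the actual setup described in the question.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def evaluate_transfer(X_a, y_a, X_b, y_b, random_state=0):
    """Hypothetical helper reproducing steps 1-3 of the procedure."""
    # Assumed kernel: the question does not say which kernel was used.
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    model = GaussianProcessRegressor(
        kernel=kernel, normalize_y=True, random_state=random_state
    )

    # Step 1: leave-one-out cross-validation on dataset A.
    loo_pred = cross_val_predict(model, X_a, y_a, cv=LeaveOneOut())
    r2_loocv = r2_score(y_a, loo_pred)

    # Step 2: refit on all of dataset A and score on the same data.
    model.fit(X_a, y_a)
    r2_train = r2_score(y_a, model.predict(X_a))

    # Step 3: predict dataset B; the predictive standard deviation
    # shows how uncertain the model is about each molecule in B.
    pred_b, std_b = model.predict(X_b, return_std=True)
    r2_b = r2_score(y_b, pred_b)

    return {
        "r2_loocv": r2_loocv,
        "r2_train": r2_train,
        "r2_b": r2_b,
        "std_b": std_b,
    }
```

Returning the predictive standard deviation for dataset B is a design choice of this sketch: with a Gaussian process, large values there would be one sign that B lies far from the training data.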

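Note 1: I am not allowed to merge or shuffle the datasets.

Note 2: I do not use the molecular structures directly as input features. I use the RDKit Python library to extract features (called fingerprints) from the molecules. The number of features and the data preprocessing steps are similar for both datasets.

I noticed that the molecules in datasets A and B are different. It seems the model has to solve an extrapolation problem.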
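
One way to put a number on how different the molecules are is to compute, for every molecule in B, its highest Tanimoto similarity to any molecule in A. The sketch below assumes the fingerprints are binary bit vectors stored as 0/1 arrays, which the question does not state; the function names are hypothetical, and only NumPy is used.

```python
import numpy as np


def tanimoto_matrix(fp_query, fp_ref):
    """Pairwise Tanimoto similarity between two sets of binary fingerprints.

    Both inputs are assumed to be 2-D arrays of 0/1 values with one
    fingerprint per row and the same number of bits.
    """
    q = np.asarray(fp_query, dtype=bool).astype(np.int64)
    r = np.asarray(fp_ref, dtype=bool).astype(np.int64)
    shared = q @ r.T                       # bits set in both fingerprints
    q_bits = q.sum(axis=1)[:, None]        # bits set in each query row
    r_bits = r.sum(axis=1)[None, :]        # bits set in each reference row
    union = q_bits + r_bits - shared       # bits set in either fingerprint
    # Convention for this sketch: two all-zero fingerprints get similarity 0.
    out = np.zeros(shared.shape, dtype=float)
    return np.divide(shared, union, out=out, where=union > 0)


def nearest_neighbor_similarity(fp_b, fp_a):
    """For each row of B, the highest similarity to any row of A."""
    return tanimoto_matrix(fp_b, fp_a).max(axis=1)
```

If most molecules in B have a low nearest-neighbor similarity to A, that would support the impression that the model is being asked to extrapolate.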

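My question is how I can improve the predictions when the datasets are different.

One idea might be to use a single observation (one row of data) as a "reference" and train the model on differences from it, as explained below: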

1) Select one observation from dataset A as the Reference
2) For each observation in dataset A, calculate the difference of its features and target variable from the Reference
3) Train the model on these difference values
4) For each observation in dataset B, calculate the difference of its features from the Reference
5) Use these differences as input to the trained model to predict the target variable
6) Add the Reference's target value to the values predicted in step 5 for dataset B
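
The six steps above can be written down compactly. The following is a rough sketch rather than a tested recipe: `predict_with_reference` is a hypothetical helper, and `model` is assumed to be any estimator with scikit-learn style `fit` and `predict` methods, for example a Gaussian process regressor.

```python
import numpy as np


def predict_with_reference(model, X_a, y_a, X_b, ref_index=0):
    """Train on differences from a reference row of A, then predict B."""
    X_a = np.asarray(X_a, dtype=float)
    y_a = np.asarray(y_a, dtype=float)
    X_b = np.asarray(X_b, dtype=float)

    # Step 1: pick the reference observation from dataset A.
    x_ref, y_ref = X_a[ref_index], y_a[ref_index]

    # Steps 2-3: train on feature and target differences from the reference.
    model.fit(X_a - x_ref, y_a - y_ref)

    # Steps 4-5: predict target differences for dataset B.
    delta_pred = model.predict(X_b - x_ref)

    # Step 6: add the reference target back to get absolute predictions.
    return np.asarray(delta_pred) + y_ref
```

Written this way, the scheme subtracts the same constant vector from every row and the same constant from every target. With a stationary kernel such as RBF, which depends only on differences between inputs, shifting all features leaves the kernel matrix unchanged, so the main effect is to center the targets on the Reference value. That suggests the scheme by itself may not change much about how the model extrapolates to dataset B.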

I am not sure whether using a Reference is a valid idea. If you have any ideas for improving the predictions, please let me know.
