How to apply SMOTE to a TensorFlow dataset without running out of RAM

Problem description

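I have been working with an imbalanced dataset of nearly 17k images, and I have been trying to apply an oversampling technique such as SMOTE using the imbalanced-learn library. The images and labels are loaded as tensors, whereas the methods available in imbalanced-learn require NumPy arrays. I tried extracting the images from the TensorFlow dataset, but after about 10,000 images my Google Colab session crashed because it ran out of RAM. I also looked for other approaches but could not find any. That is why I would like to know whether you have any suggestions that could help me solve this problem.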

I followed these steps:

I imported the data using tf.keras.preprocessing.image_dataset_from_directory.

import tensorflow as tf

def create_dataset(folder_path, name, split, seed, shuffle=True):
  # Load one subset ('training' or 'validation') of a folder with one subdirectory per class.
  return tf.keras.preprocessing.image_dataset_from_directory(
    folder_path, labels='inferred', label_mode='categorical', color_mode='rgb',
    batch_size=32, image_size=(320, 320), shuffle=shuffle, interpolation='bilinear',
    validation_split=split, subset=name, seed=seed)

valid_split = 0.3
# Both subsets need the same split and seed so that they do not overlap.
train_set = create_dataset(dir_path, 'training', valid_split, 42, shuffle=True).prefetch(1)
valid_set = create_dataset(dir_path, 'validation', valid_split, 42, shuffle=True).prefetch(1)

# output:
# Found 16718 files belonging to 38 classes.
# Using 11703 files for training.
# Found 16718 files belonging to 38 classes.
# Using 5015 files for validation.
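
A rough size estimate helps explain the crash. The sketch below only does arithmetic on the figures reported above (11,703 training images at 320x320 RGB) and relies on image_dataset_from_directory yielding float32 image tensors. The comparison with Colab's RAM is an assumption, since the question does not state how much memory the session had.

```python
# Figures taken from the question: 11,703 training images of 320x320 RGB.
n_train = 11703
height, width, channels = 320, 320, 3
bytes_per_float32 = 4  # the images come out of the dataset as float32

bytes_per_image = height * width * channels * bytes_per_float32
total_bytes = n_train * bytes_per_image
bytes_at_10k = 10_000 * bytes_per_image

print(f"per image: {bytes_per_image / 1e6:.2f} MB")            # per image: 1.23 MB
print(f"all training images: {total_bytes / 1e9:.2f} GB")      # all training images: 14.38 GB
print(f"first 10,000 images: {bytes_at_10k / 1e9:.2f} GB")     # first 10,000 images: 12.29 GB
```

About 12 GB after 10,000 images is consistent with a session that has roughly that much RAM running out at that point. Collecting every batch in a Python list and then concatenating them also needs the list and the result in memory at the same time, so the peak is closer to twice the final array size.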

Then I ran this line of code to pull the images out of the tf dataset as a NumPy array, but, as I said, this is the point at which my session crashed.

import numpy as np
X_train = np.concatenate([x for x, y in train_set], axis=0)
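
One possible way to get past the extraction step, offered as a sketch rather than a verified fix, is to copy the batches one at a time into a disk-backed uint8 array instead of building a list of float32 batches. The helper name dataset_to_memmap, the file name, and the synthetic fake_batches generator are all hypothetical and exist only for illustration. Rounding to uint8 discards the fractional pixel values produced by bilinear resizing.

```python
import os
import tempfile

import numpy as np


def dataset_to_memmap(batches, n_images, image_shape, path):
    """Copy (images, one_hot_labels) batches into a disk-backed uint8 array.

    `batches` is any iterable of (images, labels) pairs; this helper is a
    hypothetical illustration, not part of TensorFlow or imbalanced-learn.
    """
    X = np.memmap(path, dtype=np.uint8, mode="w+", shape=(n_images, *image_shape))
    y = np.empty(n_images, dtype=np.int64)
    start = 0
    for images, labels in batches:
        images = np.asarray(images)
        stop = start + len(images)
        # Pixel values are floats in [0, 255]; store them as 1 byte instead of 4.
        X[start:stop] = np.clip(np.rint(images), 0, 255).astype(np.uint8)
        # Turn one-hot ('categorical') labels into integer class indices.
        y[start:stop] = np.argmax(np.asarray(labels), axis=1)
        start = stop
    X.flush()
    return X[:start], y[:start]


def fake_batches(n_images, batch_size, image_shape, n_classes, seed=0):
    """Synthetic stand-in for the real dataset, used only to exercise the helper."""
    rng = np.random.default_rng(seed)
    for first in range(0, n_images, batch_size):
        n = min(batch_size, n_images - first)
        images = rng.uniform(0, 255, size=(n, *image_shape)).astype(np.float32)
        labels = np.eye(n_classes, dtype=np.float32)[rng.integers(0, n_classes, size=n)]
        yield images, labels


tmp_path = os.path.join(tempfile.mkdtemp(), "X_train.dat")
X_small, y_small = dataset_to_memmap(
    fake_batches(10, 4, (8, 8, 3), 38), 10, (8, 8, 3), tmp_path)

# SMOTE-style samplers expect a 2D (n_samples, n_features) array.
X_flat = X_small.reshape(len(X_small), -1)
print(X_flat.shape, X_flat.dtype, y_small.shape)
```

With the real data the call would look something like dataset_to_memmap(train_set.as_numpy_iterator(), 11703, (320, 320, 3), "X_train.dat"), which takes about 3.6 GB on disk. This only addresses the extraction step. As far as I know, imbalanced-learn's SMOTE works on an in-memory 2D array, so resampling 307,200 raw pixel features per image may still exceed the available RAM, and a smaller image_size or lower-dimensional features may be needed.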

Thank you for your support.


imbalanced-data numpy ram smote tensorflow