Problems training a convolutional neural network (CNN) on a TPU when using the ImageDataGenerator class for image data augmentation?

Problem description

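Recently I have been training a CNN, namely AlexNet, to classify brain MRI images into four classes, but training it on the CPU or GPU of my Google Colab runtime takes a long time, about 5 hours. I would like to move training to a TPU, since that hardware is built specifically for matrix computations, but I get the error below and cannot find any way to resolve it.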

TensorFlow version: 2.5.0

Source code for checking and initializing the TPU (if one is allocated to the runtime):

import os
import tensorflow as tf

print("OS Version & Details: ")
!lsb_release -a
print()

gpu_device_location = tpu_device_location = cpu_device_location = None

if os.environ.get('COLAB_GPU') == '1':
    print("Allocated GPU Runtime Details:")
    !nvidia-smi
    print()
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_device_name = pynvml.nvmlDeviceGetName(handle)
 
        if gpu_device_name not in {b'Tesla T4',b'Tesla P4',b'Tesla P100-PCIE-16GB'}:
            raise Exception("Unfortunately this instance does not have a T4, P4 or P100 GPU.\nSometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100.\nIf you get a Tesla K80, you can factory reset your runtime to get another GPU.")
    except Exception as hardware_exception:
        print(hardware_exception,end = '\n\n')
    gpu_device_location = tf.test.gpu_device_name()
    print(f"{gpu_device_name.decode('utf-8')} is allocated successfully at location: {gpu_device_location}")
elif 'COLAB_TPU_ADDR' in os.environ:
    tpu_device_location = f"grpc://{os.environ['COLAB_TPU_ADDR']}"
    print(f"TPU is allocated successfully at location: {tpu_device_location}.")
    # Connect to the TPU at the gRPC address built above and create the strategy on it.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_device_location)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    tpu_strategy = tf.distribute.TPUStrategy(resolver)
else:
    cpu_device_location = "/cpu:0"
    print("No GPU or TPU is allocated, so the runtime fell back to the CPU.")

Data augmentation using ImageDataGenerator:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = 224
batch_size = 16

image_datagen_kwargs = dict(rescale = 1 / 255,rotation_range = 15,width_shift_range = 0.1,zoom_range = 0.01,shear_range = 0.01,brightness_range = [0.3,1.5],horizontal_flip = True,vertical_flip = True)

train_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
validation_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
test_image_datagen = ImageDataGenerator(**image_datagen_kwargs)

train_dataset = train_image_datagen.flow_from_dataframe(train_data,x_col = 'image_filepaths',y_col = 'tumor_class',seed = 42,batch_size = batch_size,target_size = (image_size,image_size),color_mode = 'grayscale')
validation_dataset = validation_image_datagen.flow_from_dataframe(validation_data,color_mode = 'grayscale')
test_dataset = test_image_datagen.flow_from_dataframe(test_data,color_mode = 'grayscale')

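Essentially, once you create an instance of the ImageDataGenerator class, you can call its flow_from_dataframe() method, which returns a DataFrameIterator instance. You can iterate over it to get augmented variants of the images, generated according to the transformations you specified.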

AlexNet CNN architecture using Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPool2D, Flatten, Dense, Dropout

AlexNet_cnn = Sequential()
AlexNet_cnn.add(Conv2D(96,kernel_size = 11,strides = 4,activation = 'relu',input_shape = (image_size,image_size,1),name = 'Conv2D-1'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-1'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,strides = 2,name = 'max-pooling-1'))
AlexNet_cnn.add(Conv2D(256,kernel_size = 5,padding = 'same',name = 'Conv2D-2'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-2'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,name = 'max-pooling-2'))
AlexNet_cnn.add(Conv2D(384,kernel_size = 3,name = 'Conv2D-3'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-3'))
# Conv2D-4 and Conv2D-5 omit the required kernel_size argument as posted; it must be supplied before this runs.
AlexNet_cnn.add(Conv2D(384,name = 'Conv2D-4'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-4'))
AlexNet_cnn.add(Conv2D(256,name = 'Conv2D-5'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-5'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,name = 'max-pooling-3'))
AlexNet_cnn.add(Flatten(name = 'Flatten-Layer-1'))
AlexNet_cnn.add(Dense(1024,name = 'Hidden-Layer-1'))
AlexNet_cnn.add(Dropout(rate = 0.5,name = 'Dropout-Layer-1'))
AlexNet_cnn.add(Dense(4,activation = 'softmax',name = 'Output-Layer'))
AlexNet_cnn.compile(optimizer = 'Adam',loss = 'categorical_crossentropy',metrics = ['accuracy'])

When I start training the CNN above with the following code:

AlexNet_train_history = AlexNet_cnn.fit(train_dataset,validation_data = validation_dataset,epochs = cnn_epochs)

I get the following error:

UnavailableError: 8 root error(s) found.
  (0) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"Failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultideviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_6849197215061331409/_5/_261]]
  (1) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
     [[{{node MultideviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[OptionalHasValue_6/_14]]
     [[OptionalHasValue_8/_17]]
  (2) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
     [[{{node MultideviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_109/_308]]
  (3) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
     [[{{node MultideviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_12/switch_pre ... [truncated]

I searched for this error and found the issue "ImageDataGenerator does not work with tpu #34346", which indicates that in older versions of TensorFlow, TPUs do not work with DataFrameIterators.

Is there any way to solve this problem, or a way to convert a DataFrameIterator instance into something the TPU supports, such as TFRecords?

Solution

I ran into the same problem. Try using:

train_image_datagen = tf.keras.preprocessing.image.ImageDataGenerator(**image_datagen_kwargs)

Alternatively, try tf.keras.preprocessing.image_dataset_from_directory or tf.data.Dataset, combined with Keras preprocessing layers.
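
To make the tf.data suggestion a bit more concrete, here is a minimal sketch of what such a pipeline could look like. It is not taken from the question: it assumes the images already fit in memory as a NumPy array, uses random placeholder data instead of the MRI scans, and swaps the ImageDataGenerator transforms for a few tf.image operations (flips and brightness jitter only). The names augment and make_dataset are hypothetical helpers, and the max_delta value is an arbitrary choice.

```python
import numpy as np
import tensorflow as tf

IMAGE_SIZE = 224  # same value as image_size in the question
BATCH_SIZE = 16   # same value as batch_size in the question


def augment(image, label):
    """Random flips and brightness jitter using tf.image ops (hypothetical helper)."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.2)  # arbitrary strength
    image = tf.clip_by_value(image, 0.0, 1.0)  # keep pixels in [0, 1] after jitter
    return image, label


def make_dataset(images, labels, batch_size, training=True):
    """Build a tf.data pipeline from in-memory arrays (hypothetical helper)."""
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        dataset = dataset.shuffle(len(images), seed=42)
        dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    # Drop the last partial batch so every batch has the same static shape.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(tf.data.AUTOTUNE)


# Placeholder data standing in for the real images: 64 grayscale images in [0, 1]
# and one-hot labels for 4 classes.
images = np.random.rand(64, IMAGE_SIZE, IMAGE_SIZE, 1).astype("float32")
labels = np.eye(4, dtype="float32")[np.random.randint(0, 4, size=64)]

train_dataset = make_dataset(images, labels, BATCH_SIZE)
```

A dataset built this way can be passed to fit() in place of the DataFrameIterator, for example AlexNet_cnn.fit(train_dataset, epochs = cnn_epochs), assuming the model was created and compiled inside tpu_strategy.scope(). As far as I know, a Colab TPU generally cannot read files from the notebook's local disk, so keeping the data in memory as above or reading it from Google Cloud Storage is the safer route.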

data-augmentation deep-learning keras tensorflow tpu