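Problem description
I have recently been training a CNN (AlexNet) to classify brain MRI images into four classes, but training on the CPU or GPU of my Google Colab runtime takes a long time, roughly 5 hours. I want to move training to a TPU, since that hardware is built specifically for matrix computation, but I get the error below and cannot find a way to resolve it.
TensorFlow version: 2.5.0
Source code that checks for and initializes the TPU (if one is allocated to the runtime):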
import os
import tensorflow as tf

print("OS Version & Details: ")
!lsb_release -a
print()
gpu_device_location = tpu_device_location = cpu_device_location = None
if os.environ.get('COLAB_GPU') == '1':
    print("Allocated GPU Runtime Details:")
    !nvidia-smi
    print()
    try:
        # Query the GPU model through NVML
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_device_name = pynvml.nvmlDeviceGetName(handle)
        if gpu_device_name not in {b'Tesla T4',b'Tesla P4',b'Tesla P100-PCIE-16GB'}:
            raise Exception("Unfortunately this instance does not have a T4, P4 or P100 GPU.\nSometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100.\nIf you get a Tesla K80 then you can factory reset your runtime to get another GPU.")
    except Exception as hardware_exception:
        print(hardware_exception,end = '\n\n')
    gpu_device_location = tf.test.gpu_device_name()
    print(f"{gpu_device_name.decode('utf-8')} is allocated successfully at location: {gpu_device_location}")
elif 'COLAB_TPU_ADDR' in os.environ:
    tpu_device_location = f"grpc://{os.environ['COLAB_TPU_ADDR']}"
    print(f"TPU is allocated successfully at location: {tpu_device_location}.")
    # Connect to the TPU cluster and initialize it before creating the strategy
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_device_location)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    tpu_strategy = tf.distribute.TPUStrategy(resolver)
else:
    cpu_device_location = "/cpu:0"
    print("No GPU or TPU was allocated, so the runtime fell back to the CPU.")
Data augmentation using ImageDataGenerator:
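from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = 224
batch_size = 16
image_datagen_kwargs = dict(rescale = 1 / 255,rotation_range = 15,width_shift_range = 0.1,zoom_range = 0.01,shear_range = 0.01,brightness_range = [0.3,1.5],horizontal_flip = True,vertical_flip = True)
train_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
validation_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
test_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
# train_data, validation_data and test_data are pandas DataFrames with image paths and class labels
train_dataset = train_image_datagen.flow_from_dataframe(train_data,x_col = 'image_filepaths',y_col = 'tumor_class',seed = 42,batch_size = batch_size,target_size = (image_size,image_size),color_mode = 'grayscale')
# Assumption: validation and test use the same columns, batch size and target size as training
validation_dataset = validation_image_datagen.flow_from_dataframe(validation_data,x_col = 'image_filepaths',y_col = 'tumor_class',seed = 42,batch_size = batch_size,target_size = (image_size,image_size),color_mode = 'grayscale')
test_dataset = test_image_datagen.flow_from_dataframe(test_data,x_col = 'image_filepaths',y_col = 'tumor_class',seed = 42,batch_size = batch_size,target_size = (image_size,image_size),color_mode = 'grayscale')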
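Essentially, once you create an instance of the ImageDataGenerator class, you can call its flow_from_dataframe() method, which returns an instance of the DataFrameIterator class. Iterating over it yields batches of image variants generated according to the transformations you configured.
AlexNet CNN architecture using Keras: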
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D,BatchNormalization,MaxPool2D,Flatten,Dense,Dropout

AlexNet_cnn = Sequential()
AlexNet_cnn.add(Conv2D(96,kernel_size = 11,strides = 4,activation = 'relu',input_shape = (image_size,image_size,1),name = 'Conv2D-1'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-1'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,strides = 2,name = 'max-pooling-1'))
AlexNet_cnn.add(Conv2D(256,kernel_size = 5,padding = 'same',name = 'Conv2D-2'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-2'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,name = 'max-pooling-2'))
AlexNet_cnn.add(Conv2D(384,kernel_size = 3,name = 'Conv2D-3'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-3'))
# Assumption: kernel_size is required by Conv2D but missing here; 3 with 'same' padding follows the usual AlexNet layout
AlexNet_cnn.add(Conv2D(384,kernel_size = 3,padding = 'same',name = 'Conv2D-4'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-4'))
AlexNet_cnn.add(Conv2D(256,kernel_size = 3,padding = 'same',name = 'Conv2D-5'))
AlexNet_cnn.add(BatchNormalization(name = 'Batch-normalization-5'))
AlexNet_cnn.add(MaxPool2D(pool_size = 3,name = 'max-pooling-3'))
AlexNet_cnn.add(Flatten(name = 'Flatten-Layer-1'))
AlexNet_cnn.add(Dense(1024,name = 'Hidden-Layer-1'))
AlexNet_cnn.add(Dropout(rate = 0.5,name = 'Dropout-Layer-1'))
# Four output units, one per tumor class
AlexNet_cnn.add(Dense(4,activation = 'softmax',name = 'Output-Layer'))
AlexNet_cnn.compile(optimizer = 'Adam',loss = 'categorical_crossentropy',metrics = ['accuracy'])
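As a side note on the TPU setup: Keras models are generally expected to be created and compiled inside the distribution strategy's scope, so that their variables are placed on the TPU replicas. The sketch below is only an illustration of that pattern, not the code used in this post. The helper names get_strategy and build_small_cnn are hypothetical, the small network is a stand-in for the AlexNet model above, and the code falls back to the default strategy when COLAB_TPU_ADDR is not set.

```python
import os

import tensorflow as tf
from tensorflow.keras import layers


def get_strategy():
    """Return a TPUStrategy if a Colab TPU is attached, else the default strategy."""
    if "COLAB_TPU_ADDR" in os.environ:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
            "grpc://" + os.environ["COLAB_TPU_ADDR"])
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    # No TPU available: the default strategy runs on CPU or a single GPU
    return tf.distribute.get_strategy()


def build_small_cnn(image_size=224, num_classes=4):
    """Hypothetical compact stand-in for the AlexNet model (grayscale input)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(image_size, image_size, 1)),
        layers.Conv2D(8, kernel_size=11, strides=4, activation="relu"),
        layers.MaxPool2D(pool_size=3, strides=2),
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),
    ])


strategy = get_strategy()
# Variables created inside the scope are owned by the strategy
with strategy.scope():
    model = build_small_cnn()
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
```

The same pattern would apply to the AlexNet definition: the Sequential() construction and the compile() call would both sit inside the with block.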
When I start training the CNN above with the following code:
AlexNet_train_history = AlexNet_cnn.fit(train_dataset,validation_data = validation_dataset,epochs = cnn_epochs)
I get the following error:
UnavailableError: 8 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"Failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultideviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_6849197215061331409/_5/_261]]
(1) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
[[{{node MultideviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[OptionalHasValue_6/_14]]
[[OptionalHasValue_8/_17]]
(2) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
[[{{node MultideviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_109/_308]]
(3) Unavailable: {{function_node __inference_train_function_38767}} Failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:cpu:0:
:{"created":"@1622146086.692146903","grpc_status":14}]}
[[{{node MultideviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[cond_12/switch_pre ... [truncated]
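I searched for this error and found the GitHub issue "ImageDataGenerator does not work with tpu #34346", which indicates that in older versions of TensorFlow, TPUs do not work with DataFrameIterators.
Is there a way to solve this problem, or a way to convert a DataFrameIterator instance into something the TPU supports, such as TFRecord?
Solution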
I ran into the same problem. Instead of using train_image_datagen = tf.keras.preprocessing.image.ImageDataGenerator, try using tf.keras.preprocessing.image_dataset_from_directory or tf.data.Dataset, and combine it with Keras preprocessing layers.
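A minimal sketch of the tf.data.Dataset route, under several assumptions: the names make_dummy_images, load_image and build_dataset are hypothetical, random PNG files stand in for the real MRI scans, and the three preprocessing layers only approximate part of the ImageDataGenerator settings from the question. Batches use drop_remainder=True because TPUs generally want fixed batch shapes. On a Colab TPU the workers may not be able to read files from the notebook's local disk, so the real images might have to live in a Google Cloud Storage bucket or be loaded into memory first.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

try:  # TF 2.6 and later
    from tensorflow.keras.layers import RandomFlip, RandomRotation, Rescaling
except ImportError:  # TF 2.5 keeps these layers under the experimental namespace
    from tensorflow.keras.layers.experimental.preprocessing import (
        RandomFlip, RandomRotation, Rescaling)

IMAGE_SIZE = 224
BATCH_SIZE = 16
NUM_CLASSES = 4


def make_dummy_images(directory, count):
    """Write `count` random grayscale PNGs (stand-ins for the MRI files)."""
    rng = np.random.default_rng(42)
    paths, labels = [], []
    for i in range(count):
        pixels = rng.integers(0, 256, size=(64, 64, 1), dtype=np.uint8)
        path = os.path.join(directory, f"scan_{i}.png")
        tf.io.write_file(path, tf.io.encode_png(pixels))
        paths.append(path)
        labels.append(i % NUM_CLASSES)  # integer class ids 0..3
    return paths, labels


def load_image(path, label):
    """Decode one grayscale PNG, resize it, and one-hot encode its label."""
    image = tf.io.decode_png(tf.io.read_file(path), channels=1)
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    return image, tf.one_hot(label, NUM_CLASSES)


# Rescaling always applies; the random layers are active only when training=True
augment = tf.keras.Sequential([
    Rescaling(1.0 / 255),
    RandomFlip("horizontal_and_vertical"),
    RandomRotation(15 / 360),  # factor is a fraction of a full turn (about 15 degrees)
])


def build_dataset(paths, labels, training):
    """Build a batched tf.data pipeline from file paths and integer labels."""
    ds = tf.data.Dataset.from_tensor_slices((paths, labels))
    if training:
        ds = ds.shuffle(len(paths), seed=42)
    ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    ds = ds.map(lambda x, y: (augment(x, training=training), y),
                num_parallel_calls=tf.data.AUTOTUNE)
    return ds.prefetch(tf.data.AUTOTUNE)
```

The resulting dataset could then be passed to fit() in place of the DataFrameIterator, with the paths and labels taken from the image_filepaths and tumor_class columns of the DataFrame (after mapping the class names to integers).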