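How to improve the utilization of the other GPUs in distributed training with tf.keras?

Problem description

Python 3.6, TensorFlow 2.3, Windows 10, CUDA 10.1

When I test a-layer ConvLSTM model with distributed training, there is a problem with GPU-Util.

Here is my code.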

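import numpy as np
import tensorflow as tf

# Device type strings are upper-case: 'CPU' and 'GPU'.
cpus = tf.config.list_physical_devices(device_type='CPU')
gpus = tf.config.list_physical_devices(device_type='GPU')

# Allocate GPU memory on demand instead of reserving it all up front.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(device=gpu, enable=True)

tf.config.set_visible_devices(devices=gpus, device_type='GPU')
strategy = tf.distribute.MirroredStrategy(cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

opt = tf.keras.optimizers.Adam(learning_rate=0.003)
loss = tf.keras.losses.mse

with strategy.scope():
    # model_build is my own function (not shown here).
    model = model_build(input_shape=(10, 230, 1))
    model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])

# Random stand-in data; shapes are kept as posted and must match what model_build expects.
dataset = np.random.random((40, 10, 1))
truth = np.random.random((40, 1))

# batch_size here is the global batch, which is split across the replicas.
history = model.fit(x=dataset, y=truth, batch_size=8)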

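At first I used all of the GPUs for distributed training, and GPU-Util looked like this:

[image: 4-GPU Util detail]

Only one GPU is working well, while the others show low utilization.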

Then I changed some of the code and tried distributing over 2 GPUs.

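strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"], cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

# These lines were truncated in the post; the dataset shape is kept as far as it can be read.
dataset = np.random.random((20, 1))
truth = np.random.random((20, 1))  # shape assumed by analogy with the first listing
history = model.fit(x=dataset, y=truth, batch_size=4)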

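The GPU-Util result looks like this:

[image: 2-GPU Util detail]

This looks somewhat more satisfying: utilization on both GPUs is higher than in the 4-GPU test. So how can I control the utilization of more GPUs at a high level when doing distributed training with tf.keras?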
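For reference, a quick back-of-the-envelope check of the two runs above. This is only a sketch and it assumes that the batch_size passed to model.fit under MirroredStrategy is the global batch, divided evenly among the replicas. The helper name per_replica_plan is hypothetical and not part of TensorFlow.

```python
import math


def per_replica_plan(num_samples, global_batch_size, num_replicas):
    """Return (per_replica_batch, steps_per_epoch) for a synchronous data-parallel run."""
    # Assumption: the global batch is divided evenly among the replicas.
    per_replica_batch = global_batch_size // num_replicas
    # One step consumes one global batch; the last step may be partial.
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return per_replica_batch, steps_per_epoch


# 4-GPU run from the question: 40 samples, batch_size=8.
four_gpu = per_replica_plan(num_samples=40, global_batch_size=8, num_replicas=4)
# 2-GPU run from the question: 20 samples, batch_size=4.
two_gpu = per_replica_plan(num_samples=20, global_batch_size=4, num_replicas=2)

print("4 GPUs:", four_gpu)
print("2 GPUs:", two_gpu)
```

Under that assumption, both runs give each GPU only a couple of samples per step for a handful of steps, so the measured GPU-Util may say more about per-step overhead than about the strategy itself.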

Solution

No confirmed solution to this problem has been found yet.
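Not a confirmed fix, but one thing that may be worth trying is to keep the per-replica batch fixed and scale the global batch with the number of replicas, feeding the model from a tf.data pipeline with prefetching. The sketch below illustrates the idea under stated assumptions: PER_REPLICA_BATCH is a hypothetical value, the data is random, and the small Dense model is a stand-in, not the poster's model_build().

```python
import numpy as np
import tensorflow as tf

# Uses every visible GPU, or falls back to a single CPU replica if there are none.
strategy = tf.distribute.MirroredStrategy()

# Hypothetical value: keep the per-replica batch fixed and grow the global batch
# with the number of replicas, so each GPU gets the same amount of work per step.
PER_REPLICA_BATCH = 16
global_batch = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

# Random stand-in data with flat features, only to keep the sketch small.
features = np.random.random((256, 10)).astype("float32")
targets = np.random.random((256, 1)).astype("float32")

dataset = (
    tf.data.Dataset.from_tensor_slices((features, targets))
    .shuffle(256)
    .batch(global_batch, drop_remainder=True)
    # Overlap input preparation with training steps.
    .prefetch(tf.data.experimental.AUTOTUNE)
)

with strategy.scope():
    # Stand-in model, NOT the ConvLSTM from the question.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003), loss="mse")

history = model.fit(dataset, epochs=1, verbose=0)
print("replicas:", strategy.num_replicas_in_sync, "global batch:", global_batch)
```

Whether this actually raises GPU-Util on all devices depends on the model and the hardware, so the per-replica batch is something to tune while watching nvidia-smi rather than a known-good setting.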

distributed-tensorflow keras python tensorflow