Problem description
I am trying to train embeddings for a large graph using stellargraph's WatchYourStep algorithm.
For some reason, the model trains only on the CPU and never uses the GPU.

Setup:
- tensorflow-gpu 2.3.1
- 2 GPUs, CUDA 10.1
- running inside an nvidia-docker container

What I have tried:
- I know TensorFlow does find the GPUs (confirmed with tf.debugging.set_log_device_placement(True)).
- Running under with tf.device('/GPU:0'):.
- Running it with tf.distribute.MirroredStrategy().
- Uninstalling tensorflow and reinstalling tensorflow-gpu.

Despite all this, nvidia-smi shows no activity on the GPUs, and training is very slow.
How can I debug this?
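One way to narrow this down (a sketch, assuming Python 3.8+ for `importlib.metadata`) is to list which TensorFlow wheels are actually installed. If a plain CPU `tensorflow` wheel was pulled in as a dependency alongside `tensorflow-gpu`, the CPU build can shadow the GPU build even though the GPUs are detected:

```python
from importlib import metadata

# List every installed TensorFlow distribution. With tensorflow-gpu 2.3.1 you
# would expect to see only "tensorflow-gpu"; if a plain "tensorflow" wheel
# also shows up, it was likely installed as a dependency of another package.
tf_dists = sorted(
    (dist.metadata["Name"], dist.version)
    for dist in metadata.distributions()
    if "tensorflow" in (dist.metadata["Name"] or "").lower()
)
for name, version in tf_dists:
    print(name, version)
```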
def watch_your_step_model():
    '''Use the config to generate the WatchYourStep model.'''
    cfg = load_config()
    generator = generator_for_watch_your_step()
    num_walks = cfg['num_walks']
    embedding_dimension = cfg['embedding_dimension']
    learning_rate = cfg['learning_rate']
    wys = WatchYourStep(
        generator,
        num_walks=num_walks,
        embedding_dimension=embedding_dimension,
        attention_regularizer=regularizers.l2(0.5),
    )
    x_in, x_out = wys.in_out_tensors()
    model = Model(inputs=x_in, outputs=x_out)
    model.compile(loss=graph_log_likelihood, optimizer=optimizers.Adam(learning_rate))
    return model, generator, wys
def train_watch_your_step_model(epochs=3000):
    cfg = load_config()
    batch_size = cfg['batch_size']
    steps_per_epoch = cfg['steps_per_epoch']
    callbacks, checkpoint_file = watch_your_step_callbacks(cfg)
    # strategy = tf.distribute.MirroredStrategy()
    # print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    # with strategy.scope():
    model, generator, wys = watch_your_step_model()  # unpack all three return values
    train_gen = generator.flow(batch_size=batch_size, num_parallel_calls=8)
    # prefetch() returns a new dataset, so reassign it instead of discarding it
    train_gen = train_gen.prefetch(20480000)
    history = model.fit(
        train_gen,
        epochs=epochs,
        verbose=1,
        steps_per_epoch=steps_per_epoch,
        callbacks=callbacks,
    )
    copy_last_trained_wys_weights_to_data()
    return history, checkpoint_file

with tf.device('/GPU:0'):
    train_watch_your_step_model()
Solution
I simply followed the instructions here: https://github.com/stellargraph/stellargraph/issues/546, and it worked for me.
Basically, you have to edit the file setup.py from the stellargraph GitHub repo and remove the tensorflow requirement (lines 25 and 27 of https://github.com/stellargraph/stellargraph/blob/develop/setup.py), then install stellargraph from that edited source so pip does not replace your tensorflow-gpu install with the plain CPU wheel.
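A minimal sketch of that workaround (the `install_requires` text below is a stand-in written for this demo, not stellargraph's actual setup.py; in a real clone you would edit the repo's own file and then run `pip install .` from the repo root):

```python
from pathlib import Path

# Stand-in for a cloned stellargraph setup.py; in the real file the pinned
# tensorflow requirement sits around lines 25 and 27.
setup_py = Path("setup.py")
setup_py.write_text(
    'install_requires = [\n'
    '    "tensorflow>=2.1.0,<3.0",\n'
    '    "numpy>=1.18",\n'
    ']\n'
)

# Drop every requirement line that pins plain "tensorflow", so that a later
# `pip install .` keeps the already-installed tensorflow-gpu wheel.
kept = [
    line
    for line in setup_py.read_text().splitlines(keepends=True)
    if '"tensorflow' not in line
]
setup_py.write_text("".join(kept))
print(setup_py.read_text())
```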