OOM: Out-of-memory error during hyperparameter optimization with Talos on a TensorFlow model

Problem description

While searching for the best hyperparameters for my AlexNet with the help of Talos, I get an "out of memory" error. It always happens at the same point of the scan (32/240), even when I change the parameters slightly (to rule out an unfortunate parameter combination as the cause).

Error message:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,96,26,26] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node max_pooling2d_1/MaxPool (defined at D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_246047]

Function call stack:
keras_scratch_graph
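Following that hint, the allocation report can be requested through tf.compat.v1.RunOptions; a minimal sketch for the TF1-style session path used below (the sess.run call is only illustrative, Keras would have to forward the options itself):

run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom = True)
# The options object is passed to the session call that executes the graph,
# e.g. sess.run(fetches, options = run_opts)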

Here is my code:

Session configuration:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth=True
config.gpu_options.per_process_gpu_memory_fraction = 0.99
sess = tf.compat.v1.Session(config = config)
K.set_session(sess)
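As an aside, allow_growth requests on-demand allocation while per_process_gpu_memory_fraction = 0.99 lets TF claim almost the entire card; a growth-only variant worth trying, purely as a suggestion rather than part of the original setup:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # allocate on demand, no 99% reservation
sess = tf.compat.v1.Session(config = config)
K.set_session(sess)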

Setting up and fitting AlexNet:

def AlexNet(x_train, y_train, x_val, y_val, params):
    
    K.clear_session()
    
    if params['activation'] == 'leakyrelu':
        activation_layer = LeakyReLU(alpha = params['leaky_alpha'])
    elif params['activation'] == 'relu':
        activation_layer = ReLU()
    
    model = Sequential([
        Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), activation='relu', input_shape=(224,224,Global.num_image_channels)),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding="same"),
        Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding="same"),
        Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding="same"),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Flatten(),
        Dense(4096, activation=activation_layer),
        Dropout(0.5), #todo
        Dense(4096, activation=activation_layer),
        Dropout(0.5), #todo
        Dense(units=2, activation=activation_layer)
        #Dense(10, activation='softmax')
    ])
        
    model.compile(
        optimizer = params['optimizer'](lr = lr_normalizer(params['lr'], params['optimizer'])),
        loss = Global.loss_funktion,
        metrics = get_reduction_metric(Global.reduction_metric)
    )
    train_generator, valid_generator = create_data_pipline(params['batch_size'], params['samples'])
    tg_steps_per_epoch = train_generator.n // train_generator.batch_size
    vg_validation_steps = valid_generator.n // valid_generator.batch_size
    print('Steps per Epoch: {}, Validation Steps: {}'.format(tg_steps_per_epoch, vg_validation_steps))
    
    
    startTime = datetime.now()
    
    out = model.fit(
        x = train_generator,
        epochs = params['epochs'],
        validation_data = valid_generator,
        steps_per_epoch = tg_steps_per_epoch,
        validation_steps = vg_validation_steps,
        #callbacks = [checkpointer]
        workers = 8
    )
    print("Time taken:",datetime.Now() - startTime)

    return out, model

The hyperparameter dictionary:

hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2],
    'first_neuron': [1024, 2048, 4096],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [0, 1, 2, 3, 4],
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}
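For reference, this grid yields 2 · 2 · 3 · 2 · 2 · 5 = 240 permutations, which matches the 240 in the failing round (32/240) mentioned above; a quick sketch to verify the count:

from functools import reduce

# Multiply the number of options per hyperparameter to get the grid size
n_permutations = reduce(lambda acc, options: acc * len(options), hyper_parameter.values(), 1)
print(n_permutations) # 240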

Running Talos:

dummy_x = np.empty((1,224))
dummy_y = np.empty((1,2))

with tf.device('/device:GPU:0'):

    t = ta.Scan(
        x = dummy_x,
        y = dummy_y,
        model = AlexNet,
        params = hyper_parameter,
        experiment_name = '{}'.format(Global.dataset),
        #shuffle = False,
        reduction_metric = Global.reduction_metric,
        disable_progress_bar = False,
        print_params = True,
        clear_session = 'tf',
        save_weights = False
    )
        

t.data.to_csv(Global.target_dir + Global.results, index = True)

The memory usage is always high, but it does not grow over time; it just fluctuates.

nvidia-smi output:

[nvidia-smi screenshot]

Can someone help me here?

==========================================================================

What I have already tried

1) Splitting the Talos run:

This raised the same error:

hyper_parameter = {
    'samples': [20000],
    'epochs': [5],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2],
    'first_neuron': [9999], # overwritten in the loop below
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [9999], # overwritten in the loop below
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}

dummy_x = np.empty((1,224))
dummy_y = np.empty((1,2))
first = True

for h in [0, 1, 2, 3, 4]:
    hyper_parameter['hidden_layers']=[h]
    for fn in [1024, 2048, 4096]:
        hyper_parameter['first_neuron']=[fn]

        with tf.device('/device:GPU:1'):

            t = ta.Scan(
                x = dummy_x,
                y = dummy_y,
                model = AlexNet,
                params = hyper_parameter,
                experiment_name = '{}'.format(Global.dataset),
                reduction_metric = Global.reduction_metric,
                disable_progress_bar = False,
                print_params = True,
                clear_session = 'tf',
                save_weights = False
            )
            if first:
                t.data.to_csv(Global.target_dir + Global.results, index = True, mode = 'a')
                first = False
            else:
                t.data.to_csv(Global.target_dir + Global.results, index = True, mode = 'a', header = False)

==========================================================================

2) Running the model in its own thread

While searching for the cause, I found some people complaining about the same problem and blaming TensorFlow for not actually executing K.clear_session().
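In that spirit, the cleanup can also be forced by hand between permutations; a minimal sketch (the explicit gc.collect() is an addition, not something Talos calls itself):

import gc
from keras import backend as K

def force_cleanup():
    # Drop the current graph/session state, then force Python garbage collection
    K.clear_session()
    gc.collect()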

Maybe the idea is silly, but I tried to train the model in an extra thread.

from threading import Thread

def gen_model_thread(x_train, y_train, x_val, y_val, params):
    
    thread = Thread(target = AlexNet, args = (x_train, y_train, x_val, y_val, params))
    thread.start()
    return_value = thread.join()
    return return_value
with tf.device('/device:GPU:0'):

    t = ta.Scan(
        x = dummy_x,
        y = dummy_y,
        model = gen_model_thread,
        params = hyper_parameter,
        experiment_name = '{}'.format(Global.dataset),
        reduction_metric = Global.reduction_metric,
        disable_progress_bar = False,
        print_params = True,
        clear_session = True,
        save_weights = False
    )

This resulted in a TypeError:

Traceback (most recent call last):
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-3-2942ae0a0a56>", line 5, in gen_model
    model = AlexNet(params['activation'], params['leaky_alpha'])
  File "<ipython-input-2-2a405202aa5a>", line 27, in AlexNet
    Dense(units = 2, activation = activation_layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 94, in __init__
    self.add(layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 162, in add
    name=layer.name + '_input')
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 178, in Input
    input_tensor=tensor)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 87, in __init__
    name=self.name)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py", line 73, in symbolic_fn_wrapper
    if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'

TypeError: cannot unpack non-iterable NoneType object
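For what it's worth, the AttributeError comes from Keras 2.3.1 keeping its symbolic scope in thread-local storage that is initialized only on the importing thread, so worker threads never see .value. A subprocess avoids both that and lingering GPU state, since all memory is released when the process exits; a rough sketch under the assumption that only the (picklable) training history is needed back:

import multiprocessing as mp

def alexnet_worker(x_train, y_train, x_val, y_val, params, queue):
    # Fresh interpreter and fresh CUDA context; everything is freed on exit
    out, model = AlexNet(x_train, y_train, x_val, y_val, params)
    queue.put(out.history)

def gen_model_process(x_train, y_train, x_val, y_val, params):
    ctx = mp.get_context('spawn')  # do not inherit the parent's CUDA context
    queue = ctx.Queue()
    p = ctx.Process(target = alexnet_worker, args = (x_train, y_train, x_val, y_val, params, queue))
    p.start()
    history = queue.get()
    p.join()
    return history

Plugging this into ta.Scan would still need adapting, since Talos expects the model function to return the fitted model as well.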

I know that my last resort is to do it all manually, but I think I would run into the same problem later when actually training the final model anyway.

Thank you very much for taking care of my problem, reading my question, and correcting the typos in the text ^^.

I am looking forward to constructive solutions from this amazing community! (:

==========================================================================

GPU: NVIDIA RTX 2080 Ti and Titan Xp Collectors Edition (I tried both)

TensorFlow: 2.1.0

Keras: 2.3.1

Talos: 1.0

Solution

Disabling eager execution solved the problem for me: tf.compat.v1.disable_eager_execution()
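A minimal sketch of where that call goes, assuming it runs before the session is configured and before any model is built:

import tensorflow as tf

# Must happen before any session or model code runs
tf.compat.v1.disable_eager_execution()

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config = config)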

https://github.com/autonomio/talos/issues/482