AutoKeras consumes all GPU memory after Trial 1

Problem description

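While running the examples from this book, I ran into a problem with autokeras. The task is to generate an architecture for a model trained on the MNIST dataset (a "hello world"-level task for AutoKeras). In addition, I had trouble using my laptop GPU and had to add some extra code to use the GPU explicitly.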

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.python.keras.utils.data_utils import Sequence
import autokeras as ak

# Extra code I had to add: let TensorFlow grow its GPU memory on demand
# instead of reserving all of it up front.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

(x_train, y_train), (x_test, y_test) = mnist.load_data()

clf = ak.ImageClassifier(overwrite=True, max_trials=10)

# Extra code I had to add: pin the search to the first GPU explicitly.
with tf.device('/gpu:0'):
    clf.fit(x_train, y_train, epochs=2)
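
For reference, the `ConfigProto`/`Session` lines above go through the TF1 compatibility API. A minimal sketch of the TF2-style way to request on-demand GPU memory follows, assuming TensorFlow 2.1 or later. The helper name `enable_gpu_memory_growth` is hypothetical, and nothing in this post confirms that this setting alone avoids the OOM in Trial 2.

```python
def enable_gpu_memory_growth() -> int:
    """Ask TensorFlow to allocate GPU memory on demand.

    Hypothetical helper. Returns the number of GPUs configured
    (0 if TensorFlow is not installed or no GPU is visible).
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any op touches the GPU; TensorFlow raises
        # RuntimeError if the device has already been initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


configured = enable_gpu_memory_growth()
print("GPUs with memory growth enabled:", configured)
```

The call has to come before AutoKeras or Keras builds any model, since memory growth cannot be changed once the GPU has been initialized.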

Output (epochs set to 2 for faster results):

    Trial 1 Complete [00h 00m 20s]
    val_loss: 0.058981552720069885
    
    Best val_loss So Far: 0.058981552720069885
    Total elapsed time: 00h 00m 20s
    
    Search: Running Trial #2
    
    Hyperparameter      |Value     |Best Value So Far   
    image_block_1/block_type|resnet    |vanilla             
    image_block_1/normalize|True      |True                
    image_block_1/augment|True      |False               
    image_block_1/image_augmentation_1/horizontal_flip|True      |None                
    image_block_1/image_augmentation_1/vertical_flip|False     |None                
    image_block_1/image_augmentation_1/contrast_factor|0.0       |None                
    image_block_1/image_augmentation_1/rotation_factor|0.0       |None                
    image_block_1/image_augmentation_1/translation_factor|0.1       |None                
    image_block_1/image_augmentation_1/zoom_factor|0.0       |None                
    image_block_1/res_net_block_1/pretrained|True      |None                
    image_block_1/res_net_block_1/version|resnet50  |None                
    image_block_1/res_net_block_1/trainable|True      |None                
    image_block_1/res_net_block_1/imagenet_size|True      |None                
    classification_head_1/spatial_reduction_1/reduction_type|global_avg|flatten             
    classification_head_1/dropout|0         |0.5                 
    optimizer           |adam      |adam                
    learning_rate       |1e-05     |0.001               
    
    Epoch 1/2
       2/1500 [..............................] - ETA: 5:31 - loss: 2.4616 - accuracy: 0.1562WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.1631s vs `on_train_batch_end` time: 0.2793s). Check your callbacks.
       3/1500 [..............................] - ETA: 7:11 - loss: 2.4400 - accuracy: 0.1667
    
    ---------------------------------------------------------------------------
    ResourceExhaustedError                    Traceback (most recent call last)
    <ipython-input-6-fc43cdbb1604> in <module>
          1 with tf.device('/gpu:0'):
    ----> 2     clf.fit(x_train,epochs=2)
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/tasks/image.py in fit(self,x,y,epochs,callbacks,validation_split,validation_data,**kwargs)
        152             **kwargs: Any arguments supported by keras.Model.fit.
        153         """
    --> 154         super().fit(
        155             x=x,
        156             y=y,
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/auto_model.py in fit(self,batch_size,**kwargs)
        277         )
        278 
    --> 279         self.tuner.search(
        280             x=dataset,
        281             epochs=epochs,
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/autokeras/engine/tuner.py in search(self,fit_on_val_data,**fit_kwargs)
        136         self.oracle.update_space(hp)
        137 
    --> 138         super().search(epochs=epochs,callbacks=new_callbacks,**fit_kwargs)
        139 
        140         # Train the best model use validation data.
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/base_tuner.py in search(self,*fit_args,**fit_kwargs)
        129 
        130             self.on_trial_begin(trial)
    --> 131             self.run_trial(trial,**fit_kwargs)
        132             self.on_trial_end(trial)
        133         self.on_search_end()
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/tuner.py in run_trial(self,trial,**fit_kwargs)
        151         self._on_train_begin(model,trial.hyperparameters,
        152                              *fit_args,**copied_fit_kwargs)
    --> 153         model.fit(*fit_args,**copied_fit_kwargs)
        154 
        155     def _on_train_begin(model,hp,**fit_kwargs):
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self,*args,**kwargs)
        106   def _method_wrapper(self,**kwargs):
        107     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
    --> 108       return method(self,**kwargs)
        109 
        110     # Running inside `run_distribute_coordinator` already.
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in fit(self,verbose,shuffle,class_weight,sample_weight,initial_epoch,steps_per_epoch,validation_steps,validation_batch_size,validation_freq,max_queue_size,workers,use_multiprocessing)
       1096                 batch_size=batch_size):
       1097               callbacks.on_train_batch_begin(step)
    -> 1098               tmp_logs = train_function(iterator)
       1099               if data_handler.should_sync:
       1100                 context.async_wait()
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in __call__(self,**kwds)
        778       else:
        779         compiler = "nonXla"
    --> 780         result = self._call(*args,**kwds)
        781 
        782       new_tracing_count = self._get_tracing_count()
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py in _call(self,**kwds)
        805       # In this case we have created variables on the first call,so we run the
        806       # defunned version which is guaranteed to never create variables.
    --> 807       return self._stateless_fn(*args,**kwds)  # pylint: disable=not-callable
        808     elif self._stateful_fn is not None:
        809       # Release the lock early so that multiple threads can perform the call
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in __call__(self,**kwargs)
       2827     with self._lock:
       2828       graph_function,args,kwargs = self._maybe_define_function(args,kwargs)
    -> 2829     return graph_function._filtered_call(args,kwargs)  # pylint: disable=protected-access
       2830 
       2831   @property
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _filtered_call(self,kwargs,cancellation_manager)
       1841       `args` and `kwargs`.
       1842     """
    -> 1843     return self._call_flat(
       1844         [t for t in nest.flatten((args,kwargs),expand_composites=True)
       1845          if isinstance(t,(ops.Tensor,
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in _call_flat(self,captured_inputs,cancellation_manager)
       1921         and executing_eagerly):
       1922       # No tape is watching; skip to running the function.
    -> 1923       return self._build_call_outputs(self._inference_function.call(
       1924           ctx,cancellation_manager=cancellation_manager))
       1925     forward_backward = self._select_forward_and_backward_functions(
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/function.py in call(self,ctx,cancellation_manager)
        543       with _InterpolateFunctionError(self):
        544         if cancellation_manager is None:
    --> 545           outputs = execute.execute(
        546               str(self.signature.name),
        547               num_outputs=self._num_outputs,
    
    ~/anaconda3/envs/ML/lib/python3.8/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name,num_outputs,inputs,attrs,name)
         57   try:
         58     ctx.ensure_initialized()
    ---> 59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle,device_name,op_name,
         60                                         inputs,num_outputs)
         61   except core._NotOkStatusException as e:
    
    ResourceExhaustedError:  OOM when allocating tensor with shape[65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node functional_1/global_average_pooling2d/Mean (defined at /home/biowar/anaconda3/envs/ML/lib/python3.8/site-packages/kerastuner/engine/tuner.py:153) ]]
    Hint: If you want to see a list of allocated tensors when OOM happens,add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
     [Op:__inference_train_function_37301]
    
    Function call stack:
    train_function
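
One detail in the error message is worth quantifying: the allocation that failed is small. A rough check, assuming `type float` in the allocator message means 32-bit floats (4 bytes per element):

```python
# Size of the tensor the allocator could not place: shape[65536], type float.
# Assumption: "float" here is a 32-bit float, i.e. 4 bytes per element.
num_elements = 65536
bytes_per_element = 4

size_bytes = num_elements * bytes_per_element
size_kib = size_bytes / 1024

print(f"{size_bytes} bytes = {size_kib:.0f} KiB")  # 262144 bytes = 256 KiB
```

A request of roughly 256 KiB failing suggests the GPU was already almost full when Trial 2 began training, rather than a single oversized tensor being requested.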

Output of nvidia-smi (during Trial 1):

Every 0,5s: nvidia-smi                                                   Nitro5: Sun Aug 30 12:59:30 2020

Sun Aug 30 12:59:31 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0    32W /  N/A |   1101MiB /  3911MiB |     41%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1691      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2362      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      7530      C   ...conda3/envs/ML/bin/python      251MiB |
|    0   N/A  N/A     37376      C   ...conda3/envs/ML/bin/python      837MiB |
+-----------------------------------------------------------------------------+

Output of nvidia-smi (after Trial 2 starts):

Every 0,5s: nvidia-smi                                                   Nitro5: Sun Aug 30 12:58:02 2020

Sun Aug 30 12:58:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 165...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P8     1W /  N/A |   3885MiB /  3911MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1691      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2362      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      7530      C   ...conda3/envs/ML/bin/python      251MiB |
|    0   N/A  N/A     35239      C   ...conda3/envs/ML/bin/python     3621MiB |
+-----------------------------------------------------------------------------+
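
To put the two readings side by side, here is a small sketch that turns the Memory-Usage column into percentages. The numbers are copied from the two nvidia-smi snapshots above; the function name `used_fraction` is only illustrative.

```python
# Memory-Usage values copied from the two nvidia-smi snapshots (MiB).
TOTAL_MIB = 3911
snapshots = {
    "during Trial 1": 1101,
    "after Trial 2 starts": 3885,
}


def used_fraction(used_mib: int, total_mib: int) -> float:
    """Return the fraction of GPU memory in use."""
    return used_mib / total_mib


for label, used in snapshots.items():
    free = TOTAL_MIB - used
    print(f"{label}: {used_fraction(used, TOTAL_MIB):.1%} used, {free} MiB free")
```

This prints about 28.2% used during Trial 1 and 99.3% used (26 MiB free) once Trial 2 starts.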

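Question: how can I modify the code so that it does not use 100% of the GPU memory after Trial 1 completes successfully? Thanks in advance for any answers.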

auto-keras deep-learning keras python tensorflow