无法使用自定义容器在 Google AI Platform 上加载动态库 libcuda.so.1 错误

问题描述

我正在尝试使用自定义容器在 Google AI Platform 上启动训练作业。由于我想使用 GPU 进行训练，因此我用于容器的基本图像是：

FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04

有了这张图片（并在其上安装了 tensorflow 2.4.1），我以为我可以在 AI Platform 上使用 GPU，但似乎并非如此。训练开始时，日志显示如下：

W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices,tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`,not using nccl allreduce.

这是构建图像以在 Google AI Platform 上使用 GPU 的好方法吗？或者我应该尝试依赖 tensorflow 图像并手动安装所有需要的驱动程序来利用 GPU？

编辑：我在这里 (https://cloud.google.com/ai-platform/training/docs/containers-overview) 阅读了以下内容：

For training with GPUs,your custom container needs to meet a few
special requirements. You must build a different Docker image than     
what you'd use for training with cpus.

Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the 
nvidia/cuda image as your base image is the recommended way to handle 
this. It has the matching versions of CUDA toolkit and cuDNN pre-
installed,and it helps you set up the related environment variables 
correctly.

Install your training application,along with your required ML     
framework and other dependencies in your Docker image.

他们还提供了一个 Dockerfile 示例 here，用于使用 GPU 进行训练。所以我所做的似乎没问题。不幸的是，我仍然有上面提到的这些错误，它们可以解释（或不能解释）为什么我不能在 Google AI Platform 上使用 GPU。

EDIT2：在此处阅读 (https://www.tensorflow.org/install/gpu) 我的 Dockerfile 现在是：

FROM tensorflow/tensorflow:2.4.1-gpu
RUN apt-get update && apt-get install -y \
 lsb-release \
 vim \
 curl \
 git \
 libgl1-mesa-dev \
 software-properties-common \
 wget && \
 rm -rf /var/lib/apt/lists/*

# Add NVIDIA package repositories
RUN wget -nv https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
RUN mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
RUN apt-get update

RUN wget -nv http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

RUN apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
RUN apt-get update

# Install NVIDIA driver
RUN apt-get install -y --no-install-recommends nvidia-driver-450
# Reboot. Check that GPUs are visible using the command: nvidia-smi

RUN wget -nv https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
RUN apt-get update

# Install development and runtime libraries (~4GB)
RUN apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0


# other stuff

问题是构建在似乎是键盘配置的阶段冻结。系统要求选择一个国家，当我输入号码时，没有任何反应

解决方法

构建最可靠容器的建议方法是使用官方维护的“深度学习容器”。我建议拉'gcr.io/deeplearning-platform-release/tf2-gpu.2-4'。这应该已经安装并测试了 CUDA、CUDNN、GPU 驱动程序和 TF 2.4。您只需将代码添加到其中即可。

cuda cuda docker docker google-ai-platform nvidia tensorflow tensorflow tensorflow