Restarting the Docker daemon on the host node from within a Kubernetes container

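Problem description

Goal: restart the Docker daemon on GKE

Problem: cannot connect to the bus

Background: While using Google Kubernetes Engine (GKE), I am trying to restart the Docker daemon on the host node in order to enable Nvidia GPU Telemetry for Kubernetes on nodes with GPUs. I have correctly isolated the GPU nodes, and by following the "Automatically bootstrapping Kubernetes Engine nodes with daemonSets" guide I can run every command on the host node through an initContainer in a DaemonSet.

At runtime, however, the following pod does not allow me to connect to the Docker daemon: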

apiVersion: v1
kind: Pod
metadata:
  name: debug
  namespace: gpu-monitoring
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: Exists
  containers:
  - command:
    - sleep
    - "86400"
    env:
    - name: ROOT_MOUNT_DIR
      value: /root
    image: docker.io/ubuntu:18.04
    imagePullPolicy: IfNotPresent
    name: node-initializer
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /root
      name: root
    - mountPath: /scripts
      name: entrypoint
    - mountPath: /run
      name: run
  volumes:
  - hostPath:
      path: /
      type: ""
    name: root
  - configMap:
      defaultMode: 484
      name: nvidia-container-toolkit-installer-entrypoint
    name: entrypoint
  - hostPath:
      path: /run
      type: ""
    name: run

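The user ID inside the container is 0, while the users present in /run/user are 1003 and 1002.

To verify connectivity and interaction with the underlying Kubernetes (k8s) node, I ran the following: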

root@debug:/# chroot "${ROOT_MOUNT_DIR}" ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 226124  9816 ?        Ss   Oct13   0:27 /sbin/init

Problem

Both images

When trying to interact with the underlying Kubernetes (k8s) node to restart the Docker daemon, I get the following:

root@debug:/# ls /run/dbus

system_bus_socket

root@debug:/# ROOT_MOUNT_DIR="${ROOT_MOUNT_DIR:-/root}"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" systemctl status docker

Failed to connect to bus: No data available
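
One way to narrow this error down is to check which sockets are actually visible under the mounted host root. The sketch below rests on an assumption: that systemctl, when run as root, reaches systemd through /run/systemd/private and otherwise through the system bus socket at /run/dbus/system_bus_socket. check_bus_sockets is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# check_bus_sockets ROOT
# Hypothetical helper: reports whether the sockets that systemctl is assumed
# to use are visible under ROOT (the host filesystem mount).
check_bus_sockets() {
  root="$1"
  missing=0
  for sock in run/systemd/private run/dbus/system_bus_socket; do
    if [ -S "${root}/${sock}" ]; then
      echo "present: /${sock}"
    else
      echo "missing: /${sock}"
      missing=1
    fi
  done
  return "$missing"
}

# ROOT_MOUNT_DIR defaults to /root, matching the pod spec above.
if ! check_bus_sockets "${ROOT_MOUNT_DIR:-/root}"; then
  echo "at least one socket is not visible through the host mount"
fi
```

If both sockets are reported as present and the error persists, the cause is probably not a missing mount, and the check has at least ruled that out.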

When trying to start dbus on the host node:

root@debug:/# export XDG_RUNTIME_DIR=/run/user/`id -u`
root@debug:/# export DBUS_SESSION_BUS_ADDRESS="unix:path=${XDG_RUNTIME_DIR}/bus"
root@debug:/# chroot "${ROOT_MOUNT_DIR}" /etc/init.d/dbus start

Failed to connect to bus: No data available
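
To see what those two exports resolve to, here is a small sketch that rebuilds them and checks whether a socket exists at the resulting path. For UID 0 the runtime directory becomes /run/user/0, which may not match the entries listed under /run/user earlier. Note also that DBUS_SESSION_BUS_ADDRESS describes a per-user session bus, which is separate from the system bus socket under /run/dbus.

```shell
#!/usr/bin/env bash
# Rebuild the two variables the same way as in the session above.
uid="$(id -u)"
XDG_RUNTIME_DIR="/run/user/${uid}"
DBUS_SESSION_BUS_ADDRESS="unix:path=${XDG_RUNTIME_DIR}/bus"

echo "uid:         ${uid}"
echo "runtime dir: ${XDG_RUNTIME_DIR}"
echo "session bus: ${DBUS_SESSION_BUS_ADDRESS}"

# The address is only usable if a socket actually exists at that path.
if [ -S "${XDG_RUNTIME_DIR}/bus" ]; then
  echo "session bus socket found"
else
  echo "no session bus socket at ${XDG_RUNTIME_DIR}/bus"
fi
```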

Image: solita/ubuntu-systemd

Running the same commands with the same k8s pod configuration, but from inside the solita/ubuntu-systemd image, gives the following:

root@debug:/# /etc/init.d/dbus start
[....] Starting dbus (via systemctl): dbus.serviceRunning in chroot, ignoring request: start
. ok 

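Configuration changes attempted: I tried almost every combination of the following changes, to no avail:

  • Changing the image to docker.io/solita/ubuntu-systemd:18.04
  • Adding shareProcessNamespace: true
  • Adding the following mounts: /dev, /proc, /sys
  • Restricting the /run mount to /run/dbus and /run/systemd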

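Solution

So, the answer is an odd workaround that I did not entirely expect. To restart the Docker daemon, first punch a hole in the firewall so that the pod can connect to the host node. Then use gcloud compute ssh to SSH into the node and restart Docker with a remote command: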

# Install the tools needed to add the Cloud SDK apt repository, plus an SSH client.
apt-get update
apt-get install -y \
  apt-transport-https \
  curl \
  gnupg \
  lsb-release \
  ssh

# Add the Cloud SDK repository and install gcloud.
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb https://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
apt-get update
apt-get install -y google-cloud-sdk

# Look up this node's identity from the metadata server.
# CLUSTER_NAME and MAIN_ZONE are not used by the ssh command below.
CLUSTER_NAME="$(curl -sS http://metadata/computeMetadata/v1/instance/attributes/cluster-name -H "Metadata-Flavor: Google")"
NODE_NAME="$(curl -sS http://metadata.google.internal/computeMetadata/v1/instance/name -H 'Metadata-Flavor: Google')"
FULL_ZONE="$(curl -sS http://metadata.google.internal/computeMetadata/v1/instance/zone -H 'Metadata-Flavor: Google' | awk -F  "/" '{print $4}')"
MAIN_ZONE="$(echo "$FULL_ZONE" | sed 's/\(.*\)-.*/\1/')"

# SSH to the node over its internal IP and restart Docker there.
gcloud compute ssh \
  --internal-ip "$NODE_NAME" \
  --zone="$FULL_ZONE" \
  -- "sudo systemctl restart docker"
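
The awk and sed steps in the script above are easier to follow with a concrete value. The sketch below runs the same two transformations offline on a sample string. The sample is an assumption: the metadata server returns the zone as a path of the form projects/PROJECT_NUMBER/zones/ZONE, and the project number used here is made up.

```shell
#!/usr/bin/env bash
# Hypothetical sample of what the metadata server returns for instance/zone.
sample_zone_path="projects/123456789/zones/us-central1-a"

# Field 4 of the slash-separated path is the zone name.
FULL_ZONE="$(printf '%s\n' "$sample_zone_path" | awk -F "/" '{print $4}')"

# Dropping everything from the last "-" onward leaves the region.
MAIN_ZONE="$(printf '%s\n' "$FULL_ZONE" | sed 's/\(.*\)-.*/\1/')"

echo "zone=${FULL_ZONE} region=${MAIN_ZONE}"
# prints: zone=us-central1-a region=us-central1
```

Only FULL_ZONE is needed by gcloud compute ssh, since the --zone flag expects the full zone name rather than the region.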
