Problem description
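I am trying to install the latest version of the NVIDIA Clara Deploy Bootstrap by following the official documentation (this and this). The first installation step is a shell script named bootstrap.sh, which installs all dependencies, including Kubernetes and kubectl, and creates the cluster. But when I run sudo ./bootstrap.sh, I get the following error: error: the server doesn't have a resource type "pods".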
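What I have done so far:
I am fairly new to Kubernetes, so I tried the solution from this answer and ran kubectl get pods, which gave me No resources found. I also tried kubectl auth can-i get pods, which gave me yes. The directory /etc/kubernetes/manifests is empty, although according to that answer it should contain the conf files, so I ran sudo kubeadm init.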
Here is the full error message:
2020-10-17 20:57:37 [INFO]: Clara Deploy SDK System Prerequisites Installation
2020-10-17 20:57:37 [INFO]: Checking user privilege...
2020-10-17 20:57:37 [INFO]: Checking for NVIDIA GPU driver...
2020-10-17 20:57:37 [INFO]: NVIDIA CUDA driver version found: 418.87.01
2020-10-17 20:57:37 [INFO]: NVIDIA GPU driver found
2020-10-17 20:57:37 [INFO]: Check and install required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq ...
Ign:1 http://deb.debian.org/debian stretch InRelease
Get:2 http://security.debian.org stretch/updates InRelease [53.0 kB]
Get:3 http://deb.debian.org/debian stretch-updates InRelease [93.6 kB]
Get:4 http://deb.debian.org/debian stretch-backports InRelease [91.8 kB]
Hit:5 http://deb.debian.org/debian stretch Release
Hit:6 http://packages.cloud.google.com/apt gcsfuse-stretch InRelease
Get:7 https://download.docker.com/linux/debian stretch InRelease [44.8 kB]
Get:8 http://packages.cloud.google.com/apt cloud-sdk-stretch InRelease [6,389 B]
Get:9 http://security.debian.org stretch/updates/main Sources [263 kB]
Hit:10 http://packages.cloud.google.com/apt google-compute-engine-stretch-stable InRelease
Get:11 http://security.debian.org stretch/updates/main amd64 Packages [604 kB]
Get:12 http://security.debian.org stretch/updates/main Translation-en [267 kB]
Hit:13 http://packages.cloud.google.com/apt google-cloud-packages-archive-keyring-stretch InRelease
Hit:14 https://nvidia.github.io/libnvidia-container/stable/debian9/amd64 InRelease
Hit:16 https://nvidia.github.io/nvidia-container-runtime/stable/debian9/amd64 InRelease
Hit:15 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Hit:18 https://nvidia.github.io/nvidia-docker/debian9/amd64 InRelease
Fetched 1,424 kB in 1s (1,175 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
apt-transport-https is already the newest version (1.4.10).
ca-certificates is already the newest version (20200601~deb9u1).
dirmngr is already the newest version (2.1.18-8~deb9u4).
jq is already the newest version (1.5+dfsg-1.3).
lsb-release is already the newest version (9.20161125).
network-manager is already the newest version (1.6.2-3+deb9u2).
unzip is already the newest version (6.0-21+deb9u2).
curl is already the newest version (7.52.1-5+deb9u12).
software-properties-common is already the newest version (0.96.20.2-1+deb9u1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
2020-10-17 20:57:41 [INFO]: Starting network-manager service...
2020-10-17 20:57:41 [INFO]: Successfully installed required packages: apt-transport-https ca-certificates curl software-properties-common network-manager unzip lsb-release dirmngr jq !
2020-10-17 20:57:41 [INFO]: disabling swap ...
2020-10-17 20:57:41 [INFO]: Start installing docker and nvidia-docker2 ...
2020-10-17 20:57:41 [INFO]: 'proteeti_prova' is already added to docker group. Skipping docker group configuration ...
2020-10-17 20:57:41 [INFO]: Skipping nvidia-docker install since it is already present.
WARNING: No swap limit support
2020-10-17 20:57:42 [INFO]: Docker Compose version 1.25.4 is already installed. Skipping docker-compose installation...
2020-10-17 20:57:42 [INFO]: The following versions of k8s components are already installed.
Error from server (NotFound): the server could not find the requested resource
2020-10-17 20:57:43 [INFO]: - kubectl: Client Version: v1.15.4
2020-10-17 20:57:43 [INFO]: - kubelet: Kubernetes v1.15.4
2020-10-17 20:57:44 [INFO]: - kubeadm: v1.15.4
2020-10-17 20:57:45 [INFO]: Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
error: the server doesn't have a resource type "pods"
Solution
1. Instance:
GCP, Ubuntu 18.04
n1-standard-16 (16 vCPUs, 60 GB memory)
1 x NVIDIA Tesla T4
2. Download the bootstrap package and unzip it:
$ curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_bootstrap/versions/0.7.1-2008.1/files/bootstrap.zip
$ unzip bootstrap.zip -d bootstrap
3. Install CUDA first and reboot:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
$ sudo apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
$ sudo apt-get update
$ sudo apt-get -y install cuda
$ sudo reboot
4. After the reboot, enable IP forwarding:
$ sudo -s
# echo 1 > /proc/sys/net/ipv4/ip_forward
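One thing worth knowing about the step above: writing to /proc only changes the running kernel, so the setting is lost on the next reboot. A minimal sketch of a check plus a persistent setting follows. The function name ip_forward_enabled and the file name 99-ip-forward.conf are my own choices for illustration, not something bootstrap.sh uses.

```shell
# Succeeds if the given file (default: the kernel's ip_forward switch) contains 1.
# The optional argument exists only so the check can be tried against a test file.
ip_forward_enabled() {
    ipf_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$ipf_file" 2>/dev/null)" = "1" ]
}

# Usage (as root); nothing below runs automatically:
#   ip_forward_enabled || sysctl -w net.ipv4.ip_forward=1
#   # Persist across reboots; the file name is an arbitrary, hypothetical choice.
#   echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-ip-forward.conf
```

If the instance is rebooted later, this saves you from repeating step 4 by hand.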
5. Run bootstrap.sh (first time). kubelet.service shows a code=exited, status=255 error:
$ sudo ./bootstrap/bootstrap.sh
...
...
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Mon 2020-10-19 10:40:54 UTC; 2s ago
Docs: https://kubernetes.io/docs/home/
Process: 2356 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 2356 (code=exited, status=255)
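This error means that you have to run kubeadm init manually: until then the kubelet has no configuration and keeps restarting. So run kubeadm init --pod-network-cidr=10.244.0.0/16, then check sudo service kubelet status again to make sure it is running as expected. All the Kubernetes configuration is generated for you during kubeadm init --pod-network-cidr=10.244.0.0/16.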
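If you want to tell quickly whether kubeadm init has already been run on a machine, a rough check is to look for the files it generates. The sketch below assumes the default kubeadm layout (admin.conf and the API server's static pod manifest under /etc/kubernetes); the function name kubeadm_initialized is made up for this example.

```shell
# Succeeds if DIR (default: /etc/kubernetes) looks like kubeadm init has completed:
# admin.conf exists and the static pod manifest for the API server is present.
# This is a heuristic based on the default kubeadm layout, not an official check.
kubeadm_initialized() {
    k8s_dir="${1:-/etc/kubernetes}"
    [ -f "$k8s_dir/admin.conf" ] && [ -f "$k8s_dir/manifests/kube-apiserver.yaml" ]
}

# Usage; nothing below runs automatically:
#   kubeadm_initialized || sudo kubeadm init --pod-network-cidr=10.244.0.0/16
```

An empty /etc/kubernetes/manifests directory, as described in the question, makes this check fail, which matches the symptom.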
6. We add --pod-network-cidr=10.244.0.0/16 because we are going to use the Flannel CNI. You can see the same flag on line 334 of bootstrap.sh:
if ! sudo kubeadm init --pod-network-cidr="10.244.0.0/16"; then
$ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.15.12
[preflight] Pulling images required for setting up a Kubernetes cluster
...
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Activating the kubelet service
...
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
...
[apiclient] All control plane components are healthy after 19.501975 seconds
...
Your Kubernetes control-plane has initialized successfully!
...
$ sudo service kubelet status
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2020-10-19 13:42:22 UTC; 4min 15s ago
7. The next steps are the usual ones, and you can run them as your regular user rather than root:
$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
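A common slip in the step above is to skip the chown, which leaves a root-owned config behind. A small sketch of a sanity check follows; the function name kubeconfig_usable is hypothetical, and the check only looks at file ownership and readability, not at whether the cluster answers.

```shell
# Succeeds if FILE (default: $HOME/.kube/config) exists, is readable,
# and is owned by the user running the check.
kubeconfig_usable() {
    kc_file="${1:-$HOME/.kube/config}"
    [ -r "$kc_file" ] && [ -O "$kc_file" ]
}

# Usage; nothing below runs automatically:
#   kubeconfig_usable || echo "fix ownership of $HOME/.kube/config first"
```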
8. Show everything that is currently installed:
$ kubectl get all -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 0/1 Pending 0 4m17s
kube-system pod/coredns-5c98db65d4-kgzg8 0/1 Pending 0 4m17s
kube-system pod/etcd-clara 1/1 Running 0 3m10s
kube-system pod/kube-apiserver-clara 1/1 Running 0 3m35s
kube-system pod/kube-controller-manager-clara 1/1 Running 0 3m17s
kube-system pod/kube-proxy-8qx4z 1/1 Running 0 4m18s
kube-system pod/kube-scheduler-clara 1/1 Running 0 3m23s
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 4m35s
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 4m34s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-proxy 1 1 1 1 1 beta.kubernetes.io/os=linux 4m33s
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
Note: the coredns pods are currently in the Pending state. You can also see that the coredns deployment and replicaset are not ready yet:
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4m34s
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 0 4m18s
They are waiting for you to apply the Flannel configuration YAML. These are the relevant lines from the same script:
info "Deploy kubernetes pod network."
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel.yml
sudo kubectl apply -f $SCRIPT_DIR/kube-flannel-rbac.yml
If you do not do this now and re-run the script, you will get a timeout error:
2020-10-19 14:14:03 [INFO]: coredns pods are not running yet ...
9. Deploy Flannel:
$ kubectl apply -f bootstrap/kube-flannel.yml
podsecuritypolicy.extensions/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.extensions/kube-flannel-ds-amd64 created
daemonset.extensions/kube-flannel-ds-arm64 created
daemonset.extensions/kube-flannel-ds-arm created
daemonset.extensions/kube-flannel-ds-ppc64le created
daemonset.extensions/kube-flannel-ds-s390x created
$ kubectl apply -f bootstrap/kube-flannel-rbac.yml
clusterrole.rbac.authorization.k8s.io/flannel configured
clusterrolebinding.rbac.authorization.k8s.io/flannel unchanged
After this, everything related to coredns starts working right away. The pods are created and reach the Running state, and the deployment and replicaset show the correct state:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5c98db65d4-cpz4s 1/1 Running 0 21m
kube-system pod/coredns-5c98db65d4-kgzg8 1/1 Running 0 21m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 2/2 2 2 21m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5c98db65d4 2 2 2 21m
You will also see a new pod and daemonsets related to Flannel:
kube-system pod/kube-flannel-ds-amd64-64jbv 1/1 Running 0 3m59s
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system daemonset.apps/kube-flannel-ds-amd64 1 1 1 1 1 beta.kubernetes.io/arch=amd64 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm 0 0 0 0 0 beta.kubernetes.io/arch=arm 3m59s
kube-system daemonset.apps/kube-flannel-ds-arm64 0 0 0 0 0 beta.kubernetes.io/arch=arm64 3m59s
kube-system daemonset.apps/kube-flannel-ds-ppc64le 0 0 0 0 0 beta.kubernetes.io/arch=ppc64le 3m59s
kube-system daemonset.apps/kube-flannel-ds-s390x 0 0 0 0 0 beta.kubernetes.io/arch=s390x 3m59s
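Rather than re-running kubectl get by hand until coredns comes up, you could poll for it. The following is only a sketch: pods_not_running is a helper name I made up, and it assumes the five-column layout that kubectl get pods -n kube-system --no-headers prints (NAME READY STATUS RESTARTS AGE).

```shell
# Reads `kubectl get pods -n <ns> --no-headers` output on stdin and prints how many
# pods whose name starts with PREFIX (required, first argument) have a STATUS other
# than Running. Assumes STATUS is the third column, i.e. no NAMESPACE column.
pods_not_running() {
    awk -v prefix="$1" 'index($1, prefix) == 1 && $3 != "Running" { n++ } END { print n + 0 }'
}

# Usage; nothing below runs automatically. Note that the loop also ends
# if no matching pods exist at all, so check that they were created first.
#   until [ "$(kubectl get pods -n kube-system --no-headers | pods_not_running coredns)" = "0" ]; do
#       sleep 5
#   done
```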
10. Finally, you can continue running the script. It will try (!!!) to install helm and tiller and to restart dockerd. Everything except TILLER...
$ sudo ./bootstrap/bootstrap.sh
[INFO]: Clara Deploy SDK System Prerequisites Installation
...
Skipping Kubernetes installation (version: 1.15.4-00) since Kubernetes is already present.
./bootstrap/bootstrap.sh: line 412: helm: command not found
...
[INFO]: Start installing helm ...
...
[INFO]: Restarting dockerd...
The connection to the server *.*.*.*:6443 was refused - did you specify the right host or port?
[INFO]: Waiting for Kubernetes to be ready...
Kubernetes master is running at https://*.*.*.*:6443
KubeDNS is running at https://*.*.*.*:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
...
[INFO]: Updating permissions...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
[INFO]: tiller pod is not started yet ...
11. We have no tiller pod. As a result, the deployment and replicaset are broken as well:
kube-system deployment.apps/tiller-deploy 0/1 0 0 7m26s
kube-system replicaset.apps/tiller-deploy-659c6788f5 1 0 0 7m26s
I don't see any solution here other than manually deleting the tiller-related components (deployment, service) and installing them from scratch, so here is a small workaround.
# delete tiller
$ kubectl delete deployment tiller-deploy -n kube-system
$ kubectl delete service tiller-deploy -n kube-system
# install helm, tiller
$ curl https://raw.githubusercontent.com/helm/helm/master/scripts/get | bash
$ kubectl create serviceaccount --namespace kube-system tiller
$ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
$ helm init --service-account tiller
Now, if you check what has been deployed, you will clearly see that the tiller pod is in the Pending state and the tiller-deploy deployment is not ready:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 0/1 Pending 0 11m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 0/1 1 0 11m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 0 11m
12. Fix tiller
Let's describe the tiller pod and look at its tolerations:
$ kubectl describe pod tiller-deploy-67847cd9b9-vlzm6 -n kube-system
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
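I won't explain why here (you can read up on taints and tolerations yourself), but the workaround is to allow the master node to run pods:
$ kubectl taint nodes --all node-role.kubernetes.io/master-
After that, you will see: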
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/tiller-deploy-67847cd9b9-vlzm6 1/1 Running 0 13m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/tiller-deploy 1/1 1 1 13m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/tiller-deploy-67847cd9b9 1 1 1 13m
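For context on why removing the taint helps, as far as I understand it: kubeadm taints the control-plane node with node-role.kubernetes.io/master:NoSchedule by default, and the tiller pod's tolerations shown above do not cover that taint, so on a single-node cluster the pod has nowhere to be scheduled. A sketch for checking which nodes still carry the taint follows; nodes_with_master_taint is a made-up helper name, and the input format is the tab-separated output of the jsonpath query shown in the usage comment.

```shell
# Reads lines of the form "<node name><TAB><space-separated taint keys>" on stdin
# and prints the names of nodes that carry the node-role.kubernetes.io/master taint.
nodes_with_master_taint() {
    awk -F '\t' '{
        n = split($2, keys, " ")
        for (i = 1; i <= n; i++)
            if (keys[i] == "node-role.kubernetes.io/master") { print $1; break }
    }'
}

# Usage; nothing below runs automatically:
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
#     | nodes_with_master_taint
```

Empty output means no node is tainted any more and ordinary pods such as tiller can be scheduled on the master.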
13. Next, install all the components:
$ curl -LO https://api.ngc.nvidia.com/v2/resources/nvidia/clara/clara_cli/versions/0.7.1-2008.1/files/cli.zip
$ sudo unzip cli.zip -d /usr/bin/ && sudo chmod 755 /usr/bin/clara*
$ clara version
Clara CLI version: 0.7.1-12788.ae65aea0
$ clara config --key KEY --orgteam nvidia/clara -y
Configuration "ngc-clara" successfully created
$ clara pull platform
Clara Platform 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara
$ clara platform start
Starting clara...
NAME: clara
$ clara pull dicom
Clara Dicom Adapter 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/dicom-adapter
$ clara pull render
Clara Renderer 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-renderer
$ clara pull monitor
Clara Monitor Server 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-monitor-server
$ clara pull console
Clara Management Console 0.7.1-2008.1
Chart saved at: /home/YOUR_USER/.clara/charts/clara-console
$ clara dicom start
Starting DICOM Adapter...
NAME: clara-dicom-adapter
$ clara render start
NAME: clara-render-server
$ clara monitor start
NAME: clara-monitor-server
$ clara console start
NAME: clara-console
14. To verify that the installation succeeded, run the following commands:
$ helm ls
NAME REVISION UPDATED STATUS CHART APP VERSION NAMESPACE
clara 1 Mon Oct 19 16:16:36 2020 DEPLOYED clara-0.7.1-2008.1 1.0 default
clara-console 1 Mon Oct 19 16:28:30 2020 DEPLOYED clara-console-0.7.1-2008.1 1.0 default
clara-dicom-adapter 1 Mon Oct 19 16:22:36 2020 DEPLOYED dicom-adapter-0.7.1-2008.1 1.0 default
clara-monitor-server 1 Mon Oct 19 16:26:35 2020 DEPLOYED clara-monitor-server-0.7.1-2008.1 1.0 default
clara-render-server 1 Mon Oct 19 16:22:54 2020 DEPLOYED clara-renderer-0.7.1-2008.1 1.0 default
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
clara-clara-platformapiserver-54c5c44bbd-gqdd6 1/1 Running 0 13m
clara-console-8565b4d565-wcbg5 2/2 Running 0 2m2s
clara-console-mongodb-85f8bd5f95-ts2gp 1/1 Running 0 2m2s
clara-dicom-adapter-7948fcd445-mnsjd 1/1 Running 0 7m56s
clara-monitor-server-fluentd-elasticsearch-6zvhq 1/1 Running 0 3m57s
clara-monitor-server-grafana-5f874b974d-6l4s8 1/1 Running 0 3m57s
clara-monitor-server-monitor-server-59c8bf68f7-5dgxq 1/1 Running 0 3m57s
clara-render-server-clara-renderer-d79dd4779-wcjrv 3/3 Running 0 7m38s
clara-resultsservice-664477898f-9nk4f 1/1 Running 0 13m
clara-ui-6f89b97df8-792f6 1/1 Running 0 13m
clara-workflow-controller-69cbb55fc8-zjhdm 1/1 Running 0 13m
elasticsearch-master-0 1/1 Running 0 3m57s
elasticsearch-master-1 1/1 Running 0 3m57s
fluentd-km8nj 1/1 Running 0 13m
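If you would rather not compare the release list by eye, here is a sketch that reports which expected releases are absent. missing_releases is a hypothetical helper; it expects one release name per line on stdin, which is what helm ls -q (short output) should give you.

```shell
# Prints each expected release name (given as arguments) that does not appear
# as a whole line in the list of actual release names read from stdin.
missing_releases() {
    actual=$(cat)
    for want in "$@"; do
        printf '%s\n' "$actual" | grep -qxF -- "$want" || printf '%s\n' "$want"
    done
}

# Usage; nothing below runs automatically:
#   helm ls -q | missing_releases clara clara-console clara-dicom-adapter \
#       clara-monitor-server clara-render-server
```

Empty output means all five releases listed above are present; it says nothing about whether their pods are healthy, so the kubectl get pods check is still worth doing.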
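P.S. Of course, it would be much easier to just fix the script for you, but I decided to show you what happens behind the scenes. I am sure you will do it yourself if you need to.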