How to troubleshoot: Kubernetes pods not being created or terminated

Problem description

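I am new to K8s, so I can't get to the bottom of this problem. Last week I used kubeadm to install a cluster with 1 master and 2 worker nodes on CentOS:

kubectl get nodes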

NAME             STATUS   ROLES                  AGE    VERSION
ardl-k8latam01   Ready    control-plane,master   7d2h   v1.20.0
ardl-k8latam02   Ready    <none>                 7d2h   v1.20.0
ardl-k8latam03   Ready    <none>                 7d2h   v1.20.0

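At first it worked fine, but it started failing after I began using Helm (I don't know whether that is related). Now I can't run any deployment, and there are many pods stuck in the Terminating state that never finish. Here is an example where I tried kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml: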

[root@ardl-k8latam01 ~]# kubectl get all --all-namespaces
NAMESPACE     NAME                                          READY   STATUS        RESTARTS   AGE
default       pod/nginx-deployment-66b6c48dd5-2xt7b         1/1     Terminating   0          19h
default       pod/nginx-deployment-66b6c48dd5-5cttk         1/1     Terminating   0          19h
default       pod/nginx-deployment-66b6c48dd5-8bz2f         0/1     Pending       0          18h
default       pod/nginx-deployment-66b6c48dd5-dksqx         1/1     Terminating   0          19h
default       pod/nginx-deployment-66b6c48dd5-fj9kl         0/1     Pending       0          18h
default       pod/nginx-deployment-66b6c48dd5-j4hqv         0/1     Pending       0          18h
kube-system   pod/calico-kube-controllers-bcc6f659f-bgmkb   1/1     Running       0          18h
kube-system   pod/calico-kube-controllers-bcc6f659f-pksws   1/1     Terminating   0          7d21h
kube-system   pod/calico-node-fns6d                         0/1     Running       2          7d21h
kube-system   pod/calico-node-t854c                         1/1     Running       0          7d21h
kube-system   pod/calico-node-vbsdr                         1/1     Running       0          7d21h
kube-system   pod/coredns-74ff55c5b-gw8j2                   1/1     Running       1          18h
kube-system   pod/coredns-74ff55c5b-xhvqb                   1/1     Terminating   0          7d21h
kube-system   pod/coredns-74ff55c5b-xr9mb                   1/1     Terminating   0          7d21h
kube-system   pod/coredns-74ff55c5b-zhhkx                   1/1     Running       1          18h
kube-system   pod/etcd-ardl-k8latam01                       1/1     Running       2          7d21h
kube-system   pod/kube-apiserver-ardl-k8latam01             1/1     Running       4          7d21h
kube-system   pod/kube-controller-manager-ardl-k8latam01    1/1     Running       2          7d21h
kube-system   pod/kube-proxy-2lmpb                          1/1     Running       0          7d21h
kube-system   pod/kube-proxy-fchv8                          1/1     Running       2          7d21h
kube-system   pod/kube-proxy-xks7h                          1/1     Running       0          7d21h
kube-system   pod/kube-scheduler-ardl-k8latam01             1/1     Running       2          7d21h
kube-system   pod/metrics-server-68b849498d-6q74v           1/1     Terminating   0          7d20h
kube-system   pod/metrics-server-68b849498d-7lpz8           0/1     Pending       0          18h

NAMESPACE     NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       service/dashboardlb      ClusterIP   10.100.82.105   <none>        8001/TCP                 7d20h
default       service/kubernetes       ClusterIP   10.96.0.1       <none>        443/TCP                  7d21h
kube-system   service/kube-dns         ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   7d21h
kube-system   service/metrics-server   ClusterIP   10.101.85.63    <none>        443/TCP                  7d20h

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
kube-system   daemonset.apps/calico-node   3         3         0       3            0           beta.kubernetes.io/os=linux   7d21h
kube-system   daemonset.apps/kube-proxy    3         3         1       3            1           kubernetes.io/os=linux        7d21h

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
default       deployment.apps/nginx-deployment          0/3     3            0           18h
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           7d21h
kube-system   deployment.apps/coredns                   2/2     2            2           7d21h
kube-system   deployment.apps/metrics-server            0/1     1            0           7d20h

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
default       replicaset.apps/nginx-deployment-66b6c48dd5         3         3         0       18h
kube-system   replicaset.apps/calico-kube-controllers-bcc6f659f   1         1         1       7d21h
kube-system   replicaset.apps/coredns-74ff55c5b                   2         2         2       7d21h
kube-system   replicaset.apps/metrics-server-68b849498d           1         1         0       7d20h
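
The listing above mixes several resource tables, which makes it hard to see at a glance how many pods are stuck in each state. A minimal sketch for tallying them, assuming the input is the output of `kubectl get pods --all-namespaces --no-headers`, where STATUS is the fourth column (the function name `count_by_status` is made up for illustration):

```shell
# Tally pods by STATUS.
# Expects rows of `kubectl get pods --all-namespaces --no-headers` on stdin:
#   NAMESPACE NAME READY STATUS RESTARTS AGE
count_by_status() {
  # Skip short or blank lines, count the 4th field, then sort for stable output.
  awk 'NF >= 4 { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage on a live cluster (not run here):
#   kubectl get pods --all-namespaces --no-headers | count_by_status
```

A growing Terminating or Pending count after restarting the nodes would suggest the problem is not limited to one workload.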

In the cluster-info dump I get:

==== START logs for container second-node of pod default/second-app-deployment-7f794d896f-q6zn5 ====
Request log error: the server rejected our request for an unknown reason (get pods second-app-deployment-7f794d896f-q6zn5)
==== END logs for container second-node of pod default/second-app-deployment-7f794d896f-q6zn5 ====

With describe:

[root@ardl-k8latam01 testwordpress]# kubectl describe pod nginx-deployment-66b6c48dd5-5cttk
Name:           nginx-deployment-66b6c48dd5-5cttk
Namespace:      default
Priority:       0
Node:           ardl-k8latam02/10.48.41.12
Start Time:     Fri, 18 Dec 2020 17:06:57 -0300
Labels:         app=nginx
                pod-template-hash=66b6c48dd5
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/nginx-deployment-66b6c48dd5
Containers:
  nginx:
    Container ID:
    Image:          nginx:1.14.2
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9rnk6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-9rnk6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-9rnk6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From               Message
  ----     ------                  ----                  ----               -------
  Warning  FailedCreatePodSandBox  22m                   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "044a2201b141e6679570d0f0ec3b1967b2a5bf0b230fa5058ed2bc6711eba55e" network for pod "nginx-deployment-66b6c48dd5-5cttk": networkPlugin cni failed to set up pod "nginx-deployment-66b6c48dd5-5cttk_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: connect: no route to host, failed to clean up sandbox container "044a2201b141e6679570d0f0ec3b1967b2a5bf0b230fa5058ed2bc6711eba55e" network for pod "nginx-deployment-66b6c48dd5-5cttk": networkPlugin cni failed to teardown pod "nginx-deployment-66b6c48dd5-5cttk_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: connect: no route to host]
  Normal   Scheduled               21m                   default-scheduler  Successfully assigned default/nginx-deployment-66b6c48dd5-5cttk to ardl-k8latam02
  Normal   SandboxChanged          2m27s (x93 over 22m)  kubelet            Pod sandbox changed, it will be killed and re-created.

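I also tried restarting the worker nodes and the master, but nothing changed. When I try to describe a pod in the Terminating state, it tells me the pod does not exist.

Is my problem related to Calico? How can I dig deeper into "Request log error: the server rejected our request for an unknown reason"?
How should I continue investigating?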
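
The kubelet event in the describe output reports `dial tcp 10.96.0.1:443: connect: no route to host`, that is, the Calico CNI plugin on ardl-k8latam02 could not reach the `kubernetes` Service address while setting up the pod network. One possible next step, offered as a sketch rather than a confirmed fix, is to test that TCP connection directly from each node. The helper below assumes bash and the coreutils `timeout` command are available; the name `probe_tcp` is made up for illustration:

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds,
# non-zero otherwise. Relies on bash's /dev/tcp pseudo-device.
probe_tcp() {
  # HOST and PORT are passed to the inner shell as $0 and $1
  # so they are never interpolated into the command string.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage on each node (not run here); 10.96.0.1:443 is the address from the event:
#   probe_tcp 10.96.0.1 443 && echo "reachable" || echo "not reachable"
```

If the probe succeeds on the master but fails on a worker, the components that handle Service addresses on that worker would be a natural place to look next: the listing above shows the kube-proxy DaemonSet with only 1 of 3 pods ready and calico-node-fns6d at 0/1.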


kubernetes kubernetes-helm project-calico