Why do gVisor-based node pools fail to bootstrap correctly?

Problem description

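I'm trying to provision a new node pool with the gVisor sandbox in GKE. I use the GCP web console to add the node pool, choose the cos_containerd OS image, and tick the "Enable gVisor Sandbox" checkbox. Every time, node pool provisioning fails and the GCP console notification shows only an "Unknown Error". The nodes never join the K8S cluster.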

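The GCE VM seems to boot fine. When I look through journalctl on the node, cloud-init appears to have finished, but the kubelet does not seem to be able to start. I see error messages like these: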

Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.184163    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.284735    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.385229    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.485626    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.522961    1143 eviction_manager.go:251] eviction manager: Failed to get summary stats: Failed to get node info: node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz containerd[976]: time="2020-10-12T16:58:07.576735750Z" level=error msg="Failed to load cni configuration" error="cni config load Failed: no network config found in /etc/cni/net.d: cni plugin not initialized: Failed to load cni config"
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.577353    1143 kubelet.go:2191] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.587824    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.989869    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:08 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:08.090287    1143 
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.296365    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.396933    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz node-problem-detector[1166]: F1012 16:58:09.449446    2481 main.go:71] cannot create certificate signing request: Post https://172.17.0.2/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?timeout=5m0s: dial tcp 172.17.0.2:443: connect: no route 
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz node-problem-detector[1166]: E1012 16:58:09.450695    1166 manager.go:162] Failed to update node conditions: Patch https://172.17.0.2/api/v1/nodes/gke-main-sanBoxes-dd9b8d84-dmzz/status: getting credentials: exec: exit status 1
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.453825    2486 cache.go:125] Failed reading existing private key: open /var/lib/kubelet/pki/kubelet-client.key: no such file or directory
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.543449    1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.556623    2486 tpm.go:124] Failed reading AIK cert: tpm2.NVRead(AIK cert): decoding NV_ReadPublic response: handle 1,error code 0xb : the handle is not correct for the use
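
Because the same "node not found" line repeats many times in the excerpt above, the less frequent errors (CNI config, certificate signing request, TPM) are easy to miss. The following is a minimal Python sketch for collapsing such journal output into counts per unit and source location. The function name `summarize` and both regular expressions are illustrative assumptions based only on the line layout shown above, not part of any GKE tooling.

```python
import re
from collections import Counter

# Journald prefix: "Oct 12 16:58:07 <host> <unit>[<pid>]: <message>"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+([\w.-]+)\[\d+\]:\s+(.*)$")
# klog header: "E1012 16:58:07.184163    1143 kubelet.go:2271] <text>"
KLOG_RE = re.compile(r"^[IWEF]\d{4}\s+[\d:.]+\s+\d+\s+(\S+:\d+)\]\s+(.*)$")


def summarize(lines):
    """Count (unit, source location) pairs so repeated errors collapse into one row."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip wrapped or truncated fragments without a journald prefix
        unit, message = m.groups()
        k = KLOG_RE.match(message)
        # Messages without a klog header (e.g. containerd's key=value lines)
        # are grouped under a single placeholder.
        source = k.group(1) if k else "(unparsed)"
        counts[(unit, source)] += 1
    return counts


if __name__ == "__main__":
    import sys

    for (unit, source), n in summarize(sys.stdin).most_common():
        print(f"{n:5d}  {unit}  {source}")
```

Saved as, say, `triage.py` (a hypothetical file name), it could be fed from the node with `journalctl --no-pager | python3 triage.py`.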

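I'm not sure what could be causing this. I'd really like to use autoscaling with this node pool, so I don't want to fix this one node by hand and then have to repeat the fix for every new node that joins. How can I configure the node pool so that gVisor-based nodes provision themselves correctly?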

My cluster details:

  • GKE version: 1.17.9-gke.6300
  • Cluster type: regional
  • VPC-native
  • Private cluster
  • Shielded GKE Nodes

Solution

You can report issues with Google products via the following link:

You need to select Compute under the "Create new Google Kubernetes Engine issue" section.


I can confirm that I ran into the same problem when creating a cluster as described in the question (private, shielded, etc.):

  • Create a cluster with one node pool.
  • After the cluster has been created successfully, add a node pool with gVisor enabled.

Creating the cluster as described above pushes the GKE cluster into the RECONCILING state:

NAME        LOCATION      MASTER_VERSION   MASTER_IP       MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
gke-gvisor  europe-west3  1.17.9-gke.6300  XX.XXX.XXX.XXX  e2-medium     1.17.9-gke.6300  6          RECONCILING

Changes in cluster status:

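  • Provisioning: the cluster is being created
  • Running: the cluster has been created
  • Reconciling: the node pool was added
  • Running: the node pool has been added (lasts about a minute)
  • Reconciling: the cluster stays in this state for about 25 minutes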

GCP Cloud Console (Web UI) reports: Repairing Cluster
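
The status sequence above shows the cluster briefly returning to RUNNING before dropping back into RECONCILING, so a script that waits for the first RUNNING would report success too early. Below is a minimal Python sketch that instead waits for several consecutive RUNNING polls. The helper names, the poll interval, and the "5 consecutive polls" threshold are assumptions chosen for illustration. `gcloud_status` shells out to `gcloud container clusters describe` and assumes gcloud is installed and authenticated.

```python
import subprocess
import time


def gcloud_status(cluster, region):
    """Return the cluster status string (e.g. RUNNING, RECONCILING) via gcloud."""
    out = subprocess.run(
        ["gcloud", "container", "clusters", "describe", cluster,
         "--region", region, "--format=value(status)"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def wait_until_stable(get_status, stable_polls=5, max_polls=120, poll_s=30,
                      sleep=time.sleep):
    """Poll until get_status() returns RUNNING for stable_polls consecutive polls.

    Returns True once the status is stable, False if max_polls is exhausted first.
    """
    streak = 0
    for _ in range(max_polls):
        # Any non-RUNNING status (e.g. RECONCILING) resets the streak.
        streak = streak + 1 if get_status() == "RUNNING" else 0
        if streak >= stable_polls:
            return True
        sleep(poll_s)
    return False
```

With the cluster from the listing above, a call might look like `wait_until_stable(lambda: gcloud_status("gke-gvisor", "europe-west3"))`. With the default 30-second interval, five consecutive polls cover more than the roughly one-minute RUNNING window described in the list.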