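Problem description
I am trying to provision a new node pool with the gVisor sandbox in GKE. I use the GCP web console to add a new node pool, select the cos_containerd OS image, and tick the "Enable gVisor sandbox" checkbox, but every time the node pool provisioning fails with an "Unknown error" in the GCP console notifications. The nodes never join the K8s cluster.
The GCE VMs seem to boot fine. When I look at journalctl on the node, cloud-init appears to have finished, but the kubelet does not seem able to start. I see error messages like this: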
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.184163 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.284735 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.385229 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.485626 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.522961 1143 eviction_manager.go:251] eviction manager: Failed to get summary stats: Failed to get node info: node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz containerd[976]: time="2020-10-12T16:58:07.576735750Z" level=error msg="Failed to load cni configuration" error="cni config load Failed: no network config found in /etc/cni/net.d: cni plugin not initialized: Failed to load cni config"
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.577353 1143 kubelet.go:2191] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.587824 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.989869 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:08 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:08.090287 1143
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.296365 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.396933 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz node-problem-detector[1166]: F1012 16:58:09.449446 2481 main.go:71] cannot create certificate signing request: Post https://172.17.0.2/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?timeout=5m0s: dial tcp 172.17.0.2:443: connect: no route
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz node-problem-detector[1166]: E1012 16:58:09.450695 1166 manager.go:162] Failed to update node conditions: Patch https://172.17.0.2/api/v1/nodes/gke-main-sanBoxes-dd9b8d84-dmzz/status: getting credentials: exec: exit status 1
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.453825 2486 cache.go:125] Failed reading existing private key: open /var/lib/kubelet/pki/kubelet-client.key: no such file or directory
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.543449 1143 kubelet.go:2271] node "gke-main-sanBoxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanBoxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.556623 2486 tpm.go:124] Failed reading AIK cert: tpm2.NVRead(AIK cert): decoding NV_ReadPublic response: handle 1,error code 0xb : the handle is not correct for the use
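Since the same kubelet error repeats roughly every 100 ms, it can help to collapse the journal output into distinct error sources before reading it. The following is a minimal sketch, assuming the lines have the syslog-style prefix and klog header shown above; the function name `summarize` and both regular expressions are illustrative, not part of any GKE tooling.

```python
import re
from collections import Counter

# Matches the journal prefix "Mon DD HH:MM:SS host unit[pid]: message".
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+(?P<unit>[\w.-]+)\[\d+\]:\s+(?P<msg>.*)$"
)
# Matches the klog header "E1012 16:58:07.184163 1143 file.go:123] text".
KLOG_RE = re.compile(
    r"^[IWEF]\d{4}\s+[\d:.]+\s+\d+\s+(?P<src>\S+:\d+)\]\s+(?P<text>.*)$"
)


def summarize(lines):
    """Count log lines per (unit, source location) so repeated errors collapse."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that lack the journal prefix
        k = KLOG_RE.match(m.group("msg"))
        # Fall back to the first 60 characters for non-klog messages (e.g. containerd).
        key = (m.group("unit"), k.group("src") if k else m.group("msg")[:60])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): journalctl --no-pager | python summarize.py
    for (unit, src), n in summarize(sys.stdin).most_common():
        print(f"{n:5d}  {unit}  {src}")
```

Run against the excerpt above, this would group the many `kubelet.go:2271` lines into one entry and leave the less frequent entries (the CNI configuration error, the failed certificate signing request, the missing `kubelet-client.key`) easier to spot.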
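I am not sure what could be causing this. I really want to be able to use autoscaling for this node pool, so I do not want to fix it manually on this one node and then have to repeat the fix for every new node that joins. How can I configure the node pool so that gVisor-based nodes provision themselves correctly?
My cluster details:
- GKE version: 1.17.9-gke.6300
- Cluster type: regional
- VPC-native
- Private cluster
- Shielded GKE Nodes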
Solution
You can report issues with Google products via the following link:
You will need to choose Create new Google Kubernetes Engine issue under the Compute section.
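I can confirm that I ran into the same issue when creating a cluster as described in the question (private, shielded, etc.):
- Create a cluster with one node pool.
- After the cluster has been created successfully, add a node pool with gvisor enabled.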
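For anyone who wants to reproduce the second step from the command line instead of the web console, here is a minimal sketch that only builds the `gcloud` command. The cluster name and region are taken from the output in this answer; the pool name `sandbox-pool` and the autoscaling bounds are hypothetical placeholders, and the command is printed rather than executed.

```python
import shlex


def node_pool_create_cmd(cluster, pool, region, min_nodes=0, max_nodes=3):
    """Build (but do not run) a gcloud command for a gVisor-enabled node pool."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--region={region}",
        "--image-type=cos_containerd",  # image type used in the question
        "--sandbox=type=gvisor",        # CLI counterpart of the gVisor sandbox checkbox
        "--enable-autoscaling",
        f"--min-nodes={min_nodes}",     # hypothetical autoscaling bounds
        f"--max-nodes={max_nodes}",
    ]


if __name__ == "__main__":
    # "sandbox-pool" is a made-up pool name for illustration.
    print(shlex.join(node_pool_create_cmd("gke-gvisor", "sandbox-pool", "europe-west3")))
```

Printing the command first makes it easy to review the flags before running it against a real cluster.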
Creating a cluster as described above pushes the GKE cluster into the RECONCILING state:
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
gke-gvisor europe-west3 1.17.9-gke.6300 XX.XXX.XXX.XXX e2-medium 1.17.9-gke.6300 6 RECONCILING
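Changes in the cluster state:
- Provisioning: the cluster is being created
- Running: the cluster has been created
- Reconciling: the node pool was added
- Running: the node pool has been added (for about a minute)
- Reconciling: the cluster stayed in this state for about 25 minutes
The GCP Cloud Console (Web UI) reports: Repairing Cluster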
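To measure how long the cluster stays in each state, the status can be sampled periodically (for instance with `gcloud container clusters describe gke-gvisor --region europe-west3 --format="value(status)"`) and the samples collapsed into runs. The sketch below shows only the collapsing step; the function name `transitions` and the timestamps in the usage example are hypothetical.

```python
from datetime import datetime, timedelta


def transitions(samples):
    """Collapse ordered (timestamp, status) samples into (status, duration) runs."""
    runs = []
    start = current = last = None
    for ts, status in samples:
        if status != current:
            if current is not None:
                # Close the previous run at the moment the new status was first seen.
                runs.append((current, ts - start))
            start, current = ts, status
        last = ts
    if current is not None:
        # The final run lasts until the most recent sample.
        runs.append((current, last - start))
    return runs


if __name__ == "__main__":
    t0 = datetime(2020, 10, 12, 17, 0)  # made-up start time
    samples = [
        (t0, "RUNNING"),
        (t0 + timedelta(minutes=1), "RECONCILING"),
        (t0 + timedelta(minutes=26), "RECONCILING"),
    ]
    for status, duration in transitions(samples):
        print(status, duration)
```

The durations are only as precise as the sampling interval, so a short-lived state between two samples would be missed.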