Problem description
I am provisioning a workload cluster with one control-plane node and one worker node on top of OpenStack via Cluster API. However, the Kubernetes control plane fails to start properly on the control-plane node.
I can see that kube-apiserver keeps exiting and being re-created:
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ sudo crictl --runtime-endpoint /run/containerd/containerd.sock ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
a729fdd387b0a 90d27391b7808 About a minute ago Running kube-apiserver 74 88de61a0459f6
38b54a71cb0aa 90d27391b7808 3 minutes ago Exited kube-apiserver 73 88de61a0459f6
24573a1c5adc5 b0f1517c1f4bb 18 minutes ago Running kube-controller-manager 4 cc113aaae13b5
a2072b64cca1a b0f1517c1f4bb 29 minutes ago Exited kube-controller-manager 3 cc113aaae13b5
f26a531972518 d109c0821a2b9 5 hours ago Running kube-scheduler 1 df1d15fd61a8f
a91b4c0ce9e27 303ce5db0e90d 5 hours ago Running etcd 1 16e1f0f5bb543
1565a1a7dedec 303ce5db0e90d 5 hours ago Exited etcd 0 16e1f0f5bb543
35ae23eb64f11 d109c0821a2b9 5 hours ago Exited kube-scheduler 0 df1d15fd61a8f
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$
From the kube-apiserver container logs I can see "http: TLS handshake error from 172.24.4.159:50812: EOF":
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ sudo crictl --runtime-endpoint /run/containerd/containerd.sock logs -f a729fdd387b0a
Flag --insecure-port has been deprecated, This flag will be removed in a future version.
I0416 20:32:25.730809 1 server.go:596] external host was not specified, using 10.6.0.9
I0416 20:32:25.744220 1 server.go:150] Version: v1.17.3
......
......
I0416 20:33:46.816189 1 dynamic_cafile_content.go:166] Starting request-header::/etc/kubernetes/pki/front-proxy-ca.crt
I0416 20:33:46.816832 1 dynamic_cafile_content.go:166] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
I0416 20:33:46.833031 1 dynamic_serving_content.go:129] Starting serving-cert::/etc/kubernetes/pki/apiserver.crt::/etc/kubernetes/pki/apiserver.key
I0416 20:33:46.853958 1 secure_serving.go:178] Serving securely on [::]:6443
......
......
I0416 20:33:51.784715 1 log.go:172] http: TLS handshake error from 172.24.4.159:60148: EOF
I0416 20:33:51.786804 1 log.go:172] http: TLS handshake error from 172.24.4.159:60150: EOF
I0416 20:33:51.788984 1 log.go:172] http: TLS handshake error from 172.24.4.159:60158: EOF
I0416 20:33:51.790695 1 log.go:172] http: TLS handshake error from 172.24.4.159:60210: EOF
I0416 20:33:51.792577 1 log.go:172] http: TLS handshake error from 172.24.4.159:60214: EOF
I0416 20:33:51.793861 1 log.go:172] http: TLS handshake error from 172.24.4.159:60202: EOF
I0416 20:33:51.805506 1 log.go:172] http: TLS handshake error from 10.6.0.9:35594: EOF
I0416 20:33:51.806056 1 log.go:172] http: TLS handshake error from 172.24.4.159:60120: EOF
......
From syslog I can see that the apiserver serving certificate is signed for IP 172.24.4.159:
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ grep "apiserver serving cert is signed for DNS names" /var/log/syslog
Apr 16 15:25:56 ubu1910-medflavor-nolb3-control-plane-nh4hf cloud-init[652]: [certs] apiserver serving cert is signed for DNS names [ubu1910-medflavor-nolb3-control-plane-nh4hf kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.6.0.9 172.24.4.159]
From syslog I can also see that the kubelet service cannot reach the apiserver due to "net/http: TLS handshake timeout":
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ tail -F /var/log/syslog
Apr 16 19:36:18 ubu1910-medflavor-nolb3-control-plane-nh4hf kubelet[1504]: E0416 19:36:18.596206 1504 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1beta1.RuntimeClass: Get https://172.24.4.159:6443/apis/node.k8s.io/v1beta1/runtimeclasses?limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 16 19:36:19 ubu1910-medflavor-nolb3-control-plane-nh4hf containerd[568]: time="2021-04-16T19:36:19.202346090Z" level=error msg="failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Apr 16 19:36:19 ubu1910-medflavor-nolb3-control-plane-nh4hf kubelet[1504]: E0416 19:36:19.274089 1504 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Apr 16 19:36:20 ubu1910-medflavor-nolb3-control-plane-nh4hf kubelet[1504]: W0416 19:36:20.600457 1504 status_manager.go:530] Failed to get status for pod "kube-apiserver-ubu1910-medflavor-nolb3-control-plane-nh4hf_kube-system(24ec7abb1b94172adb053cf6fdd1648c)": Get https://172.24.4.159:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-ubu1910-medflavor-nolb3-control-plane-nh4hf: net/http: TLS handshake timeout
Apr 16 19:36:24 ubu1910-medflavor-nolb3-control-plane-nh4hf containerd[568]: time="2021-04-16T19:36:24.336699210Z" level=error msg="failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Apr 16 19:36:24 ubu1910-medflavor-nolb3-control-plane-nh4hf kubelet[1504]: E0416 19:36:24.379374 1504 controller.go:135] failed to ensure node lease exists, will retry in 7s, error: Get https://172.24.4.159:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ubu1910-medflavor-nolb3-control-plane-nh4hf?timeout=10s: context deadline exceeded
......
......
I also tried accessing the apiserver with curl, and I see:
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ curl http://172.24.4.159:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-ubu1910-medflavor-nolb3-control-plane-nh4hf
Client sent an HTTP request to an HTTPS server.
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$ curl https://172.24.4.159:6443/api/v1/namespaces/kube-system/pods/kube-apiserver-ubu1910-medflavor-nolb3-control-plane-nh4hf
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
ubuntu@ubu1910-medflavor-nolb3-control-plane-nh4hf:~$
Is there something wrong with the kube-apiserver certificates? Any idea how to proceed with troubleshooting?
Solution
If you want to see the details of the kube-api SSL certificate, you can use curl -k -v https://172.24.4.159:6443
or openssl s_client -connect 172.24.4.159:6443
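As an illustration of what to look for, here is a self-contained sketch of the same inspection, run against a throwaway self-signed certificate (the CN and SAN values below are made up to mirror the question; on the real host you would point openssl x509 at the output of the s_client command above, or at /etc/kubernetes/pki/apiserver.crt):

```shell
# Offline sketch: the live equivalents would be
#   curl -k -v https://172.24.4.159:6443
#   openssl s_client -connect 172.24.4.159:6443 </dev/null | openssl x509 -noout -text
set -e
cd "$(mktemp -d)"

# Fabricate a self-signed cert with SANs similar to a kube-apiserver serving cert
# (requires OpenSSL 1.1.1+ for -addext; all values here are examples).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout apiserver.key -out apiserver.crt \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:kubernetes,DNS:kubernetes.default,IP:10.96.0.1,IP:172.24.4.159"

# Print the issuer, subject and SANs: these are the fields to compare against
# the IP or DNS name the clients actually use.
openssl x509 -in apiserver.crt -noout -issuer -subject
openssl x509 -in apiserver.crt -noout -ext subjectAltName
```

The subjectAltName list must contain every IP and DNS name clients use to reach the apiserver.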
You didn't mention how you configured the certificates. SSL in Kubernetes is a complex beast, and setting up the certificates and all the communication manually can be very painful. That is why people use kubeadm nowadays.
TL;DR: you have to make sure all the certificates are signed by /etc/kubernetes/pki/ca.crt.
Since you mentioned "single node", I assume the kubelet is running as a systemd unit on the same server? And how is the kube-api container started? By the kubelet process itself, because you have the pod definition in /etc/kubernetes/manifests?
There are actually two ways of communication between kubelet and kube-api, and both are used at the same time:
1. kubelet connects and authenticates to kube-api using the information from its --kubeconfig=/etc/kubernetes/kubelet.conf argument (you can check it via ps -aux | grep kubelet). In that file you will find the connection string, the CA certificate, and the client certificate + key. The kubelet presents the client certificate from the file and validates the kube-api server certificate against the CA from the same file; kube-api validates the client certificate against the CA defined in its own --client-ca-file option.
2. kube-api connects to kubelet using the --kubelet-client-certificate and --kubelet-client-key options. This is probably not where the problem is.
Since you can see the SSL errors on the kube-api side and not on the kubelet side, I think there is a problem with the communication described in point 1: kubelet connecting to kube-api and authenticating to it. The errors are in the kube-api logs, so I would say kube-api has a problem validating the kubelet client certificate. So check that certificate inside --kubeconfig=/etc/kubernetes/kubelet.conf: you can base64-decode it and display its details via openssl or some online SSL certificate checker. The most important part is that it must be signed by the CA given in the kube-api --client-ca-file option.
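A minimal sketch of that decode-and-verify check, run here against fabricated files so it is self-contained (the real inputs would be /etc/kubernetes/kubelet.conf and the CA file from the kube-api --client-ca-file option, typically /etc/kubernetes/pki/ca.crt; the CN and file names below are made up):

```shell
set -e
cd "$(mktemp -d)"

# 1. Fabricate a CA and a CA-signed kubelet-style client certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout ca.key -out ca.crt -subj "/CN=kubernetes"
openssl req -newkey rsa:2048 -nodes \
  -keyout client.key -out client.csr -subj "/CN=system:node:demo/O=system:nodes"
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 1 -out client.crt

# 2. Embed the client cert base64-encoded, the way kubelet.conf stores it.
printf 'client-certificate-data: %s\n' "$(base64 -w0 client.crt)" > kubelet.conf

# 3. Extract, decode and inspect it; the same two commands work on the real file.
awk '/client-certificate-data/ {print $2}' kubelet.conf | base64 -d > decoded.crt
openssl x509 -in decoded.crt -noout -subject -issuer

# 4. The crucial check: the client cert must verify against the kube-api CA.
openssl verify -CAfile ca.crt decoded.crt
```

If the last command does not print "decoded.crt: OK" against the real ca.crt, you have found the broken link in the chain.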
This all takes a lot of effort, and honestly the simplest route you could take is to throw everything away and use kubeadm to bootstrap the single-node cluster.
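For reference, a typical kubeadm single-node bootstrap looks roughly like the following; treat it as a sketch to run as root on a fresh machine, not a definitive recipe (the pod CIDR and the Calico manifest URL are examples, and the taint name matches the v1.17-era default):

```shell
# Initialize the control plane (example pod network CIDR).
kubeadm init --pod-network-cidr=192.168.0.0/16

# Make kubectl work for your user.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install a CNI plugin (Calico shown as an example).
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Allow regular workloads on the control-plane node (single-node cluster).
kubectl taint nodes --all node-role.kubernetes.io/master-
```

kubeadm generates a consistent set of certificates under /etc/kubernetes/pki, all signed by the cluster CA, which sidesteps the class of problem described above.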