Kong deployment on k8s fails after a seemingly innocent EKS worker AMI upgrade

Problem description

After upgrading the AWS EKS worker AMI to a new version, our Kong deployment on k8s started failing.
Kong version: 1.4
Old AMI version: amazon-eks-node-1.14-v20200423
New AMI version: amazon-eks-node-1.14-v20200723
Kubernetes version: 1.14

I noticed that the new AMI ships a newer Docker version, 19.03.6, while the old one shipped 18.09.9. Could this be causing the problem?
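To check which Docker version each node is actually running, it is reported on the node objects themselves (assuming kubectl access to the cluster):

# the CONTAINER-RUNTIME column shows the runtime and its version per node
kubectl get nodes -o wide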

I can see many signal 9 exits in the Kong pod logs:

2020/08/11 09:00:48 [notice] 1#0: using the "epoll" event method
2020/08/11 09:00:48 [notice] 1#0: openresty/1.15.8.2
2020/08/11 09:00:48 [notice] 1#0: built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
2020/08/11 09:00:48 [notice] 1#0: OS: Linux 4.14.181-140.257.amzn2.x86_64
2020/08/11 09:00:48 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/11 09:00:48 [notice] 1#0: start worker processes
2020/08/11 09:00:48 [notice] 1#0: start worker process 38
2020/08/11 09:00:48 [notice] 1#0: start worker process 39
2020/08/11 09:00:48 [notice] 1#0: start worker process 40
2020/08/11 09:00:48 [notice] 1#0: start worker process 41
2020/08/11 09:00:50 [notice] 1#0: signal 17 (SIGCHLD) received from 40
2020/08/11 09:00:50 [alert] 1#0: worker process 40 exited on signal 9
2020/08/11 09:00:50 [notice] 1#0: start worker process 42
2020/08/11 09:00:51 [notice] 1#0: signal 17 (SIGCHLD) received from 39
2020/08/11 09:00:51 [alert] 1#0: worker process 39 exited on signal 9
2020/08/11 09:00:51 [notice] 1#0: start worker process 43
2020/08/11 09:00:52 [notice] 1#0: signal 17 (SIGCHLD) received from 41
2020/08/11 09:00:52 [alert] 1#0: worker process 41 exited on signal 9
2020/08/11 09:00:52 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:52 [notice] 1#0: start worker process 44
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:269: randomseed(): random seed: 255136921215 for worker nb 0
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events,event=started,pid=38,data=nil
2020/08/11 09:00:48 [notice] 38#0: *1 [lua] cache_warmup.lua:42: cache_warmup_single_entity(): preloading 'services' into the cache ..., context: init_worker_by_lua*
2020/08/11 09:00:48 [warn] 38#0: *1 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/08/11 09:00:53 [notice] 1#0: signal 17 (SIGCHLD) received from 38
2020/08/11 09:00:53 [alert] 1#0: worker process 38 exited on signal 9
2020/08/11 09:00:53 [notice] 1#0: start worker process 45
2020/08/11 09:00:54 [notice] 1#0: signal 17 (SIGCHLD) received from 42
2020/08/11 09:00:54 [alert] 1#0: worker process 42 exited on signal 9
2020/08/11 09:00:54 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:54 [notice] 1#0: start worker process 46
2020/08/11 09:00:55 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:55 [notice] 1#0: signal 17 (SIGCHLD) received from 43
2020/08/11 09:00:55 [alert] 1#0: worker process 43 exited on signal 9
2020/08/11 09:00:55 [notice] 1#0: start worker process 47
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 44
2020/08/11 09:00:56 [alert] 1#0: worker process 44 exited on signal 9
2020/08/11 09:00:56 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:56 [notice] 1#0: start worker process 48
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 45
2020/08/11 09:00:56 [alert] 1#0: worker process 45 exited on signal 9
2020/08/11 09:00:58 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:58 [notice] 1#0: start worker process 49
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 46
2020/08/11 09:00:59 [alert] 1#0: worker process 46 exited on signal 9
2020/08/11 09:00:59 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:59 [notice] 1#0: start worker process 50
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 47

The only other significant message is:
[crit] 235#0: *45 [lua] balancer.lua:749: init(): Failed loading initial list of upstreams: Failed to get from node cache: Could not acquire callback lock: timeout, context: ngx.timer

Looking at kubectl describe pod kong..., I can see OOMKilled. Could this be a memory issue?
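The termination reason can also be confirmed without scanning the whole describe output by querying the container status directly (a sketch; <kong-pod> is a placeholder for the real pod name):

# print each container's last termination reason, e.g. OOMKilled
kubectl get pod <kong-pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'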

Solution

The new node AMI changed the nofile ulimit to 1048576, a big jump from the previous 65536, and this caused memory problems for our current Kong setup, preventing it from deploying. The getrlimit(RLIMIT_NOFILE): 1048576:1048576 line in the logs above confirms the new limit; since Kong's nginx appears to size its per-worker connection structures from that limit, the 16x increase inflated each worker's startup memory footprint past the pod's memory limit, and the workers were OOMKilled.
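The effective limit can be verified from inside a running container, and the node-level default reverted through the Docker daemon configuration. A sketch, assuming the AMI applies the limit via the default-ulimits key in /etc/docker/daemon.json (<kong-pod> is a placeholder):

# confirm the limit the Kong master process actually received
kubectl exec <kong-pod> -- cat /proc/1/limits

# on each worker node: lower the default nofile soft limit back to 65536.
# NOTE: this overwrites daemon.json for brevity; in practice merge the
# default-ulimits key into the node's existing configuration
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 65536, "Hard": 1048576 }
  }
}
EOF
sudo systemctl restart docker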

Changing the file limit on the new nodes back to the previous value (as sketched above) fixed the Kong deployment.
We ultimately decided to increase Kong's memory requests instead, which also resolved the issue.
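A minimal sketch of that change (assuming the deployment is simply named kong; the values here are illustrative, not the ones we used):

# raise the memory request and limit on the Kong deployment
kubectl set resources deployment kong --requests=memory=1Gi --limits=memory=2Gi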

Related GitHub issue
