Kong deployment on k8s fails after a seemingly innocent EKS worker AMI upgrade

Problem description

After upgrading the AWS EKS worker AMI to a new version, our Kong deployment on k8s started failing.
Kong version: 1.4
Old AMI version: amazon-eks-node-1.14-v20200423
New AMI version: amazon-eks-node-1.14-v20200723
Kubernetes version: 1.14

I noticed that the new AMI ships a newer Docker version, 19.03.6, while the old one shipped 18.09.9. Could this be causing the problem?
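To check which Docker version each node is actually running, it is reported on the node objects themselves (assuming kubectl access to the cluster):

# the CONTAINER-RUNTIME column shows the runtime and its version per node
kubectl get nodes -o wide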

I can see many signal 9 exits in the Kong pod logs:

2020/08/11 09:00:48 [notice] 1#0: using the "epoll" event method
2020/08/11 09:00:48 [notice] 1#0: openresty/1.15.8.2
2020/08/11 09:00:48 [notice] 1#0: built by gcc 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
2020/08/11 09:00:48 [notice] 1#0: OS: Linux 4.14.181-140.257.amzn2.x86_64
2020/08/11 09:00:48 [notice] 1#0: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2020/08/11 09:00:48 [notice] 1#0: start worker processes
2020/08/11 09:00:48 [notice] 1#0: start worker process 38
2020/08/11 09:00:48 [notice] 1#0: start worker process 39
2020/08/11 09:00:48 [notice] 1#0: start worker process 40
2020/08/11 09:00:48 [notice] 1#0: start worker process 41
2020/08/11 09:00:50 [notice] 1#0: signal 17 (SIGCHLD) received from 40
2020/08/11 09:00:50 [alert] 1#0: worker process 40 exited on signal 9
2020/08/11 09:00:50 [notice] 1#0: start worker process 42
2020/08/11 09:00:51 [notice] 1#0: signal 17 (SIGCHLD) received from 39
2020/08/11 09:00:51 [alert] 1#0: worker process 39 exited on signal 9
2020/08/11 09:00:51 [notice] 1#0: start worker process 43
2020/08/11 09:00:52 [notice] 1#0: signal 17 (SIGCHLD) received from 41
2020/08/11 09:00:52 [alert] 1#0: worker process 41 exited on signal 9
2020/08/11 09:00:52 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:52 [notice] 1#0: start worker process 44
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:243: randomseed(): seeding PRNG from OpenSSL RAND_bytes()
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] globalpatches.lua:269: randomseed(): random seed: 255136921215 for worker nb 0
2020/08/11 09:00:48 [debug] 38#0: *1 [lua] events.lua:211: do_event_json(): worker-events: handling event; source=resty-worker-events,event=started,pid=38,data=nil
2020/08/11 09:00:48 [notice] 38#0: *1 [lua] cache_warmup.lua:42: cache_warmup_single_entity(): preloading 'services' into the cache ..., context: init_worker_by_lua*
2020/08/11 09:00:48 [warn] 38#0: *1 [lua] socket.lua:159: tcp(): no support for cosockets in this context, falling back to LuaSocket, context: init_worker_by_lua*
2020/08/11 09:00:53 [notice] 1#0: signal 17 (SIGCHLD) received from 38
2020/08/11 09:00:53 [alert] 1#0: worker process 38 exited on signal 9
2020/08/11 09:00:53 [notice] 1#0: start worker process 45
2020/08/11 09:00:54 [notice] 1#0: signal 17 (SIGCHLD) received from 42
2020/08/11 09:00:54 [alert] 1#0: worker process 42 exited on signal 9
2020/08/11 09:00:54 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:54 [notice] 1#0: start worker process 46
2020/08/11 09:00:55 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:55 [notice] 1#0: signal 17 (SIGCHLD) received from 43
2020/08/11 09:00:55 [alert] 1#0: worker process 43 exited on signal 9
2020/08/11 09:00:55 [notice] 1#0: start worker process 47
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 44
2020/08/11 09:00:56 [alert] 1#0: worker process 44 exited on signal 9
2020/08/11 09:00:56 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:56 [notice] 1#0: start worker process 48
2020/08/11 09:00:56 [notice] 1#0: signal 17 (SIGCHLD) received from 45
2020/08/11 09:00:56 [alert] 1#0: worker process 45 exited on signal 9
2020/08/11 09:00:58 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:58 [notice] 1#0: start worker process 49
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 46
2020/08/11 09:00:59 [alert] 1#0: worker process 46 exited on signal 9
2020/08/11 09:00:59 [notice] 1#0: signal 29 (SIGIO) received
2020/08/11 09:00:59 [notice] 1#0: start worker process 50
2020/08/11 09:00:59 [notice] 1#0: signal 17 (SIGCHLD) received from 47

The only other significant message is:
[crit] 235#0: *45 [lua] balancer.lua:749: init(): Failed loading initial list of upstreams: Failed to get from node cache: Could not acquire callback lock: timeout, context: ngx.timer

Looking at kubectl describe pod kong..., I can see OOMKilled. Could this be a memory issue?
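The termination reason can also be confirmed without scanning the whole describe output by querying the container status directly (a sketch; <kong-pod> is a placeholder for the real pod name):

# print each container's last termination reason, e.g. OOMKilled
kubectl get pod <kong-pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'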

Solution

The new node AMI changed the nofile ulimit to 1048576, a big jump from the previous 65536, and this caused memory problems for our current Kong setup, preventing it from deploying. The getrlimit(RLIMIT_NOFILE): 1048576:1048576 line in the logs above confirms the new limit; since Kong's nginx appears to size its per-worker connection structures from that limit, the 16x increase inflated each worker's startup memory footprint past the pod's memory limit, and the workers were OOMKilled.
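The effective limit can be verified from inside a running container, and the node-level default reverted through the Docker daemon configuration. A sketch, assuming the AMI applies the limit via the default-ulimits key in /etc/docker/daemon.json (<kong-pod> is a placeholder):

# confirm the limit the Kong master process actually received
kubectl exec <kong-pod> -- cat /proc/1/limits

# on each worker node: lower the default nofile soft limit back to 65536.
# NOTE: this overwrites daemon.json for brevity; in practice merge the
# default-ulimits key into the node's existing configuration
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 65536, "Hard": 1048576 }
  }
}
EOF
sudo systemctl restart docker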

Changing the file limit on the new nodes back to the previous value (as sketched above) fixed the Kong deployment.
We ultimately decided to increase Kong's memory requests instead, which also resolved the issue.
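A minimal sketch of that change (assuming the deployment is simply named kong; the values here are illustrative, not the ones we used):

# raise the memory request and limit on the Kong deployment
kubectl set resources deployment kong --requests=memory=1Gi --limits=memory=2Gi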

Related GitHub issue
