K8S on CentOS 7, kernel 3.10.0-862.3.2.el7.x86_64: OOM triggers a kernel lock-up that hangs the node

Solution: upgrade the system kernel.

Test: triggering the OOM problem

Test kernel: 3.10.0-862.3.2.el7.x86_64

Seven pods that deliberately trigger OOM were started on a single node. The test showed that the 3.10 kernel creates the seven tasks in parallel, they hit the OOM killer at the same time, and the kernel deadlocks on its locks. Within 2-3 minutes of the test the server hung; the simulation kept triggering OOM until the CPUs were exhausted and the server rebooted on its own.
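For reference, the test workload was a deployment similar in spirit to the sketch below. The image, memory limit and stress parameters are my assumptions (the original ngx-pod spec is not shown), so treat this only as one way to reproduce a per-pod OOM on a chosen node:

# Minimal sketch: 7 replicas pinned to the node under test, each allocating
# more memory than its cgroup limit so the oom-killer fires repeatedly.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-test
spec:
  replicas: 7
  selector:
    matchLabels:
      app: oom-test
  template:
    metadata:
      labels:
        app: oom-test
    spec:
      nodeName: k8snode6            # pin all replicas to the node under test
      containers:
      - name: stress
        image: polinux/stress       # assumed stress image; any memory hog works
        command: ["stress"]
        args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
        resources:
          limits:
            memory: "100Mi"         # limit well below the 250M the container allocates
EOF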

kernel: BUG: soft lockup - CPU#4 stuck for 22s! [handler20:1542]    (messages of this kind are also a 3.10 kernel bug)

Nov  6 10:42:55 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,order=0,oom_score_adj=-998
Nov  6 10:42:55 GFS-6 kernel: runc:[1:CHILD] cpuset=c156bcb333882b0a8de413c6e7cbe73867d388dc63d99c7b72d926aa6e669b6a mems_allowed=0
Nov  6 10:43:02 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:02 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:03 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:03 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:07 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:07 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:08 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:08 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:09 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:09 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:11 GFS-6 kernel: runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:11 GFS-6 kernel: runc:[1:CHILD] invoked oom-killer: gfp_mask=0xd0,oom_score_adj=-998
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuset
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpu
Nov  6 10:43:58 GFS-6 kernel: Initializing cgroup subsys cpuacct
Nov  6 10:43:58 GFS-6 kernel: setup_percpu: NR_CPUS:5120 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
Nov  6 10:43:58 GFS-6 kernel: PERCPU: Embedded 35 pages/cpu @ffff96fa7fc00000 s104856 r8192 d30312 u262144
Nov  6 10:43:58 GFS-6 kernel: #011RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=8.
Nov  6 10:43:58 GFS-6 kernel: core: CPUID marked event: 'cpu cycles' unavailable
Nov  6 10:43:58 GFS-6 kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
Nov  6 10:43:58 GFS-6 kernel: NMI watchdog: Shutting down hard lockup detector on all cpus    <<-- all CPUs are gone; the server reboots itself
Nov  6 10:46:02 GFS-6 systemd: Started Docker Application Container Engine.    <<-- the server has come back up after the reboot
Nov  6 10:46:02 GFS-6 systemd: Reached target Multi-User System.
Nov  6 10:46:02 GFS-6 systemd: Starting Multi-User System.
Nov  6 10:46:02 GFS-6 systemd: Starting Update UTMP about System Runlevel Changes...
Nov  6 10:46:02 GFS-6 systemd: Started Update UTMP about System Runlevel Changes.
Nov  6 10:46:02 GFS-6 systemd: Startup finished in 1.456s (kernel) + 4.661s (initrd) + 9.786s (userspace) = 15.904s.
Nov  6 10:46:05 GFS-6 systemd: kubelet.service holdoff time over, scheduling restart.
Nov  6 10:46:05 GFS-6 systemd: Starting kubelet: The Kubernetes Node Agent...
Nov  6 10:46:05 GFS-6 systemd: Started kubelet: The Kubernetes Node Agent.

Kubernetes can no longer manage the node; the node and every pod on it are down.

[root@k8s-m1 test]# kubectl get po -o wide --all-namespaces |grep k8snode6
default                            ngx-pod-6f977cf846-7k4vm                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-85mtx                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-hsf6x                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-lt68h                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-mqvcf                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-rmxzj                              0/1       ContainerCreating   0          2m        <none>           k8snode6
default                            ngx-pod-6f977cf846-sgvrd                              0/1       ContainerCreating   0          2m        <none>           k8snode6
kube-system                        kube-proxy-9mtnw                                      0/1       Error               3          125d      10.80.136.179    k8snode6
monitoring                         kube-prometheus-node-exporter-xbf9k                   0/1       Error               1          63d       10.80.136.179    k8snode6

Upgrade the kernel to 4.19.1 and repeat the OOM test

Seven pods that deliberately trigger OOM were again started on a single node.

Test kernel: 4.19.1-1.el7.elrepo.x86_64

The test showed that the 4.19 kernel does not create the tasks in parallel, and so far the kernel lock bug cannot be triggered.
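While the test ran on the 4.19 node, the OOM activity and the node's health can be watched with nothing more than the commands below (the hostname is the one used in this test):

# On the node: count how many times the oom-killer has fired
grep -c 'invoked oom-killer' /var/log/messages
# On the master: the node should stay Ready and keep reporting its kernel version
kubectl get nodes k8snode7 -o wide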

[root@k8snode7-180v136-taiji ~]# tail -f /var/log/messages|grep oom_kill
Nov  6 11:32:58 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:32:59 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:00 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:01 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:02 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:03 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:04 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:05 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:06 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:07 GFS-7 kernel: oom_kill_process+0x262/0x290
Nov  6 11:33:08 GFS-7 kernel: oom_kill_process+0x262/0x290
......................
[root@k8s-m1 test]# kubectl get po --all-namespaces -o wide |grep k8snode7
default                            ngx-pod-74c88d6495-79krh                              0/1       ContainerCreating   0          33m       <none>           k8snode7
kube-system                        kube-proxy-xt4c7                                      1/1       Running             1          55d       10.80.136.180    k8snode7
monitoring                         kube-prometheus-node-exporter-bbsjn                   1/1       Running             1          60d       10.80.136.180    k8snode7

Summary: as a canary, upgrade the kernel to 4.19.1 on part of the servers first; more details to follow.
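For the canary rollout itself, it is safer to move the workload off a node before touching its kernel. A possible sequence (the drain flags match the kubectl versions of that time and may be named differently in newer releases) is:

kubectl cordon k8snode6                  # stop new pods from landing on the node
kubectl drain k8snode6 --ignore-daemonsets --delete-local-data
# upgrade the kernel and reboot, following the procedure below
kubectl uncordon k8snode6                # put the node back into rotation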

Kernel upgrade procedure

1. Add the ELRepo repository
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
2. List the available kernel packages
yum --disablerepo="*" --enablerepo="elrepo-kernel" list available
3. Install the latest mainline stable (kernel-ml) packages
yum --enablerepo=elrepo-kernel install kernel-ml kernel-ml-devel -y
Check the default boot order
awk -F\' '$1=="menuentry " {print $2}' /etc/grub2.cfg
The boot entries are numbered from 0 and the newly installed kernel is inserted at the top of the list (the old kernel drops to entry 1 while the new kernel sits at entry 0), so entry 0 has to be selected. To make the new kernel take effect:
 
grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
cat /boot/grub2/grub.cfg
yum remove kernel-3.10.0-327.el7.x86_64 kernel-devel-3.10.0-327.el7.x86_64 -y
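After switching the default entry, reboot and confirm that the node really came up on the new kernel; kubectl shows the kernel version reported by the kubelet:

reboot
# on the node, once it is back:
uname -r                    # expect 4.19.1-1.el7.elrepo.x86_64 (or the ml version installed)
# on the master, the KERNEL-VERSION column should match:
kubectl get nodes -o wide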

Installing a specific kernel version

Other archived kernel versions can be downloaded from the mirror linked below.

Either the mainline (ml) kernel installed above or one of the archived versions can be used; the steps below work for whichever version is chosen.

Installing a kernel version of your choice:

export Kernel_Version=4.18.9-1    # or any other archived version, e.g. 4.20.13-1
wget http://mirror.rc.usf.edu/compute_lock/elrepo/kernel/el7/x86_64/RPMS/kernel-ml{,-devel}-${Kernel_Version}.el7.elrepo.x86_64.rpm
yum localinstall -y kernel-ml*
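After yum localinstall the GRUB default still has to be switched to the new entry, exactly as in the procedure above; grubby is a convenient cross-check of which kernel will actually boot (the vmlinuz name depends on the version chosen):

grub2-set-default 0
grub2-mkconfig -o /boot/grub2/grub.cfg
grubby --default-kernel     # should print the /boot/vmlinuz-... path of the new kernel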
