Cloudera Manager - Service Monitor crashing

Problem description

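I have been using Cloudera Manager since last year, running 20+ nodes. Recently I started seeing heap size problems in the Service Monitor role. I increased the heap from 3 GB to 4 GB, then to 5 GB, and then to 6 GB. Even so, the Service Monitor still crashes and restarts from time to time, and while that happens the whole dashboard looks bad. What do I need to do to fix this?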

The logs are:

2021-04-26 16:10:34,938 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): paused approximately 20583ms: GC pool 'G1 Young Generation' had collection(s): count=2 time=182ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=20877ms
2021-04-26 16:11:34,862 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): paused approximately 19870ms: GC pool 'G1 Young Generation' had collection(s): count=2 time=131ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=20228ms
2021-04-26 16:12:35,132 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): paused approximately 20427ms: GC pool 'G1 Young Generation' had collection(s): count=3 time=149ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=20733ms
2021-04-26 16:13:36,415 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): paused approximately 19008ms: GC pool 'G1 Young Generation' had collection(s): count=1 time=104ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=19381ms

Can you help me resolve this?

Solution

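Service Monitor can need more memory depending on the number of hosts in the cluster, the types of services, and the number of entities currently being monitored. Cloudera's documentation gives explicit sizing guidance based on these factors.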

You may need to keep increasing the heap size as cluster usage grows. The same page also has some tuning tips, such as using G1GC.
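To judge whether a heap increase or a GC tuning change is actually helping, it can be useful to measure the pauses reported in the Service Monitor log before and after the change. The following is a minimal sketch in Python, not a Cloudera tool: the helper names `parse_pauses` and `summarize` are hypothetical, and the regular expressions assume the JvmPauseMonitor message format shown in the log excerpt from the question. The sample data is the first two lines of that excerpt.

```python
import re

# Matches the start of one JvmPauseMonitor warning, in the format seen in the
# log excerpt from the question (an assumption about the log format).
PAUSE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN "
    r"com\.cloudera\.enterprise\.debug\.JvmPauseMonitor: .*?"
    r"paused approximately (?P<pause_ms>\d+)ms"
)

# Matches each per-pool GC summary inside a warning.
POOL_RE = re.compile(
    r"GC pool '(?P<pool>[^']+)' had collection\(s\): "
    r"count=(?P<count>\d+) time=(?P<time_ms>\d+)ms"
)


def parse_pauses(text):
    """Return one dict per JvmPauseMonitor warning found in text."""
    matches = list(PAUSE_RE.finditer(text))
    events = []
    for i, m in enumerate(matches):
        # Pool summaries for this warning end where the next warning begins.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        pools = {
            p.group("pool"): int(p.group("time_ms"))
            for p in POOL_RE.finditer(text, m.end(), end)
        }
        events.append(
            {"ts": m.group("ts"), "pause_ms": int(m.group("pause_ms")), "pools": pools}
        )
    return events


def summarize(events):
    """Aggregate pause counts and durations across parsed warnings."""
    pauses = [e["pause_ms"] for e in events]
    return {
        "count": len(pauses),
        "max_pause_ms": max(pauses) if pauses else 0,
        "mean_pause_ms": sum(pauses) / len(pauses) if pauses else 0.0,
        # Total time attributed to full (old generation) collections.
        "old_gen_ms": sum(e["pools"].get("G1 Old Generation", 0) for e in events),
    }


# First two warnings from the log excerpt in the question.
SAMPLE = (
    "2021-04-26 16:10:34,938 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): "
    "paused approximately 20583ms: GC pool 'G1 Young Generation' had collection(s): "
    "count=2 time=182ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=20877ms\n"
    "2021-04-26 16:11:34,862 WARN com.cloudera.enterprise.debug.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (e.g. a stop the world GC,or JVM not scheduled): "
    "paused approximately 19870ms: GC pool 'G1 Young Generation' had collection(s): "
    "count=2 time=131ms,GC pool 'G1 Old Generation' had collection(s): count=1 time=20228ms\n"
)

print(summarize(parse_pauses(SAMPLE)))
```

In the excerpt from the question, each warning reports a G1 Old Generation collection of roughly 20 seconds, about once a minute. If the same summary shows shorter or rarer old generation collections after a change, that suggests the change is helping.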

HBase, Solr, Kafka, and Kudu are known to generate a large number of entities and so increase the Service Monitor heap requirement.

If you have a Cloudera support subscription, please file a case to get official help from the experts.