ambari-agent restart causes ansible to crash

Problem description

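We have a big data Hadoop cluster based on Hortonworks HDP 2.6.4 and Ambari 2.6.1.

All machines run RHEL 7.2.

The cluster has more than 540 machines. The Ambari server is installed on only one machine, while ambari-agent is installed on every machine and communicates with the Ambari server.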

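Before we started using ansible, everything worked fine whenever we upgraded ambari-agent and restarted it.

Recently, however, we started using ansible (ansible-playbook) to automate the installation,

and ansible runs against all machines.

When a task executes ambari-agent restart, we quickly see the ansible execution stop and get killed.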

After some investigation, we found that the ambari agent uses the following ports:

url_port = 8440
secured_url_port = 8441
ping_port = 8670

However, I did not see any ansible process using the ports above, so we think the ports are unrelated.
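To confirm which of the ports above actually accept connections on a given host, a small probe such as the following could be used. This is only a sketch: `port_is_open` and `check_ports` are hypothetical helper names, and the port numbers are the ones quoted from the agent configuration. Note that ansible normally reaches remote hosts over SSH by default, which is consistent with no ansible process appearing on these ports.

```python
import socket

# Ports copied from the agent configuration quoted in this question.
AMBARI_PORTS = {"url_port": 8440, "secured_url_port": 8441, "ping_port": 8670}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value on failure.
        return sock.connect_ex((host, port)) == 0


def check_ports(host, ports=AMBARI_PORTS):
    """Map each named port to whether it accepts connections on host."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_ports("datanode02.gtfactory.com")` before and after the restart task would show whether any of these listeners changes state at the moment ansible dies.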

The basic problem is clear, though:

when the ansible task ambari-agent restart runs on a remote machine, it interrupts ansible and ansible gets killed.

The ambari-agent configuration is as follows:

[server]
hostname = datanode02.gtfactory.com
url_port = 8440
secured_url_port = 8441
connect_retry_delay = 10
max_reconnect_retry_delay = 30

[agent]
logdir = /var/log/ambari-agent
piddir = /var/run/ambari-agent
prefix = /var/lib/ambari-agent/data
loglevel = INFO
data_cleanup_interval = 86400
data_cleanup_max_age = 2592000
data_cleanup_max_size_mb = 100
ping_port = 8670
cache_dir = /var/lib/ambari-agent/cache
tolerate_download_failures = true
run_as_user = root
parallel_execution = 0
alert_grace_period = 5
status_command_timeout = 5
alert_kinit_timeout = 14400000
system_resource_overrides = /etc/resource_overrides

[security]
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell

[network]
use_system_proxy_settings = true

[services]
pidlookuppath = /var/run/

[heartbeat]
state_interval_seconds = 60
dirs = /etc/hadoop,/etc/hadoop/conf,/etc/hbase,/etc/hcatalog,/etc/hive,/etc/oozie,/etc/sqoop,/var/run/hadoop,/var/run/zookeeper,/var/run/hbase,/var/run/templeton,/var/run/oozie,/var/log/hadoop,/var/log/zookeeper,/var/log/hbase,/var/log/hive
log_lines_count = 300
idle_interval_min = 1
idle_interval_max = 10

[logging]
syslog_enabled = 0
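When comparing this configuration across many hosts, it can help to pull out just the connection settings programmatically. The sketch below uses Python's standard `configparser`; `load_agent_endpoints` is a hypothetical helper name, and it assumes the file follows the INI layout shown above.

```python
import configparser


def load_agent_endpoints(ini_text):
    """Extract server host and port settings from ambari-agent INI text."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        "hostname": parser.get("server", "hostname"),
        "url_port": parser.getint("server", "url_port"),
        "secured_url_port": parser.getint("server", "secured_url_port"),
        "ping_port": parser.getint("agent", "ping_port"),
    }
```

Passing the text of the agent's INI file to this function returns a small dictionary that can be diffed between a host where the restart works and one where it does not.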

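We are currently considering the following:

ansible may be crashing because TLSv1 (Transport Layer Security) is restricted, and we believe ambari-agent connects using TLSv1.

So we are thinking of setting force_https_protocol=PROTOCOL_TLSv1_2 in the ambari agent configuration, but this is only an assumption.

Would the following new configuration, which is our proposal, help?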

[security]
force_https_protocol=PROTOCOL_TLSv1_2     <------ the new update
keysdir = /var/lib/ambari-agent/keys
server_crt = ca.crt
passphrase_env_var_name = AMBARI_PASSPHRASE
ssl_verify_cert = 0
credential_lib_dir = /var/lib/ambari-agent/cred/lib
credential_conf_dir = /var/lib/ambari-agent/cred/conf
credential_shell_cmd = org.apache.hadoop.security.alias.CredentialShell
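If the change above is to be tried on many hosts, the edit itself can be scripted. The following is a minimal sketch, assuming the configuration is plain INI text; `set_force_https_protocol` is a hypothetical helper name. It only performs the edit proposed in this question; whether the setting has any effect on the ansible problem remains unverified. The "<------ the new update" marker shown above is an annotation and must not be written into the real file.

```python
import configparser
import io


def set_force_https_protocol(ini_text, protocol="PROTOCOL_TLSv1_2"):
    """Return ini_text with force_https_protocol set in the [security] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the original case of option names
    parser.read_string(ini_text)
    if not parser.has_section("security"):
        parser.add_section("security")
    parser.set("security", "force_https_protocol", protocol)
    out = io.StringIO()
    # Note: configparser drops comments when it writes the file back out.
    parser.write(out)
    return out.getvalue()
```

Because comments are lost on rewrite, keeping a backup copy of the original file before replacing it would be prudent.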

Solution

No effective solution to this problem has been found yet.
