Elasticsearch unassigned shards alert every few hours

Problem description

Our cluster has 3 Elasticsearch data pods / 3 master pods / 1 client pod and 1 exporter. The problem is the alert "Elasticsearch unassigned shards due to circuit breaking exception". You can find further context in this question.

Now, making a curl call to http://localhost:9200/_nodes/stats, I checked the heap usage of the individual data pods.

elasticsearch-data-0, 1 and 2 show a heap_used_percent of 68%, 61% and 63% respectively.
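
For reference, the same heap numbers can also be pulled in one line with the _cat/nodes API (assuming the same localhost access as the calls above):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'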

I also ran the API calls below and can see that the shards are distributed almost evenly.

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-0 | wc -l

145

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-1 | wc -l

145

curl -s http://localhost:9200/_cat/shards | grep elasticsearch-data-2 | wc -l

142
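
The per-node shard counts (together with disk usage) can also be cross-checked with the _cat/allocation API, again assuming the same local endpoint:

curl -s 'http://localhost:9200/_cat/allocation?v'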

Below is the output of the cluster allocation explain curl call:

curl -s http://localhost:9200/_cluster/allocation/explain | python -m json.tool

{
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes","can_allocate": "no","current_state": "unassigned","index": "graph_24_18549","node_allocation_decisions": [
        {
            "deciders": [
                {
                    "decider": "max_retry","decision": "NO","explanation": "shard has exceeded the maximum number of retries [50] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry,[unassigned_info[[reason=ALLOCATION_FAILED],at[2020-10-31T09:18:44.115Z],failed_attempts[50],delayed=false,details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0],node[nodeid1],[R],recovery_source[peer recovery],s[INITIALIZING],a[id=someid],unassigned_info[[reason=ALLOCATION_FAILED],at[2020-10-31T09:16:42.146Z],failed_attempts[49],details[failed shard on node [nodeid2]: failed to perform indices:data/write/bulk[s] on replica [graph_24_18549][0],node[nodeid2],a[id=someid2],at[2020-10-31T09:15:05.849Z],failed_attempts[48],details[failed shard on node [nodeid1]: failed to perform indices:data/write/bulk[s] on replica [tsg_ngf_graph_1_mtermmetrics1_vertex_24_18549][0],a[id=someid3],at[2020-10-31T09:11:50.143Z],failed_attempts[47],node[o_9jyrmOSca9T12J4bY0Nw],a[id=someid4],at[2020-10-31T09:08:10.182Z],failed_attempts[46],a[id=someid6],at[2020-10-31T09:07:03.102Z],failed_attempts[45],a[id=someid7],at[2020-10-31T09:05:53.267Z],failed_attempts[44],a[id=someid8],at[2020-10-31T09:04:24.507Z],failed_attempts[43],a[id=someid9],at[2020-10-31T09:03:02.018Z],failed_attempts[42],a[id=someid10],at[2020-10-31T09:01:38.094Z],failed_attempts[41],details[failed shard on node [nodeid1]: failed recovery,failure RecoveryFailedException[[graph_24_18549][0]: Recovery failed from {elasticsearch-data-2}{}{} into {elasticsearch-data-1}{}{}{IP}{IP:9300}]; nested: RemoteTransportException[[elasticsearch-data-2][IP:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large,data for [<transport_request>] would be [2012997826/1.8gb],which is larger than the limit of [1972122419/1.8gb],real usage: [2012934784/1.8gb],new bytes reserved: [63042/61.5kb]]; ],allocation_status[no_attempt]],expected_shard_size[4338334540],failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[engine is closed]; ],expected_shard_size[5040039519],failure RemoteTransportException[[elasticsearch-data-1][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large,data for [<transport_request>] would be [2452709390/2.2gb],real usage: [2060112120/1.9gb],new bytes reserved: [392597270/374.4mb]]; ],expected_shard_size[2606804616],expected_shard_size[4799579998],expected_shard_size[4012459974],data for [<transport_request>] would be [2045921066/1.9gb],real usage: [1770141176/1.6gb],new bytes reserved: [275779890/263mb]]; ],expected_shard_size[3764296412],expected_shard_size[2631720247],data for [<transport_request>] would be [2064366222/1.9gb],real usage: [1838754456/1.7gb],new bytes reserved: [225611766/215.1mb]]; ],expected_shard_size[3255872204],failure RemoteTransportException[[elasticsearch-data-0][IP:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large,data for [<transport_request>] would be [2132674062/1.9gb],real usage: [1902340880/1.7gb],new bytes reserved: [230333182/219.6mb]]; ],expected_shard_size[2956220256],data for [<transport_request>] would be [2092139364/1.9gb],real usage: [1855009224/1.7gb],new bytes reserved: [237130140/226.1mb]]; ],allocation_status[no_attempt]]]"
                },
                {
                    "decider": "same_shard",
                    "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[graph_24_18549][0],[P],s[STARTED],a[id=someid]]"
                }
            ],
            "node_decision": "no",
            "node_id": "nodeid2",
            "node_name": "elasticsearch-data-2",
            "transport_address": "IP:9300"
        }
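
The explain API can also be pointed at the specific replica from the failure message above by passing a request body (index, shard and primary are standard parameters; the values here are taken from the log):

curl -s -H 'Content-Type: application/json' -XGET 'http://localhost:9200/_cluster/allocation/explain' -d '{"index": "graph_24_18549", "shard": 0, "primary": false}' | python -m json.tool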

What needs to be done now, given that I don't see the heap spiking? I have tried the API below, which helps and assigns all the unassigned shards, but the problem reoccurs every few hours.

curl -XPOST ':9200/_cluster/reroute?retry_failed=true'
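
To see which shards are still unassigned and why after the reroute, something like this should work (unassigned.reason is a standard _cat/shards column; localhost assumed as in the earlier calls):

curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED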

Solution

Which Elasticsearch version are you using? 7.9.1? Version 7.10.1 has better retry of failed replication due to CircuitBreakingException and better indexing pressure handling.
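
If it is not obvious from the deployment, the running version can be checked on the cluster itself; the root endpoint returns version.number (localhost assumed as in the question):

curl -s 'http://localhost:9200' | python -m json.tool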

I suggest you try upgrading your cluster. Version 7.10.1 seems to have fixed this problem for me. See more: Help with unassigned shards / CircuitBreakingException / Values less than -1 bytes are not supported
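
Until the upgrade is done, it may also help to keep an eye on the parent circuit breaker that is tripping; the breaker section of the node stats shows its limit and current estimate per node (localhost assumed as above):

curl -s 'http://localhost:9200/_nodes/stats/breaker' | python -m json.tool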
