Why doesn't Curator recover when ZooKeeper comes back online?

Problem description

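I have a CuratorFramework client (v5.1.0) running against a ZooKeeper server (v3.7.0). If the ZooKeeper server goes down while the client is connected to it, I can see the connection state change (via a ConnectionStateListener) to SUSPENDED and then to LOST, and then nothing more when the server comes back online.

This feels like a very standard use case and I must be missing something silly, but I can never get the client to reconnect once the server is back online.

I have done some googling but found nothing useful about how to handle recovery after the LOST state.

I have a self-contained example of what I am doing in the CuratorRecoveryTest class (run from the IDE rather than Maven). The gist of it is (excerpted from the test class):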

// setup the server and client
server = new TestingServer();

client = newClient(server.getConnectString(),60000,15000,new RetryNTimes(1,250));
client.start();
client.blockUntilConnected();
            
// add the listener
final var stateListener = new StateListener();
stateListener.stateChanged(client,CONNECTED);

// register the listener
client.getConnectionStateListenable().addListener(stateListener);

// verify connection
assertTrue(client.getZookeeperClient().isConnected());

// let things settle
nap(3,"initial settling");

// stop zk
stopServer();
log.info(">>>>>>>>>> STOPPED ZK SERVER");

// let it bake
nap(3,"letting things bake");

// ensure disconnected
assertFalse(client.getZookeeperClient().isConnected());

nap(3,"disconnecting");

// start zk
server.start();
log.info(">>>>>>>>>> STARTED ZK SERVER");

await().atMost(5,MINUTES).until(() -> stateListener.getCurrentState() == CONNECTED || stateListener.getCurrentState() == RECONNECTED);

// NOTE: it never gets here - no state changes after LOST

assertTrue(client.getZookeeperClient().isConnected());

When I run it, I get the following output:

[Thread-0] INFO org.apache.curator.test.TestingZooKeeperMain - Starting server
[Thread-0] WARN org.apache.zookeeper.server.ServerCnxnFactory - maxCnxns is not configured,using default value 0.
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema
[main] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: null --> CONNECTED
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for initial settling...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for initial settling...
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: CONNECTED --> SUSPENDED
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STOPPED ZK SERVER
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for disconnecting...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for disconnecting...
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STARTED ZK SERVER
[Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 20009. Adjusted session timeout ms: 20000
[main-EventThread] WARN org.apache.curator.ConnectionState - Session expired event received
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: SUSPENDED --> LOST

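The test then fails because the awaited condition never occurs.

Note: this also happens with older version combinations of Curator and ZooKeeper, so it is not a "bleeding edge" problem.

What am I missing?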

Solution

I ran into a similar issue and concluded that Curator seems to reuse a stale IP address when the ZooKeeper server restarts.

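The approach outlined in this ticket worked for me. In particular, this commit adds a custom ZookeeperFactory that does not reuse the previous stale IP and instead uses the original, unresolved hostname.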
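To make the difference between a stale IP and an unresolved hostname concrete, here is a minimal sketch in plain Java. It is not Curator or ZooKeeper code. It only illustrates the general JDK behavior: an InetSocketAddress built from an already-resolved address keeps that IP forever, while one created with createUnresolved keeps only the hostname. The class name StaleAddressDemo, the host zk.example.internal and the IP 10.0.0.5 are made up for illustration.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class StaleAddressDemo {
    // Hypothetical host name and port, used only for illustration.
    static final String HOST = "zk.example.internal";
    static final int PORT = 2181;

    /** Address whose IP was fixed when it was built; it is never looked up again. */
    public static InetSocketAddress resolvedOnce() {
        try {
            // getByAddress does no DNS lookup; it stands in for an earlier
            // lookup that happened to return 10.0.0.5.
            InetAddress pinned = InetAddress.getByAddress(HOST, new byte[] {10, 0, 0, 5});
            return new InetSocketAddress(pinned, PORT);
        } catch (UnknownHostException e) {
            // Only thrown for byte arrays of illegal length, which cannot happen here.
            throw new IllegalStateException(e);
        }
    }

    /** Address that carries only the host name, so resolution is left to the caller. */
    public static InetSocketAddress unresolved() {
        return InetSocketAddress.createUnresolved(HOST, PORT);
    }

    public static void main(String[] args) {
        InetSocketAddress pinned = resolvedOnce();
        InetSocketAddress lazy = unresolved();
        // prints "false 10.0.0.5"
        System.out.println(pinned.isUnresolved() + " " + pinned.getAddress().getHostAddress());
        // prints "true zk.example.internal"
        System.out.println(lazy.isUnresolved() + " " + lazy.getHostString());
    }
}
```

If the server comes back under a different IP, anything still holding the first kind of address keeps dialing 10.0.0.5, which matches the symptom described here.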

In short, when creating the Curator client, assign a custom ZookeeperFactory:

CuratorFramework zkClient = CuratorFrameworkFactory
    .builder()
...
    .zookeeperFactory(new ZKClientFactory())

ZKClientFactory creates a new ZooKeeper from the cached connectString.
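The commit itself is not reproduced here, so the following is only a rough sketch of what such a factory might look like, not the code from the ticket. It assumes the connect string Curator passes in on the first call contains host names rather than IPs, remembers it, and builds every later ZooKeeper handle from that remembered string. The field name and the getCachedConnectString accessor are my own additions for illustration. ZookeeperFactory.newZooKeeper and the four-argument ZooKeeper constructor are the existing Curator and ZooKeeper APIs.

```java
import org.apache.curator.utils.ZookeeperFactory;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch only: builds every ZooKeeper handle from the first connect string
 * it was given, so a restarted server is looked up by host name again.
 */
public class ZKClientFactory implements ZookeeperFactory {

    // Assumption: the first connect string holds unresolved host names.
    private String cachedConnectString;

    @Override
    public synchronized ZooKeeper newZooKeeper(String connectString,
                                               int sessionTimeout,
                                               Watcher watcher,
                                               boolean canBeReadOnly) throws Exception {
        if (cachedConnectString == null) {
            // Remember the original string on the first call.
            cachedConnectString = connectString;
        }
        // Always create the handle from the remembered string,
        // ignoring whatever is passed in on later calls.
        return new ZooKeeper(cachedConnectString, sessionTimeout, watcher, canBeReadOnly);
    }

    /** Hypothetical accessor, added here only so the behavior can be inspected. */
    public synchronized String getCachedConnectString() {
        return cachedConnectString;
    }
}
```

Because the sketch has a no-argument constructor, it plugs into the builder call above as written: .zookeeperFactory(new ZKClientFactory()).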
