Problem description
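I have a CuratorFramework client (v5.1.0) running against a Zookeeper server (v3.7.0). If the Zookeeper server shuts down while the client is connected to it, I can see the connection state (via a ConnectionStateListener) go to SUSPENDED, then LOST, and then nothing further happens when the server comes back online.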
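This feels like a very standard use case, so I must be missing something silly, but I can never get the client to reconnect after the server comes back online.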
I have done some googling, but found no useful information on how to handle recovery after the LOST state.
I have a self-contained example of what I'm doing; the sample code is in the CuratorRecoveryTest class (run from the IDE rather than Maven). The gist of it (excerpted from the test class):
// setup the server and client
server = new TestingServer();
client = newClient(server.getConnectString(), 60000, 15000, new RetryNTimes(1, 250));
client.start();
client.blockUntilConnected();
// add the listener
final var stateListener = new StateListener();
stateListener.stateChanged(client, CONNECTED);
// register the listener
client.getConnectionStateListenable().addListener(stateListener);
// verify connection
assertTrue(client.getZookeeperClient().isConnected());
// let things settle
nap(3, "initial settling");
// stop zk
stopServer();
log.info(">>>>>>>>>> STOPPED ZK SERVER");
// let it bake
nap(3, "letting things bake");
// ensure disconnected
assertFalse(client.getZookeeperClient().isConnected());
nap(3, "disconnecting");
// start zk
server.start();
log.info(">>>>>>>>>> STARTED ZK SERVER");
await().atMost(5, MINUTES).until(() -> stateListener.getCurrentState() == CONNECTED || stateListener.getCurrentState() == RECONNECTED);
// NOTE: it never gets here - no state changes after LOST
assertTrue(client.getZookeeperClient().isConnected());
When I run it, I get the following output:
[Thread-0] INFO org.apache.curator.test.TestingZooKeeperMain - Starting server
[Thread-0] WARN org.apache.zookeeper.server.ServerCnxnFactory - maxCnxns is not configured, using default value 0.
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
[main] INFO org.apache.curator.framework.imps.CuratorFrameworkImpl - Default schema
[main] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: null --> CONNECTED
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for initial settling...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for initial settling...
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: CONNECTED --> SUSPENDED
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STOPPED ZK SERVER
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for letting things bake...
[main] DEBUG demo.CuratorRecoveryTest - Taking a 3s nap for disconnecting...
[main] DEBUG demo.CuratorRecoveryTest - Done napping for disconnecting...
[main] INFO demo.CuratorRecoveryTest - >>>>>>>>>> STARTED ZK SERVER
[Curator-ConnectionStateManager-0] WARN org.apache.curator.framework.state.ConnectionStateManager - Session timeout has elapsed while SUSPENDED. Injecting a session expiration. Elapsed ms: 20009. Adjusted session timeout ms: 20000
[main-EventThread] WARN org.apache.curator.ConnectionState - Session expired event received
[Curator-ConnectionStateManager-0] WARN demo.CuratorRecoveryTest - CONNECTION-STATE-CHANGE: SUSPENDED --> LOST
The test then fails because the awaited condition never occurs.
Note: this also happens with older version combinations of Curator and Zookeeper, so it is not a "bleeding edge" issue.
What am I missing?
Solution
I ran into a similar problem and concluded that Curator appears to reuse a stale IP address when the Zookeeper server restarts.
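The approach outlined in this ticket worked for me. In particular, this commit adds a custom ZookeeperFactory that does not reuse the previous stale IP address and instead uses the original unresolved hostname.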
In short, assign a custom ZookeeperFactory when creating the Curator client:
CuratorFramework zkClient = CuratorFrameworkFactory
    .builder()
    ...
    .zookeeperFactory(new ZKClientFactory())
This ZKClientFactory creates a new Zookeeper instance from the cached connectString.
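To illustrate what "unresolved hostname" means at the JDK level, here is a minimal sketch using only java.net. It is not the code from the commit mentioned above, and it does not touch Curator or Zookeeper; the class name UnresolvedAddressDemo and its two helper methods are hypothetical names chosen for this illustration. The idea is to store only the host name and port, and to resolve them again each time a connection is about to be made, rather than holding on to an address that was resolved once at startup. Resolution is still subject to the JVM's own DNS cache.

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressDemo {

    // Keep only the host name and port; no DNS lookup happens here.
    static InetSocketAddress unresolved(String host, int port) {
        return InetSocketAddress.createUnresolved(host, port);
    }

    // Resolve at call time instead of reusing an address resolved earlier.
    static InetSocketAddress resolveNow(InetSocketAddress address) {
        return new InetSocketAddress(address.getHostString(), address.getPort());
    }

    public static void main(String[] args) {
        // "localhost" and port 2181 are placeholder values for this sketch.
        InetSocketAddress stored = unresolved("localhost", 2181);
        System.out.println("stored is unresolved: " + stored.isUnresolved());

        InetSocketAddress fresh = resolveNow(stored);
        System.out.println("fresh is unresolved: " + fresh.isUnresolved());
    }
}
```

A factory that always starts from the original connect string follows the same principle: the host names are looked up again when a new Zookeeper handle is created, so a server that came back under a different IP address can still be reached.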