Apache Ignite on Kubernetes not joining the cluster

Problem description

I am trying to set up a simple two-node Ignite cluster on Kubernetes. The same configuration works fine when run directly on VMs.

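Basically, I have two Pods. Each is a microservice written with Vert.x that runs Ignite as an embedded node. pod1 exposes port 9090 through service1, and pod2 exposes port 9092 through service2.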

Both Pods expose the Ignite ports 47100 (communication) and 47500 (discovery) through ignite-service, and both Pods use KubernetesIPFinder.

pod1 --> service1(9090,10900)   |
                                 | --> ignite-service (47100/TCP,47500/TCP)
pod2 --> service2(9092,10900)   |               ^
                                                 |
    KubernetesIPFinder----------------------------
            ns = ignite-ns
            svc = ignite-service
        ServiceAccount (ignite-account)

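When both Pods start, I can see discovery taking place, but the second Pod always hangs at the log output below. I am not sure whether this is caused by the way I configured the k8s objects or by some resource contention in k8s.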

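If I change the configuration so that the Pods use the thin client instead, everything works: the Pods start and expose the REST endpoints of the Vert.x application.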

[INFO ] 2020-08-26 16:09:08.969 [main] IgniteKernal%aztecCommunityUserIgnite - VM arguments: [-Xms1g,-Xmx1g,-XX:MaxGCPauseMillis=500,-XX:GCPauseIntervalMillis=30000,-XX:InitiatingHeapOccupancyPercent=60,-XX:G1ReservePercent=30,-XX:+HeapDumpOnOutOfMemoryError,-XX:+DisableExplicitGC,-Djava.net.preferIPv4Stack=true,-XX:+UseG1GC,-Xlog:gc*,safepoint,age*,ergo*:file=/app/aztec/logs/gc-%p-%t.log:tags,uptime,time,level:filecount=10,filesize=50m,-DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true,-DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=300000,-Dlog4j.configurationFile=file:///app/aztec/communityuser_service/conf/log4j2.xml,-DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true,-DIGNITE_NO_SHUTDOWN_HOOK=true,-DIGNITE_WAL_MMAP=false]
[INFO ] 2020-08-26 16:09:08.970 [main] IgniteKernal%aztecCommunityUserIgnite - System cache's DataRegion size is configured to 40 MB. Use DataStorageConfiguration.systemRegionInitialSize property to change the setting.
[INFO ] 2020-08-26 16:09:08.970 [main] IgniteKernal%aztecCommunityUserIgnite - Configured caches [in 'sysMemPlc' dataRegion: ['ignite-sys-cache']]
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor - Configured plugins:
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor -   ^-- None
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor -
[INFO ] 2020-08-26 16:09:09.059 [main] FailureProcessor - Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false,timeout=0,super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[WARN ] 2020-08-26 16:09:09.278 [main] TcpCommunicationSpi - Failure detection timeout will be ignored (one of SPI parameters has been set explicitly)
[INFO ] 2020-08-26 16:09:09.299 [main] TcpCommunicationSpi - Successfully bound communication NIO server to TCP port [port=47100,locHost=0.0.0.0/0.0.0.0,selectorsCnt=4,selectorSpins=0,pairedConn=false]
[WARN ] 2020-08-26 16:09:09.302 [main] TcpCommunicationSpi - Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[WARN ] 2020-08-26 16:09:09.312 [main] NoopCheckpointSpi - Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[WARN ] 2020-08-26 16:09:09.337 [main] GridCollisionManager - Collision resolution is disabled (all jobs will be activated upon arrival).
[INFO ] 2020-08-26 16:09:09.341 [main] IgniteKernal%aztecCommunityUserIgnite - Security status [authentication=off,tls/ssl=off]
[INFO ] 2020-08-26 16:09:09.392 [main] TcpDiscoverySpi - Successfully bound to TCP port [port=47500,localHost=0.0.0.0/0.0.0.0,locNodeId=11e43ce8-b846-41ac-b688-9c6c34aebcf9]
[INFO ] 2020-08-26 16:09:09.421 [main] PdsFoldersResolver - Successfully created new persistent storage folder [/app/aztec/data/ignite/db/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c]
[INFO ] 2020-08-26 16:09:09.422 [main] PdsFoldersResolver - Consistent ID used for local node is [6cd407c6-0c86-4e57-9803-ab56bec5b16c] according to persistence data storage folders
[INFO ] 2020-08-26 16:09:09.423 [main] CacheObjectBinaryProcessorImpl - Resolved directory for serialized binary metadata: /app/aztec/data/ignite/binary_meta/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.637 [main] FilePageStoreManager - Resolved page store work directory: /app/aztec/data/ignite/db/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.637 [main] FileWriteAheadLogManager - Resolved write ahead log work directory: /app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.638 [main] FileWriteAheadLogManager - Resolved write ahead log archive directory: /app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.951 [main] FileHandleManagerImpl - Initialized write-ahead log manager [mode=BACKGROUND]
[WARN ] 2020-08-26 16:09:09.954 [main] GridCacheDatabaseSharedManager - DataRegionConfiguration.maxWalArchiveSize instead DataRegionConfiguration.walHistorySize would be used for removing old archive wal files
[INFO ] 2020-08-26 16:09:09.975 [main] GridCacheDatabaseSharedManager - Configured data regions initialized successfully [total=4]
[INFO ] 2020-08-26 16:09:09.993 [main] PartitionsEvictManager - Evict partition permits=2
[WARN ] 2020-08-26 16:09:10.029 [main] IgniteH2Indexing - Serialization of Java objects in H2 was enabled.
[INFO ] 2020-08-26 16:09:10.251 [main] ClientListenerProcessor - Client connector processor has started on TCP port 10900
[INFO ] 2020-08-26 16:09:10.324 [main] GridTcpRestProtocol - Command protocol successfully started [name=TCP binary,host=0.0.0.0/0.0.0.0,port=11211]
[INFO ] 2020-08-26 16:09:10.374 [main] IgniteKernal%aztecCommunityUserIgnite - Non-loopback local IPs: 172.17.239.163
[INFO ] 2020-08-26 16:09:10.375 [main] IgniteKernal%aztecCommunityUserIgnite - Enabled local MACs: 2255F14C9361
[INFO ] 2020-08-26 16:09:10.381 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null,endMarker=null]
[INFO ] 2020-08-26 16:09:10.388 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB,pages=24814,tableSize=1.9 MiB,checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.391 [main] GridCacheDatabaseSharedManager - Checking memory state [lastValidPos=FileWALPointer [idx=0,fileOff=0,len=0],lastMarked=FileWALPointer [idx=0,lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.428 [main] GridCacheDatabaseSharedManager - Applying lost cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=0,lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheDatabaseSharedManager - Finished applying WAL changes [updatesApplied=0,time=0 ms]
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheProcessor - Restoring partition state for local groups.
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheProcessor - Finished restoring partition state for local groups [groupsProcessed=0,partitionsProcessed=0,time=0ms]
[INFO ] 2020-08-26 16:09:10.483 [main] FilePageStoreManager - Cleanup cache stores [total=1,left=0,cleanFiles=false]
[INFO ] 2020-08-26 16:09:10.491 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB,checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.492 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB,checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.493 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB,checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.502 [main] GridCacheDatabaseSharedManager - Configured data regions started successfully [total=4]
[INFO ] 2020-08-26 16:09:10.503 [main] GridCacheDatabaseSharedManager - Starting binary memory restore for: [-2100569601]
[INFO ] 2020-08-26 16:09:10.518 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null,endMarker=null]
[INFO ] 2020-08-26 16:09:10.518 [main] GridCacheDatabaseSharedManager - Checking memory state [lastValidPos=FileWALPointer [idx=0,lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.522 [main] FileWriteAheadLogManager - Resuming logging to WAL segment [file=/app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c/0000000000000000.wal,offset=0,ver=2]
[INFO ] 2020-08-26 16:09:10.684 [main] GridCacheProcessor - Started cache in recovery mode [name=ignite-sys-cache,id=-2100569601,dataRegionName=sysMemPlc,mode=REPLICATED,atomicity=TRANSACTIONAL,backups=2147483647,mvcc=false]
[INFO ] 2020-08-26 16:09:10.689 [main] GridCacheDatabaseSharedManager - Binary recovery performed in 186 ms.
[INFO ] 2020-08-26 16:09:10.690 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null,endMarker=null]
[INFO ] 2020-08-26 16:09:10.690 [main] GridCacheDatabaseSharedManager - Applying lost cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=0,lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.692 [main] GridCacheDatabaseSharedManager - Finished applying WAL changes [updatesApplied=0,time=0 ms]
[INFO ] 2020-08-26 16:09:10.692 [main] GridCacheProcessor - Restoring partition state for local groups.
[INFO ] 2020-08-26 16:09:10.703 [main] GridCacheProcessor - Finished restoring partition state for local groups [groupsProcessed=1,time=10ms]
[INFO ] 2020-08-26 16:09:10.738 [main] TcpDiscoverySpi - Connection check threshold is calculated: 300000

[INFO ] 2020-08-26 16:11:18.387 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/172.17.239.64,rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.395 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/172.17.239.64,rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.396 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/172.17.239.64:34837,rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.399 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Received ping request from the remote node [rmtNodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d,rmtAddr=/172.17.239.64:34837,rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.400 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Finished writing ping response [rmtNodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d,rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.400 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/172.17.239.64:34837,rmtPort=34837]
[INFO ] 2020-08-26 16:13:25.749 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/172.17.239.64,rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.749 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/172.17.239.64,rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.750 [tcp-disco-sock-reader-[]-#5%aztecCommunityUserIgnite%] TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/172.17.239.64:36858,rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.752 [tcp-disco-sock-reader-[f4df02cf 172.17.239.64:36858]-#5%aztecCommunityUserIgnite%] TcpDiscoverySpi - Initialized connection with remote server node [nodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d,rmtAddr=/172.17.239.64:36858]
[INFO ] 2020-08-26 16:13:25.772 [tcp-disco-msg-worker-[]-#2%aztecCommunityUserIgnite%] TcpDiscoverySpi - New next node [newNext=TcpDiscoveryNode [id=f4df02cf-0700-4f31-93b0-9073c9394d2d,consistentId=b003163e-ef90-450a-885c-6d7e9b0cbef4,addrs=ArrayList [127.0.0.1,172.17.193.243],sockAddrs=HashSet [sit-aztec-authentication-service/192.168.164.225:47500,/127.0.0.1:47500,/172.17.193.243:47500],discPort=47500,order=1,intOrder=1,lastExchangeTime=1598458405757,loc=false,ver=2.8.1#20200521-sha1:86422096,isClient=false]]


Update:

IgniteConfiguration:

[INFO ] 2020-08-26 16:25:03.364 [main] IgniteKernal%aztecAuthIgnite - IgniteConfiguration [igniteInstanceName=aztecAuthIgnite,pubPoolSize=8,svcPoolSize=8,callbackPoo
lSize=8,stripedPoolSize=8,sysPoolSize=8,mgmtPoolSize=4,igfsPoolSize=1,dataStreamerPoolSize=8,utilityCachePoolSize=8,utilityCacheKeepAliveTime=60000,p2pPoolSize=
2,qryPoolSize=8,sqlQryHistSize=1000,dfltQryTimeout=0,igniteHome=null,igniteworkdir=/app/aztec/data/ignite,mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@d554c5f,nodeId=3b17a57c-6ee6-4225-bc50-a762f6ec50af,marsh=BinaryMarshaller [],marshLocJobs=false,daemon=false,p2pEnabled=false,netTimeout=150000,netCompressionLevel=1,s
ndRetryDelay=1000,sndRetryCnt=3,metricsHistSize=10000,metricsUpdateFreq=2000,metricsExpTime=9223372036854775807,discoSpi=TcpdiscoverySpi [addrRslvr=null,sockTimeo
ut=0,ackTimeout=0,marsh=null,reconCnt=10,reconDelay=2000,maxAckTimeout=600000,soLinger=5,forceSrvMode=false,clientReconnectdisabled=false,internalLsnr=null,sk
ipAddrsRandomization=false],segPlc=STOP,segResolveAttempts=2,waitForSegOnStart=true,allResolversPassReq=true,segChkFreq=10000,commSpi=TcpCommunicationSpi [connect
Gate=null,connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@60c38c44,chConnPlc=null,enableForcibleNodeKill=false,enableTroub
leshootingLog=false,locAddr=null,locHost=null,locPort=47100,locPortRange=100,shmemPort=-1,directBuf=true,directSndBuf=false,idleConnTimeout=600000,connTimeout=
5000,maxConnTimeout=600000,sockSndBuf=32768,sockRcvBuf=32768,msgQueueLimit=0,slowClientQueueLimit=0,nioSrvr=null,shmemSrv=null,usePairedConnections
=false,connectionsPerNode=1,tcpNoDelay=true,filterReachableAddresses=false,ackSndThreshold=32,unackedMsgsBufSize=0,sockWriteTimeout=2000,boundTcpPort=-1,boundTc
pShmemPort=-1,addrRslvr=null,ctxInitLatch=java.util.concurrent.CountDownLatch@1ee2a1e2[Count = 1],stopping=false,metricslsnr=null],evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@59ae2de7,colSpi=NoopCollisionSpi [],deploySpi=LocalDeploymentSpi [],indexingSpi=org.apache.ignite.spi.
indexing.noop.NoopIndexingSpi@38bb9fad,encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@11620476,clientMode=false,rebalanceThrea
dPoolSize=4,rebalanceTimeout=10000,rebalanceBatchesPrefetchCnt=3,rebalanceThrottle=0,rebalanceBatchSize=524288,txCfg=TransactionConfiguration [txSerEnabled=false,dfltIsolation=REPEATABLE_READ,dfltConcurrency=pessimistic,dfltTxTimeout=0,txTimeoutOnPartitionMapExchange=0,deadlockTimeout=10000,pessimisticTxLogSize=0,pessimist
icTxLogLinger=10000,tmLookupClsName=null,txManagerFactory=null,useJtaSync=false],cacheSanityCheckEnabled=true,discoStartupDelay=60000,deployMode=SHARED,p2pMissed
CacheSize=100,timeSrvPortBase=31100,timeSrvPortRange=100,failureDetectionTimeout=300000,sysWorkerBlockedTimeout=null,clientFailureDetectionTimeout=30
000,metricslogFreq=60000,hadoopCfg=null,connectorCfg=ConnectorConfiguration [jettyPath=null,host=null,port=11211,noDelay=true,directBuf=false,sndBufSize=32768,rcvBufSize=32768,idleQryCurTimeout=600000,idleQryCurCheckFreq=60000,sndQueueLimit=0,selectorCnt=1,idleTimeout=7000,sslEnabled=false,sslClientAuth=false,sslCtxFa
ctory=null,sslFactory=null,portRange=100,threadPoolSize=8,msginterceptor=null],odbcCfg=null,warmupClos=null,atomicCfg=AtomicConfiguration [seqReserveSize=1000,c
acheMode=PARTITIONED,backups=1,aff=null,grpname=null],classLdr=null,sslCtxFactory=null,platformCfg=null,binaryCfg=null,memCfg=null,pstCfg=null,dsCfg=DataStora
geConfiguration [sysRegionInitSize=41943040,sysRegionMaxSize=104857600,pageSize=4096,concLvl=0,dfltDataRegConf=DataRegionConfiguration [name=Default_Region,maxSize
=131072000,initSize=26214400,swapPath=null,pageevictionMode=disABLED,evictionThreshold=0.9,emptyPagesPoolSize=100,metricsEnabled=false,metricsSubIntervalCount=5,metricsRateTimeInterval=60000,persistenceEnabled=true,checkpointPageBufSize=0,lazyMemoryAllocation=true],dataRegions=null,storagePath=db,checkpointFreq=60000,lo
ckWaitTime=10000,checkpointThreads=4,checkpointWriteOrder=SEQUENTIAL,walHistSize=20,maxwalarchiveSize=250000000,walSegments=4,walSegmentSize=67108864,walPath=db/wal,walarchivePath=db/wal,walMode=BACKGROUND,walTlbSize=131072,walBuffSize=33554432,walFlushFreq=5000,walFsyncDelay=1000,walRecordIterBuffSize=67108864,alwaysWriteFullPages=false,fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@25a02442,metricsSubIntervalCnt=5,walautoArchiveAfterInactivity=-1,writeThrottlingEnabled=true,walCompactionEnabled=false,walCompactionLevel=1,checkpointReadLockTimeout=null,walPageCompression=disABLED,walPageCompressionLevel=null],activeOnStart=true,autoActivation=false,longQryWarnTimeout=3000,sqlConnCfg=null,cliConnCfg=ClientConnectorConfiguration [host=sit-aztec-authentication-service,port=10900,portRange=10,sockSndBufSize=0,sockRcvBufSize=0,maxOpenCursorsPerConn=64,idleTimeout=0,handshakeTimeout=10000,jdbcEnabled=true,odbcEnabled=true,thinCliEnabled=true,useIgniteSslCtxFactory=true,thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100]],mvccVacuumThreadCnt=2,mvccVacuumFreq=5000,authEnabled=false,failureHnd=null,commFailureRslvr=null]

Solution

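I figured out the problem here. Apparently it has to do with the way I configured the Service objects in Kubernetes. I am not sure whether this is a bug or a feature, but Ignite nodes can only scale within a node (microservice), not across nodes. What I mean is that the Service object should be unique to a node. If you share a Service object between nodes (microservices) and expect the cluster to span several of them, it will hang. (I am not sure whether this is an anti-pattern.) What works is to make the Service object unique to the node and then scale that node as needed.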

If that is the case, I think we should run the Ignite nodes as a separate cluster rather than embedding them in the microservices.


Depending on your Kubernetes deployment, you may have a readinessProbe defined under spec.template.spec.container.

That prevents the Pod from being registered in the Kubernetes Endpoints of the Service, so each embedded Ignite node starts as its own 1-node cluster :-/

Try it without the readinessProbe and see whether your Ignite nodes join the same cluster.

See Ignite ReadinessProbe.
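
When a node hangs right after discovery, as in the question, it can also help to confirm that the Ignite ports are reachable from one Pod to the other. Below is a minimal sketch of such a check using only the Java standard library. PortProbe and isReachable are hypothetical names and not part of Ignite; ports 47500 and 47100 are the discovery and communication ports shown in the logs above, and the hosts to probe are whatever peer Pod IPs or Service names you pass on the command line.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper (not an Ignite API): tests plain TCP reachability.
class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolvable host all count as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java PortProbe <host> [<host> ...]");
            return;
        }
        // Assumption: the default Ignite discovery and communication ports from the logs.
        int[] ports = {47500, 47100};
        for (String host : args) {
            for (int port : ports) {
                System.out.printf("%s:%d reachable=%b%n",
                        host, port, isReachable(host, port, 2000));
            }
        }
    }
}
```

Run it from inside one Pod against the other Pod's IP and against the Service name. A successful TCP connect only shows that the port is open; it does not prove that the Ignite node behind it has finished joining the topology.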