A replica set is MongoDB's built-in high-availability solution. Unlike the legacy master-slave replication, a replica set automatically detects that the Primary has gone down and promotes one of the Secondaries to Primary.
The whole process is transparent to the application and greatly reduces operational overhead.
The architecture diagram is as follows:
Roles in a MongoDB replica set
1. Primary
By default, all reads and writes go to the Primary.
2. Secondary
Replays all operations from the Primary via the oplog and holds a complete copy of the Primary's data.
By default, it accepts neither writes nor reads.
Depending on requirements, a Secondary can additionally be configured in the following forms:
1> Priority 0 Replica Set Members
A member with priority 0 can never be elected primary.
A MongoDB replica set allows different priorities to be assigned to different members.
Priority ranges from 0 to 1000, may be a floating-point value, and defaults to 1.
The member with the highest priority is preferred in elections for primary.
For example, suppose a member node3:27020 with priority 2 is added to a replica set in which every other member has priority 1. As long as node3:27020 has the freshest data, the current primary will automatically step down and node3:27020 will be elected the new primary; if node3:27020's data is not fresh enough, the current primary stays in place until node3:27020 has caught up.
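As a sketch, a member's priority can be changed with the standard rs.reconfig() helper; the member index members[1] below is an assumption and should be checked against rs.conf() first:

```javascript
// Run on the primary: raise the priority of member 1 to 2.
// NOTE: the index [1] is hypothetical -- inspect rs.conf() to find
// the member you actually want to change.
cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)  // may trigger an election if it changes who is eligible
```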
2> Hidden Replica Set Members (hidden members)
A hidden member also has priority 0, and is additionally invisible to clients.
Hidden members show up in rs.status() and rs.config(), but not in db.isMaster(). Since clients connecting to a replica set call db.isMaster() to discover the available members,
hidden members never receive read requests from clients.
Hidden members are typically used for dedicated tasks such as reporting and backups.
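A hidden member can be configured the same way; a minimal sketch, again assuming the member index:

```javascript
// Run on the primary: hide member 2.
// A hidden member must also have priority 0.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
rs.reconfig(cfg)
```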
3> Delayed Replica Set Members (delayed members)
A delayed member lags behind the primary by a configurable amount of time (set via the slaveDelay option).
A delayed member must also be hidden.
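A one-hour delayed member could be configured as follows (a sketch; the member index and the 3600-second delay are illustrative):

```javascript
// slaveDelay is in seconds; a delayed member must be hidden
// and have priority 0.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].slaveDelay = 3600
rs.reconfig(cfg)
```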
3. Arbiter
An arbiter only takes part in voting, always with a voting weight of exactly 1; it replicates no data and can never be promoted to primary.
Arbiters are commonly used in replica sets with an even number of members.
Recommendation: deploy the arbiter on an application server, never on the same server as a Primary or Secondary.
Note: a replica set can have at most 50 members, of which at most 7 may vote.
Building a MongoDB replica set
Create the data directories
# mkdir -p /data/27017
# mkdir -p /data/27018
# mkdir -p /data/27019
To make the runtime logs easier to follow, create a separate log file for each instance.
# mkdir -p /var/log/mongodb/
Start the mongod instances
# mongod --replSet myapp --dbpath /data/27017 --port 27017 --logpath /var/log/mongodb/27017.log --fork
# mongod --replSet myapp --dbpath /data/27018 --port 27018 --logpath /var/log/mongodb/27018.log --fork
# mongod --replSet myapp --dbpath /data/27019 --port 27019 --logpath /var/log/mongodb/27019.log --fork
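The same options can also be kept in a YAML config file and passed with mongod -f; the file path below is hypothetical:

```yaml
# /etc/mongod-27017.conf (hypothetical path)
replication:
  replSetName: myapp
storage:
  dbPath: /data/27017
net:
  port: 27017
systemLog:
  destination: file
  path: /var/log/mongodb/27017.log
processManagement:
  fork: true
```

Then start the instance with `mongod -f /etc/mongod-27017.conf`.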
Taking the instance on port 27017 as an example, its log output looks like this:
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] MongoDB starting : pid=2739 port=27017 dbpath=/data/27017 64-bit host=node3
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] db version v3.4.2
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] git version: 3f76e40c105fc223b3e5aac3e20dcd026b83b38b
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] options: { net: { port: 27017 }, processManagement: { fork: true }, replication: { replSet: "myapp" }, storage: { dbPath: "/data/27017" }, systemLog: { destination: "file", path: "/var/log/mongodb/27017.log" } }
2017-05-02T14:05:22.768+0800 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2017-05-02T14:05:22.768+0800 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2017-05-02T14:05:22.769+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=256M,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2017-05-02T14:05:24.516+0800 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/27017/diagnostic.data'
2017-05-02T14:05:24.517+0800 I REPL     [initandlisten] Did not find local voted for document at startup.
2017-05-02T14:05:24.518+0800 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2017-05-02T14:05:24.519+0800 I NETWORK  [thread1] waiting for connections on port 27017
Connect to any member of the replica set with mongo; here, the instance on port 27017:
# mongo
Initialize the replica set
> rs.initiate()
{
    "info2" : "no configuration specified. Using a default configuration for the set",
    "me" : "node3:27017",
    "ok" : 1
}
The current replica set configuration can be viewed with rs.conf():
myapp:PRIMARY> rs.conf()
{
    "_id" : "myapp",
    "version" : 1,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "node3:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "catchUpTimeoutMillis" : 2000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
        "replicaSetId" : ObjectId("59082229517dd35bb9fd0d2a")
    }
}
The options under settings have the following meanings:
chainingAllowed: whether chained (cascading) replication is allowed, i.e. whether a secondary may sync from another secondary
heartbeatIntervalMillis: the heartbeat interval, 2s by default
heartbeatTimeoutSecs: the heartbeat timeout, 10s by default; if no heartbeat is received from a member within 10s, that member is judged unreachable (HostUnreachable). This applies to the Primary and Secondaries alike.
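These settings can be adjusted with rs.reconfig(); the values below are purely illustrative, not recommendations:

```javascript
// Run on the primary: shorten failure detection (illustrative values).
cfg = rs.conf()
cfg.settings.heartbeatTimeoutSecs = 5
cfg.settings.electionTimeoutMillis = 5000
rs.reconfig(cfg)
```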
The corresponding log output is as follows:
# vim /var/log/mongodb/27017.log
2017-05-02T14:06:47.361+0800 I NETWORK  [thread1] connection accepted from 127.0.0.1:32824 #1 (1 connection now open)
2017-05-02T14:07:36.737+0800 I COMMAND  [conn1] initiate : no configuration specified. Using a default configuration for the set
2017-05-02T14:07:36.887+0800 I COMMAND  [conn1] created this configuration for initiation : { _id: "myapp", version: 1, members: [ { _id: 0, host: "node3:27017" } ] }
2017-05-02T14:07:36.900+0800 I REPL     [conn1] replSetInitiate admin command received from client
2017-05-02T14:07:37.391+0800 I REPL     [conn1] replSetInitiate config object with 1 members parses ok
2017-05-02T14:07:37.410+0800 I REPL     [conn1] ******
2017-05-02T14:07:37.410+0800 I REPL     [conn1] creating replication oplog of size: 990MB...
2017-05-02T14:07:37.439+0800 I STORAGE  [conn1] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T14:07:37.440+0800 I STORAGE  [conn1] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2017-05-02T14:07:37.440+0800 I STORAGE  [conn1] Scanning the oplog to determine where to place markers for truncation
2017-05-02T14:07:37.568+0800 I INDEX    [conn1] build index on: admin.system.version properties: { v: 2, key: { version: 1 }, name: "incompatible_with_version_32", ns: "admin.system.version" }
2017-05-02T14:07:37.568+0800 I INDEX    [conn1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-05-02T14:07:37.581+0800 I INDEX    [conn1] build index done.  scanned 0 total records. 0 secs
2017-05-02T14:07:37.591+0800 I COMMAND  [conn1] setting featureCompatibilityVersion to 3.4
2017-05-02T14:07:37.601+0800 I REPL     [conn1] New replica set config in use: { _id: "myapp", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "node3:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 2000, replicaSetId: ObjectId('59082229517dd35bb9fd0d2a') } }
2017-05-02T14:07:37.601+0800 I REPL     [conn1] This node is node3:27017 in the config
2017-05-02T14:07:37.601+0800 I REPL     [conn1] transition to STARTUP2
2017-05-02T14:07:37.602+0800 I REPL     [conn1] Starting replication storage threads
2017-05-02T14:07:37.603+0800 I REPL     [conn1] Starting replication fetcher thread
2017-05-02T14:07:37.617+0800 I REPL     [conn1] Starting replication applier thread
2017-05-02T14:07:37.617+0800 I REPL     [conn1] Starting replication reporter thread
2017-05-02T14:07:37.617+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T14:07:37.628+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T14:07:37.646+0800 I REPL     [rsSync] conducting a dry run election to see if we could be elected
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running election
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 1
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-05-02T14:07:38.687+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
Add a member
myapp:PRIMARY> rs.add("node3:27018")
{ "ok" : 1 }
The log of the 27017 instance shows:
2017-05-02T15:54:44.765+0800 I REPL     [conn1] replSetReconfig admin command received from client
2017-05-02T15:54:44.808+0800 I REPL     [conn1] replSetReconfig config object with 2 members parses ok
2017-05-02T15:54:44.928+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to node3:27018
2017-05-02T15:54:44.979+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to node3:27018
2017-05-02T15:54:44.994+0800 I NETWORK  [thread1] connection accepted from 192.168.244.30:38291 #3 (3 connections now open)
2017-05-02T15:54:45.105+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "myapp", version: 2, ... }
2017-05-02T15:54:45.105+0800 I REPL     [ReplicationExecutor] This node is node3:27017 in the config
2017-05-02T15:54:45.155+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state STARTUP
2017-05-02T15:54:49.159+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state SECONDARY
The log of the 27018 instance shows:
2017-05-02T15:54:47.101+0800 I REPL     [replExecDBWorker-0] Starting replication storage threads
2017-05-02T15:54:47.174+0800 I REPL     [replication-0] Starting initial sync (attempt 1 of 10)
2017-05-02T15:54:47.174+0800 I REPL     [ReplicationExecutor] transition to STARTUP2
2017-05-02T15:54:47.175+0800 I REPL     [ReplicationExecutor] Member node3:27017 is now in state PRIMARY
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] sync source candidate: node3:27017
2017-05-02T15:54:47.217+0800 I STORAGE  [replication-0] dropAllDatabasesExceptLocal 1
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] ******
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] creating replication oplog of size: 990MB...
2017-05-02T15:54:47.232+0800 I STORAGE  [replication-0] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T15:54:48.046+0800 I REPL     [replication-0] CollectionCloner::start called, on ns:admin.system.version
2017-05-02T15:54:48.177+0800 I COMMAND  [InitialSyncInserters-admin.system.version0] setting featureCompatibilityVersion to 3.4
2017-05-02T15:54:48.264+0800 I INDEX    [InitialSyncInserters-test.blog0] build index on: test.blog properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "test.blog" }
2017-05-02T15:54:48.271+0800 I REPL     [replication-1] No need to apply operations. (currently at { : Timestamp 1493711685000|1 })
2017-05-02T15:54:48.352+0800 I REPL     [replication-1] initial sync done; took 1s.
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication fetcher thread
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication applier thread
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication reporter thread
2017-05-02T15:54:48.366+0800 I REPL     [rsBackgroundSync] could not find member to sync from
2017-05-02T15:54:48.367+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T15:54:48.367+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T15:55:03.392+0800 I REPL     [rsBackgroundSync] sync source candidate: node3:27017
Add an arbiter
myapp:PRIMARY> rs.addArb("node3:27019")
{ "ok" : 1 }
The log of the 27017 instance shows:
2017-05-02T16:06:59.098+0800 I REPL     [conn1] replSetReconfig admin command received from client
2017-05-02T16:06:59.116+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to node3:27019
2017-05-02T16:06:59.131+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to node3:27019
2017-05-02T16:06:59.137+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "myapp", version: 3, ... }
2017-05-02T16:06:59.137+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state STARTUP
2017-05-02T16:07:01.132+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state ARBITER
The log of the 27019 instance shows:
2017-05-02T16:06:59.295+0800 I REPL     [ReplicationExecutor] transition to ARBITER
Check the state of the replica set
myapp:PRIMARY> rs.status()
{
    "set" : "myapp",
    "date" : ISODate("2017-05-02T08:10:59.174Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
        "appliedOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "node3:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 7537,
            "optime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2017-05-02T08:10:49Z"),
            "electionTime" : Timestamp(1493705257, 1),
            "electionDate" : ISODate("2017-05-02T06:07:37Z"),
            "configVersion" : 3,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "node3:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 974,
            "optime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDurable" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2017-05-02T08:10:49Z"),
            "optimeDurableDate" : ISODate("2017-05-02T08:10:49Z"),
            "lastHeartbeat" : ISODate("2017-05-02T08:10:57.606Z"),
            "lastHeartbeatRecv" : ISODate("2017-05-02T08:10:58.224Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "node3:27017",
            "configVersion" : 3
        },
        {
            "_id" : 2,
            "name" : "node3:27019",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 240,
            "lastHeartbeat" : ISODate("2017-05-02T08:10:57.607Z"),
            "lastHeartbeatRecv" : ISODate("2017-05-02T08:10:54.391Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 3
        }
    ],
    "ok" : 1
}
A replica set can also be created in one step by passing a configuration document:
> cfg = {
...     "_id" : "myapp",
...     "members" : [
...         { "_id" : 0, "host" : "node3:27017" },
...         { "_id" : 1, "host" : "node3:27018" }
...     ]
... }
> rs.initiate(cfg)
Verify that the replica set works
Create a collection on the primary and insert a document as a test:
# mongo
myapp:PRIMARY> show dbs
admin  0.000GB
local  0.000GB
myapp:PRIMARY> use test
switched to db test
myapp:PRIMARY> db.blog.insert({"title" : "My Blog Post"})
WriteResult({ "nInserted" : 1 })
myapp:PRIMARY> db.blog.find()
{ "_id" : ObjectId("59082731008c534e0763e90a"), "title" : "My Blog Post" }
myapp:PRIMARY> quit()
Verify on the secondary:
# mongo --port 27018
myapp:SECONDARY> use test
switched to db test
myapp:SECONDARY> db.blog.find()
Error: error: {
    "ok" : 0,
    "errmsg" : "not master and slaveOk=false",
    "code" : 13435,
    "codeName" : "NotMasterNoSlaveOk"
}
myapp:SECONDARY> rs.slaveOk()
myapp:SECONDARY> db.blog.find()
{ "_id" : ObjectId("59082731008c534e0763e90a"), "title" : "My Blog Post" }
myapp:SECONDARY> quit()
Because the arbiter stores no data at all, the document just inserted cannot be read by connecting to the arbiter:
# mongo --port 27019
myapp:ARBITER> use test
switched to db test
myapp:ARBITER> db.blog.find()
Error: error: { "errmsg" : "not master and slaveOk=false", "code" : 13435, "codeName" : "NotMasterNoSlaveOk" }
myapp:ARBITER> rs.slaveOk()
myapp:ARBITER> db.blog.find()
Error: error: { "errmsg" : "node is not in primary or recovering state", "code" : 13436, "codeName" : "NotMasterOrSecondary" }
myapp:ARBITER> quit()
Simulate a primary failure and the replica set's automatic failover
# ps -ef | grep mongodb
root  2619     1  1 13:59 ?  00:02:58 mongod --replSet myapp --dbpath /data/27018 --port 27018 --logpath /var/log/mongodb/27018.log --fork
root  2643     1  1 13:59 ?  00:02:38 mongod --replSet myapp --dbpath /data/27019 --port 27019 --logpath /var/log/mongodb/27019.log --fork
root  2739     1  1 14:05 ?  00:03:12 mongod --replSet myapp --dbpath /data/27017 --port 27017 --logpath /var/log/mongodb/27017.log --fork
# kill -9 2739
Check the state of the replica set
Here, connect to the instance on port 27018:
# mongo --port 27018
myapp:PRIMARY> db.isMaster()
{
    "hosts" : [ "node3:27017", "node3:27018" ],
    "arbiters" : [ "node3:27019" ],
    "setName" : "myapp",
    "setVersion" : 3,
    "ismaster" : true,
    "secondary" : false,
    "primary" : "node3:27018",
    "me" : "node3:27018",
    "electionId" : ObjectId("7fffffff0000000000000002"),
    "lastWrite" : {
        "opTime" : { "ts" : Timestamp(1493716742, 1), "t" : NumberLong(2) },
        "lastWriteDate" : ISODate("2017-05-02T09:19:02Z")
    },
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2017-05-02T09:19:04.870Z"),
    "maxWireVersion" : 5,
    "minWireVersion" : 0,
    "readOnly" : false,
    "ok" : 1
}
As you can see, the Primary role has moved to the instance on port 27018.
Correspondingly, the 27018 instance logs the following:
2017-05-02T16:45:51.853+0800 I REPL     [replication-7] Restarting oplog query due to error: HostUnreachable: End of file. Last fetched optime (with hash): { ts: Timestamp 1493715649000|1 }[-5996450771261812604]. Restarts remaining: 3
2017-05-02T16:45:51.878+0800 I ASIO     [replication-7] dropping unhealthy pooled connection to node3:27017
2017-05-02T16:45:51.878+0800 I ASIO     [replication-7] after drop, pool was empty, going to spawn some connections
2017-05-02T16:45:51.879+0800 I ASIO     [replication-0] Failed to connect to node3:27017 - HostUnreachable: Connection refused
2017-05-02T16:45:51.880+0800 I REPL     [replication-8] Restarting oplog query due to error: HostUnreachable: Connection refused. Restarts remaining: 2
2017-05-02T16:45:51.883+0800 I REPL     [replication-7] Restarting oplog query due to error: HostUnreachable: Connection refused. Restarts remaining: 1
2017-05-02T16:45:51.884+0800 I REPL     [replication-8] Error returned from oplog query (no more query restarts left): HostUnreachable: Connection refused
2017-05-02T16:45:51.884+0800 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: HostUnreachable: Connection refused
2017-05-02T16:45:51.885+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
2017-05-02T16:45:54.837+0800 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to node3:27017: InvalidSyncSource: Sync source was cleared. Was node3:27017
2017-05-02T16:46:01.560+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-05-02T16:46:01.605+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-05-02T16:46:01.630+0800 I REPL     [ReplicationExecutor] VoteRequester(term 1 dry run) failed to receive response from node3:27017: HostUnreachable: Connection refused
2017-05-02T16:46:01.637+0800 I REPL     [ReplicationExecutor] VoteRequester(term 1 dry run) received a yes vote from node3:27019; response message: { term: 1, voteGranted: true, reason: "", ok: 1.0 }
2017-05-02T16:46:01.638+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running election
2017-05-02T16:46:01.672+0800 I REPL     [ReplicationExecutor] VoteRequester(term 2) failed to receive response from node3:27017
2017-05-02T16:46:01.689+0800 I REPL     [ReplicationExecutor] VoteRequester(term 2) received a yes vote from node3:27019
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-05-02T16:46:01.693+0800 I REPL     [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.
2017-05-02T16:46:02.094+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
From the log output we can see that:
On first detecting that the primary is unavailable, MongoDB drops the unhealthy pooled connections (dropping unhealthy pooled connection to node3:27017) and keeps probing; once no primary has been seen for 10s (electionTimeoutMillis: 10000), an election is held and the primary switches over automatically.
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
In fact, while the 27017 instance is down, the other two members keep running heartbeat checks against it:
2017-05-02T16:46:08.384+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
2017-05-02T16:46:10.384+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
When the 27017 instance comes back online, it automatically rejoins the replica set as a Secondary.
Its log output while starting up and rejoining the replica set looks like this:
2017-05-02T17:00:10.616+0800 I CONTROL  [initandlisten] MongoDB starting : pid=3141 port=27017 dbpath=/data/27017 64-bit host=node3
2017-05-02T17:00:10.616+0800 W -        [initandlisten] Detected unclean shutdown - /data/27017/mongod.lock is not empty.
2017-05-02T17:00:10.645+0800 I -        [initandlisten] Detected data files in /data/27017 created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2017-05-02T17:00:10.645+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2017-05-02T17:00:11.402+0800 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T17:00:11.436+0800 I STORAGE  [initandlisten] The size storer reports that the oplog contains 1040 records
2017-05-02T17:00:11.797+0800 I REPL     [replExecDBWorker-2] New replica set config in use: { _id: "myapp", version: 3, ... }
2017-05-02T17:00:11.797+0800 I REPL     [replExecDBWorker-2] This node is node3:27017 in the config
2017-05-02T17:00:11.798+0800 I REPL     [replExecDBWorker-2] transition to STARTUP2
2017-05-02T17:00:11.802+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state ARBITER
2017-05-02T17:00:11.803+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state PRIMARY
2017-05-02T17:00:12.116+0800 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-05-02T17:00:17.802+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T17:00:17.875+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T17:00:18.211+0800 I REPL     [rsBackgroundSync] sync source candidate: node3:27018
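Killing the mongod process with kill -9 is a deliberately blunt way to test failover. As a side note beyond the walkthrough above, a controlled switchover can also be triggered from the mongo shell with the standard rs.stepDown() helper:

```javascript
// Run on the current primary: step down and refuse to be
// re-elected for 60 seconds, forcing the set to elect a new primary.
rs.stepDown(60)
```

The shell connection is dropped and reestablished, after which the former primary shows up as SECONDARY in rs.status().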
References
1. MongoDB in Action
2. MongoDB: The Definitive Guide
3. The official MongoDB documentation