A replica set is MongoDB's built-in high-availability solution. Unlike the legacy master-slave replication, a replica set automatically detects that the Primary has gone down and promotes one of the Secondaries to Primary.
The whole process is transparent to the application and greatly reduces operational overhead.
The architecture diagram is as follows:
Roles in a MongoDB replica set
1. Primary
By default, all reads and writes go to the Primary.
2. Secondary
Replays all operations from the Primary via the oplog and holds a complete copy of the Primary's data.
By default, it accepts neither writes nor reads.
Depending on requirements, a Secondary can additionally be configured in the following forms:
1> Priority 0 Replica Set Members
A member with priority 0 can never be elected primary.
A MongoDB replica set allows different priorities to be assigned to different members.
Priority ranges from 0 to 1000, may be a floating-point value, and defaults to 1.
The member with the highest priority is preferred in elections for primary.
For example, suppose a member node3:27020 with priority 2 is added to a replica set in which every other member has priority 1. As long as node3:27020 has the freshest data, the current primary will automatically step down and node3:27020 will be elected the new primary; if node3:27020's data is not fresh enough, the current primary stays in place until node3:27020 has caught up.
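As a sketch, a member's priority can be changed with the standard rs.reconfig() helper; the member index members[1] below is an assumption and should be checked against rs.conf() first:

```javascript
// Run on the primary: raise the priority of member 1 to 2.
// NOTE: the index [1] is hypothetical -- inspect rs.conf() to find
// the member you actually want to change.
cfg = rs.conf()
cfg.members[1].priority = 2
rs.reconfig(cfg)  // may trigger an election if it changes who is eligible
```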
2> Hidden Replica Set Members (hidden members)
A hidden member also has priority 0, and is additionally invisible to clients.
Hidden members show up in rs.status() and rs.config(), but not in db.isMaster(). Since clients connecting to a replica set call db.isMaster() to discover the available members,
hidden members never receive read requests from clients.
Hidden members are typically used for dedicated tasks such as reporting and backups.
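A hidden member can be configured the same way; a minimal sketch, again assuming the member index:

```javascript
// Run on the primary: hide member 2.
// A hidden member must also have priority 0.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
rs.reconfig(cfg)
```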
3> Delayed Replica Set Members (delayed members)
A delayed member lags behind the primary by a configurable amount of time (set via the slaveDelay option).
A delayed member must also be hidden.
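A one-hour delayed member could be configured as follows (a sketch; the member index and the 3600-second delay are illustrative):

```javascript
// slaveDelay is in seconds; a delayed member must be hidden
// and have priority 0.
cfg = rs.conf()
cfg.members[2].priority = 0
cfg.members[2].hidden = true
cfg.members[2].slaveDelay = 3600
rs.reconfig(cfg)
```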
3. Arbiter
An arbiter only takes part in voting, always with a voting weight of exactly 1; it replicates no data and can never be promoted to primary.
Arbiters are commonly used in replica sets with an even number of members.
Recommendation: deploy the arbiter on an application server, never on the same server as a Primary or Secondary.
Note: a replica set can have at most 50 members, of which at most 7 may vote.
Building a MongoDB replica set
Create the data directories
# mkdir -p /data/27017
# mkdir -p /data/27018
# mkdir -p /data/27019
To make the runtime logs easier to follow, create a separate log file for each instance.
# mkdir -p /var/log/mongodb/
Start the mongod instances
# mongod --replSet myapp --dbpath /data/27017 --port 27017 --logpath /var/log/mongodb/27017.log --fork
# mongod --replSet myapp --dbpath /data/27018 --port 27018 --logpath /var/log/mongodb/27018.log --fork
# mongod --replSet myapp --dbpath /data/27019 --port 27019 --logpath /var/log/mongodb/27019.log --fork
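The same options can also be kept in a YAML config file and passed with mongod -f; the file path below is hypothetical:

```yaml
# /etc/mongod-27017.conf (hypothetical path)
replication:
  replSetName: myapp
storage:
  dbPath: /data/27017
net:
  port: 27017
systemLog:
  destination: file
  path: /var/log/mongodb/27017.log
processManagement:
  fork: true
```

Then start the instance with `mongod -f /etc/mongod-27017.conf`.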
Taking the instance on port 27017 as an example, its log output looks like this:
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] MongoDB starting : pid=2739 port=27017 dbpath=/data/27017 64-bit host=node3
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] db version v3.4.2
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] git version: 3f76e40c105fc223b3e5aac3e20dcd026b83b38b
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] allocator: tcmalloc
2017-05-02T14:05:22.745+0800 I CONTROL  [initandlisten] options: { net: { port: 27017 }, processManagement: { fork: true }, replication: { replSet: "myapp" }, storage: { dbPath: "/data/27017" }, systemLog: { destination: "file", path: "/var/log/mongodb/27017.log" } }
2017-05-02T14:05:22.768+0800 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2017-05-02T14:05:22.768+0800 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2017-05-02T14:05:22.769+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=256M,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2017-05-02T14:05:24.482+0800 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2017-05-02T14:05:24.516+0800 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/27017/diagnostic.data'
2017-05-02T14:05:24.517+0800 I REPL     [initandlisten] Did not find local voted for document at startup.
2017-05-02T14:05:24.518+0800 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2017-05-02T14:05:24.519+0800 I NETWORK  [thread1] waiting for connections on port 27017
Connect to any member of the replica set with mongo; here, the instance on port 27017:
# mongo
Initialize the replica set
> rs.initiate()
{
    "info2" : "no configuration specified. Using a default configuration for the set",
    "me" : "node3:27017",
    "ok" : 1
}
The current replica set configuration can be viewed with rs.conf():
myapp:PRIMARY> rs.conf()
{
    "_id" : "myapp",
    "version" : 1,
    "protocolVersion" : NumberLong(1),
    "members" : [
        {
            "_id" : 0,
            "host" : "node3:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : { },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 2000,
        "heartbeatTimeoutSecs" : 10,
        "electionTimeoutMillis" : 10000,
        "catchUpTimeoutMillis" : 2000,
        "getLastErrorModes" : { },
        "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
        "replicaSetId" : ObjectId("59082229517dd35bb9fd0d2a")
    }
}
The options under settings have the following meanings:
chainingAllowed: whether chained (cascading) replication is allowed, i.e. whether a secondary may sync from another secondary
heartbeatIntervalMillis: the heartbeat interval, 2s by default
heartbeatTimeoutSecs: the heartbeat timeout, 10s by default; if no heartbeat is received from a member within 10s, that member is judged unreachable (HostUnreachable). This applies to the Primary and Secondaries alike.
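These settings can be adjusted with rs.reconfig(); the values below are purely illustrative, not recommendations:

```javascript
// Run on the primary: shorten failure detection (illustrative values).
cfg = rs.conf()
cfg.settings.heartbeatTimeoutSecs = 5
cfg.settings.electionTimeoutMillis = 5000
rs.reconfig(cfg)
```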
The corresponding log output is as follows:
# vim /var/log/mongodb/27017.log
2017-05-02T14:06:47.361+0800 I NETWORK  [thread1] connection accepted from 127.0.0.1:32824 #1 (1 connection now open)
2017-05-02T14:07:36.737+0800 I COMMAND  [conn1] initiate : no configuration specified. Using a default configuration for the set
2017-05-02T14:07:36.887+0800 I COMMAND  [conn1] created this configuration for initiation : { _id: "myapp", version: 1, members: [ { _id: 0, host: "node3:27017" } ] }
2017-05-02T14:07:36.900+0800 I REPL     [conn1] replSetInitiate admin command received from client
2017-05-02T14:07:37.391+0800 I REPL     [conn1] replSetInitiate config object with 1 members parses ok
2017-05-02T14:07:37.410+0800 I REPL     [conn1] ******
2017-05-02T14:07:37.410+0800 I REPL     [conn1] creating replication oplog of size: 990MB...
2017-05-02T14:07:37.439+0800 I STORAGE  [conn1] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T14:07:37.440+0800 I STORAGE  [conn1] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2017-05-02T14:07:37.440+0800 I STORAGE  [conn1] Scanning the oplog to determine where to place markers for truncation
2017-05-02T14:07:37.568+0800 I INDEX    [conn1] build index on: admin.system.version properties: { v: 2, key: { version: 1 }, name: "incompatible_with_version_32", ns: "admin.system.version" }
2017-05-02T14:07:37.568+0800 I INDEX    [conn1] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-05-02T14:07:37.581+0800 I INDEX    [conn1] build index done.  scanned 0 total records. 0 secs
2017-05-02T14:07:37.591+0800 I COMMAND  [conn1] setting featureCompatibilityVersion to 3.4
2017-05-02T14:07:37.601+0800 I REPL     [conn1] New replica set config in use: { _id: "myapp", version: 1, protocolVersion: 1, members: [ { _id: 0, host: "node3:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 2000, replicaSetId: ObjectId('59082229517dd35bb9fd0d2a') } }
2017-05-02T14:07:37.601+0800 I REPL     [conn1] This node is node3:27017 in the config
2017-05-02T14:07:37.601+0800 I REPL     [conn1] transition to STARTUP2
2017-05-02T14:07:37.602+0800 I REPL     [conn1] Starting replication storage threads
2017-05-02T14:07:37.603+0800 I REPL     [conn1] Starting replication fetcher thread
2017-05-02T14:07:37.617+0800 I REPL     [conn1] Starting replication applier thread
2017-05-02T14:07:37.617+0800 I REPL     [conn1] Starting replication reporter thread
2017-05-02T14:07:37.617+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T14:07:37.628+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T14:07:37.646+0800 I REPL     [rsSync] conducting a dry run election to see if we could be elected
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running election
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 1
2017-05-02T14:07:37.675+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-05-02T14:07:38.687+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
Add a member
myapp:PRIMARY> rs.add("node3:27018")
{ "ok" : 1 }
The log of the 27017 instance shows:
2017-05-02T15:54:44.765+0800 I REPL     [conn1] replSetReconfig admin command received from client
2017-05-02T15:54:44.808+0800 I REPL     [conn1] replSetReconfig config object with 2 members parses ok
2017-05-02T15:54:44.928+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to node3:27018
2017-05-02T15:54:44.979+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to node3:27018
2017-05-02T15:54:44.994+0800 I NETWORK  [thread1] connection accepted from 192.168.244.30:38291 #3 (3 connections now open)
2017-05-02T15:54:45.105+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "myapp", version: 2, ... }
2017-05-02T15:54:45.105+0800 I REPL     [ReplicationExecutor] This node is node3:27017 in the config
2017-05-02T15:54:45.155+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state STARTUP
2017-05-02T15:54:49.159+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state SECONDARY
The log of the 27018 instance shows:
2017-05-02T15:54:47.101+0800 I REPL     [replExecDBWorker-0] Starting replication storage threads
2017-05-02T15:54:47.174+0800 I REPL     [replication-0] Starting initial sync (attempt 1 of 10)
2017-05-02T15:54:47.174+0800 I REPL     [ReplicationExecutor] transition to STARTUP2
2017-05-02T15:54:47.175+0800 I REPL     [ReplicationExecutor] Member node3:27017 is now in state PRIMARY
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] sync source candidate: node3:27017
2017-05-02T15:54:47.217+0800 I STORAGE  [replication-0] dropAllDatabasesExceptLocal 1
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] ******
2017-05-02T15:54:47.217+0800 I REPL     [replication-0] creating replication oplog of size: 990MB...
2017-05-02T15:54:47.232+0800 I STORAGE  [replication-0] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T15:54:48.046+0800 I REPL     [replication-0] CollectionCloner::start called, on ns:admin.system.version
2017-05-02T15:54:48.177+0800 I COMMAND  [InitialSyncInserters-admin.system.version0] setting featureCompatibilityVersion to 3.4
2017-05-02T15:54:48.264+0800 I INDEX    [InitialSyncInserters-test.blog0] build index on: test.blog properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "test.blog" }
2017-05-02T15:54:48.271+0800 I REPL     [replication-1] No need to apply operations. (currently at { : Timestamp 1493711685000|1 })
2017-05-02T15:54:48.352+0800 I REPL     [replication-1] initial sync done; took 1s.
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication fetcher thread
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication applier thread
2017-05-02T15:54:48.353+0800 I REPL     [replExecDBWorker-0] Starting replication reporter thread
2017-05-02T15:54:48.366+0800 I REPL     [rsBackgroundSync] could not find member to sync from
2017-05-02T15:54:48.367+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T15:54:48.367+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T15:55:03.392+0800 I REPL     [rsBackgroundSync] sync source candidate: node3:27017
Add an arbiter
myapp:PRIMARY> rs.addArb("node3:27019")
{ "ok" : 1 }
The log of the 27017 instance shows:
2017-05-02T16:06:59.098+0800 I REPL     [conn1] replSetReconfig admin command received from client
2017-05-02T16:06:59.116+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to node3:27019
2017-05-02T16:06:59.131+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to node3:27019
2017-05-02T16:06:59.137+0800 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "myapp", version: 3, ... }
2017-05-02T16:06:59.137+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state STARTUP
2017-05-02T16:07:01.132+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state ARBITER
The log of the 27019 instance shows:
2017-05-02T16:06:59.295+0800 I REPL     [ReplicationExecutor] transition to ARBITER
Check the state of the replica set
myapp:PRIMARY> rs.status()
{
    "set" : "myapp",
    "date" : ISODate("2017-05-02T08:10:59.174Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
        "appliedOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) }
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "node3:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 7537,
            "optime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2017-05-02T08:10:49Z"),
            "electionTime" : Timestamp(1493705257, 1),
            "electionDate" : ISODate("2017-05-02T06:07:37Z"),
            "configVersion" : 3,
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "node3:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 974,
            "optime" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDurable" : { "ts" : Timestamp(1493712649, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2017-05-02T08:10:49Z"),
            "optimeDurableDate" : ISODate("2017-05-02T08:10:49Z"),
            "lastHeartbeat" : ISODate("2017-05-02T08:10:57.606Z"),
            "lastHeartbeatRecv" : ISODate("2017-05-02T08:10:58.224Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "node3:27017",
            "configVersion" : 3
        },
        {
            "_id" : 2,
            "name" : "node3:27019",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 240,
            "lastHeartbeat" : ISODate("2017-05-02T08:10:57.607Z"),
            "lastHeartbeatRecv" : ISODate("2017-05-02T08:10:54.391Z"),
            "pingMs" : NumberLong(0),
            "configVersion" : 3
        }
    ],
    "ok" : 1
}
A replica set can also be created in one step by passing a configuration document:
> cfg = {
...     "_id" : "myapp",
...     "members" : [
...         { "_id" : 0, "host" : "node3:27017" },
...         { "_id" : 1, "host" : "node3:27018" }
...     ]
... }
> rs.initiate(cfg)
Verify that the replica set works
Create a collection on the primary and insert a document as a test:
# mongo
myapp:PRIMARY> show dbs
admin  0.000GB
local  0.000GB
myapp:PRIMARY> use test
switched to db test
myapp:PRIMARY> db.blog.insert({"title" : "My Blog Post"})
WriteResult({ "nInserted" : 1 })
myapp:PRIMARY> db.blog.find()
{ "_id" : ObjectId("59082731008c534e0763e90a"), "title" : "My Blog Post" }
myapp:PRIMARY> quit()
Verify on the secondary:
# mongo --port 27018
myapp:SECONDARY> use test
switched to db test
myapp:SECONDARY> db.blog.find()
Error: error: {
    "ok" : 0,
    "errmsg" : "not master and slaveOk=false",
    "code" : 13435,
    "codeName" : "NotMasterNoSlaveOk"
}
myapp:SECONDARY> rs.slaveOk()
myapp:SECONDARY> db.blog.find()
{ "_id" : ObjectId("59082731008c534e0763e90a"), "title" : "My Blog Post" }
myapp:SECONDARY> quit()
Because the arbiter stores no data at all, the document just inserted cannot be read by connecting to the arbiter:
# mongo --port 27019
myapp:ARBITER> use test
switched to db test
myapp:ARBITER> db.blog.find()
Error: error: { "errmsg" : "not master and slaveOk=false", "code" : 13435, "codeName" : "NotMasterNoSlaveOk" }
myapp:ARBITER> rs.slaveOk()
myapp:ARBITER> db.blog.find()
Error: error: { "errmsg" : "node is not in primary or recovering state", "code" : 13436, "codeName" : "NotMasterOrSecondary" }
myapp:ARBITER> quit()
Simulate a primary failure and the replica set's automatic failover
# ps -ef | grep mongodb
root  2619     1  1 13:59 ?  00:02:58 mongod --replSet myapp --dbpath /data/27018 --port 27018 --logpath /var/log/mongodb/27018.log --fork
root  2643     1  1 13:59 ?  00:02:38 mongod --replSet myapp --dbpath /data/27019 --port 27019 --logpath /var/log/mongodb/27019.log --fork
root  2739     1  1 14:05 ?  00:03:12 mongod --replSet myapp --dbpath /data/27017 --port 27017 --logpath /var/log/mongodb/27017.log --fork
# kill -9 2739
Check the state of the replica set
Here, connect to the instance on port 27018:
# mongo --port 27018
myapp:PRIMARY> db.isMaster()
{
    "hosts" : [ "node3:27017", "node3:27018" ],
    "arbiters" : [ "node3:27019" ],
    "setName" : "myapp",
    "setVersion" : 3,
    "ismaster" : true,
    "secondary" : false,
    "primary" : "node3:27018",
    "me" : "node3:27018",
    "electionId" : ObjectId("7fffffff0000000000000002"),
    "lastWrite" : {
        "opTime" : { "ts" : Timestamp(1493716742, 1), "t" : NumberLong(2) },
        "lastWriteDate" : ISODate("2017-05-02T09:19:02Z")
    },
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2017-05-02T09:19:04.870Z"),
    "maxWireVersion" : 5,
    "minWireVersion" : 0,
    "readOnly" : false,
    "ok" : 1
}
As you can see, the Primary role has moved to the instance on port 27018.
Correspondingly, the 27018 instance logs the following:
2017-05-02T16:45:51.853+0800 I REPL     [replication-7] Restarting oplog query due to error: HostUnreachable: End of file. Last fetched optime (with hash): { ts: Timestamp 1493715649000|1 }[-5996450771261812604]. Restarts remaining: 3
2017-05-02T16:45:51.878+0800 I ASIO     [replication-7] dropping unhealthy pooled connection to node3:27017
2017-05-02T16:45:51.878+0800 I ASIO     [replication-7] after drop, pool was empty, going to spawn some connections
2017-05-02T16:45:51.879+0800 I ASIO     [replication-0] Failed to connect to node3:27017 - HostUnreachable: Connection refused
2017-05-02T16:45:51.880+0800 I REPL     [replication-8] Restarting oplog query due to error: HostUnreachable: Connection refused. Restarts remaining: 2
2017-05-02T16:45:51.883+0800 I REPL     [replication-7] Restarting oplog query due to error: HostUnreachable: Connection refused. Restarts remaining: 1
2017-05-02T16:45:51.884+0800 I REPL     [replication-8] Error returned from oplog query (no more query restarts left): HostUnreachable: Connection refused
2017-05-02T16:45:51.884+0800 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: HostUnreachable: Connection refused
2017-05-02T16:45:51.885+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
2017-05-02T16:45:54.837+0800 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to node3:27017: InvalidSyncSource: Sync source was cleared. Was node3:27017
2017-05-02T16:46:01.560+0800 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-05-02T16:46:01.605+0800 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-05-02T16:46:01.630+0800 I REPL     [ReplicationExecutor] VoteRequester(term 1 dry run) failed to receive response from node3:27017: HostUnreachable: Connection refused
2017-05-02T16:46:01.637+0800 I REPL     [ReplicationExecutor] VoteRequester(term 1 dry run) received a yes vote from node3:27019; response message: { term: 1, voteGranted: true, reason: "", ok: 1.0 }
2017-05-02T16:46:01.638+0800 I REPL     [ReplicationExecutor] dry election run succeeded, running election
2017-05-02T16:46:01.672+0800 I REPL     [ReplicationExecutor] VoteRequester(term 2) failed to receive response from node3:27017
2017-05-02T16:46:01.689+0800 I REPL     [ReplicationExecutor] VoteRequester(term 2) received a yes vote from node3:27019
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 2
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-05-02T16:46:01.693+0800 I REPL     [ReplicationExecutor] My optime is most up-to-date, skipping catch-up and completing transition to primary.
2017-05-02T16:46:02.094+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
From the log output we can see that:
On first detecting that the primary is unavailable, MongoDB drops the unhealthy pooled connections (dropping unhealthy pooled connection to node3:27017) and keeps probing; once no primary has been seen for 10s (electionTimeoutMillis: 10000), an election is held and the primary switches over automatically.
2017-05-02T16:46:01.691+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
In fact, while the 27017 instance is down, the other two members keep running heartbeat checks against it:
2017-05-02T16:46:08.384+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
2017-05-02T16:46:10.384+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to node3:27017; HostUnreachable: Connection refused
When the 27017 instance comes back online, it automatically rejoins the replica set as a Secondary.
Its log output while starting up and rejoining the replica set looks like this:
2017-05-02T17:00:10.616+0800 I CONTROL  [initandlisten] MongoDB starting : pid=3141 port=27017 dbpath=/data/27017 64-bit host=node3
2017-05-02T17:00:10.616+0800 W -        [initandlisten] Detected unclean shutdown - /data/27017/mongod.lock is not empty.
2017-05-02T17:00:10.645+0800 I -        [initandlisten] Detected data files in /data/27017 created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2017-05-02T17:00:10.645+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2017-05-02T17:00:11.402+0800 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2017-05-02T17:00:11.436+0800 I STORAGE  [initandlisten] The size storer reports that the oplog contains 1040 records
2017-05-02T17:00:11.797+0800 I REPL     [replExecDBWorker-2] New replica set config in use: { _id: "myapp", version: 3, ... }
2017-05-02T17:00:11.797+0800 I REPL     [replExecDBWorker-2] This node is node3:27017 in the config
2017-05-02T17:00:11.798+0800 I REPL     [replExecDBWorker-2] transition to STARTUP2
2017-05-02T17:00:11.802+0800 I REPL     [ReplicationExecutor] Member node3:27019 is now in state ARBITER
2017-05-02T17:00:11.803+0800 I REPL     [ReplicationExecutor] Member node3:27018 is now in state PRIMARY
2017-05-02T17:00:12.116+0800 I FTDC     [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-05-02T17:00:17.802+0800 I REPL     [rsSync] transition to RECOVERING
2017-05-02T17:00:17.875+0800 I REPL     [rsSync] transition to SECONDARY
2017-05-02T17:00:18.211+0800 I REPL     [rsBackgroundSync] sync source candidate: node3:27018
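Killing the mongod process with kill -9 is a deliberately blunt way to test failover. As a side note beyond the walkthrough above, a controlled switchover can also be triggered from the mongo shell with the standard rs.stepDown() helper:

```javascript
// Run on the current primary: step down and refuse to be
// re-elected for 60 seconds, forcing the set to elect a new primary.
rs.stepDown(60)
```

The shell connection is dropped and reestablished, after which the former primary shows up as SECONDARY in rs.status().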
References
1. MongoDB in Action
2. MongoDB: The Definitive Guide
3. The official MongoDB documentation