Problem description
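I have three PostgreSQL 11 database servers running: one is the source, and the other two replicate data from it through subscriptions. One target DB is on the same network as the source, while the other is on a different network, separated by a firewall. The setup looks like this: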
A (source DB- publication) ----------(same network)-------> B (target DB - subscription)
|
|------(firewall)-----------> C (target DB - subscription)
Recently I got an error on the source database saying that a requested WAL file had already been removed.
2021-06-23 10:10:14.937 JST,"user","sourcedb",2xxxx9,"1x.xx.xxx.xxx:2xxxx",xxxxxx.xxx,4,"idle",2021-06-23 10:10:14 JST,15/0,ERROR,58P01,"requested WAL segment 0000000100010E0600000030 has already been removed","sub_xxx_xxx_xxx"
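To relate the segment name in that error to the LSNs PostgreSQL reports elsewhere, the file name can be decoded by hand. Below is a minimal sketch, assuming the default 16 MiB WAL segment size (the post does not say whether it was changed at initdb); the helper names `decode_wal_segment` and `format_lsn` are made up for illustration.

```python
# Assumption: default 16 MiB WAL segments (not stated in the post).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE  # 256 with 16 MiB segments


def format_lsn(position: int) -> str:
    """Render a 64-bit WAL position in PostgreSQL's 'hi/lo' hex notation."""
    return f"{position >> 32:X}/{position & 0xFFFFFFFF:X}"


def decode_wal_segment(name: str) -> dict:
    """Split a 24-hex-digit WAL file name into its timeline and LSN range."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)   # first 8 hex digits: timeline ID
    log_id = int(name[8:16], 16)    # middle 8 hex digits: high part of the LSN
    seg_id = int(name[16:24], 16)   # last 8 hex digits: segment within that log
    segno = log_id * SEGMENTS_PER_LOG + seg_id
    start = segno * WAL_SEGMENT_SIZE
    end = start + WAL_SEGMENT_SIZE - 1
    return {
        "timeline": timeline,
        "start_lsn": format_lsn(start),
        "end_lsn": format_lsn(end),
    }


info = decode_wal_segment("0000000100010E0600000030")
print(info)
```

Under that assumption, the removed segment is on timeline 1 and covers the 16 MiB of WAL starting at 10E06/30000000, which is the range the walsender was asked to read.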
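On checking, I saw that the subscription on target DB C had failed and stopped. The subscription on target DB B had no problems at all (probably because it is on the same network).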
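The settings are the same on all three DBs. On source DB A and on the other two, files are archived from pg_wal to another directory on disk, /opt/postgresql/archivewal.
However, a cron job removes the files from this /opt/postgresql/archivewal folder every 2 minutes and stores them on a separate backup server.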
Here is the postgresql.conf file. It is essentially the same on all servers:
# Connection settings
# -------------------
listen_addresses = '*'
port = 5532
max_connections = 400
tcp_keepalives_idle = 0
tcp_keepalives_interval = 0
tcp_keepalives_count = 0
# Memory-related settings
# -----------------------
shared_buffers = 32GB # Physical memory 1/4
##DEBUG: mmap(1652555776) with MAP_HUGETLB Failed,huge pages disabled: Cannot allocate memory
#huge_pages = try # on,off,or try
#temp_buffers = 16MB # depends on DB checklist
work_mem = 8MB # Need tuning
effective_cache_size = 64GB # Physical memory 1/2
maintenance_work_mem = 512MB
wal_buffers = 64MB
# WAL/Replication/HA settings
# --------------------
wal_level = logical
synchronous_commit = remote_write
archive_mode = on
archive_command = 'rsync -a %p /opt/postgresql/archivewal/%f'
#archive_command = ':'
max_wal_senders=5
hot_standby = on
restart_after_crash = off
wal_sender_timeout = 5000
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
hot_standby_feedback = on
random_page_cost = 1.5
max_wal_size = 5GB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
# Logging settings
# ----------------
log_destination = 'csvlog,syslog'
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql_%Y%m%d.log'
log_truncate_on_rotation = off
log_rotation_age = 1h
log_rotation_size = 0
log_timezone = 'Japan'
log_line_prefix = '%t [%p]: [%l-1] %h:%u@%d:[XXXX]:CODE:%e '
log_statement = 'all'
log_min_messages = info # DEBUG5
log_min_error_statement = info # DEBUG5
log_error_verbosity = default
log_checkpoints = on
log_lock_waits = on
log_temp_files = 0
log_connections = on
log_disconnections = on
log_duration = off
log_min_duration_statement = 1000
log_autovacuum_min_duration = 3000ms
track_functions = pl
track_activity_query_size = 8192
# Locale/display settings
# -----------------------
lc_messages = 'C'
lc_monetary = 'en_US.UTF-8' # ja_JP.eucJP
lc_numeric = 'en_US.UTF-8' # ja_JP.eucJP
lc_time = 'en_US.UTF-8' # ja_JP.eucJP
timezone = 'Asia/Tokyo'
bytea_output = 'escape'
# Auto vacuum settings
# -----------------------
autovacuum = on
autovacuum_max_workers = 3
autovacuum_vacuum_cost_limit = 200
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = 10000
auto_explain.log_analyze = on
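My question is: why did target DB C ask source DB A for these files when, as I confirmed on the backup server, they had been deleted 10 minutes before the error occurred? From what I have read, the source DB does not recycle or archive any file that is still needed by a replication slot. This was borne out when I re-subscribed on target DB C: pg_wal grew to a considerable size and replication succeeded.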
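Would extending the interval at which files are deleted from the archive folder from 2 minutes to 10 minutes really solve this? I would rather not rely on trial and error; I want to understand the exact logic.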
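This publication/subscription setup has been running for more than a year, and I have hit this problem only twice. Could it be related to a high data volume?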
Edit 1: Adding information about the current replication state
Source DB
sourcedb=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+----------------+---------------------
sub_xxxx_xxxx_targetdbA | pgoutput | logical | 16501 | sourcedb | f | t | 208603 | | 96140914 | 10E82/9404CA50 | 10E82/9E3AFE18
sub_xxxx_xxxx_targetdbB | pgoutput | logical | 16501 | sourcedb | f | t | 208175 | | 96140914 | 10E82/9404CA50 | 10E82/9E3AFDE8
(2 rows)
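For reading the slot output above, it can help to turn the LSNs into numbers. The following is a rough sketch that computes how far restart_lsn trails confirmed_flush_lsn for the first slot, and which WAL file holds restart_lsn. It assumes 16 MiB segments and that the server is still on timeline 1 (taken from the `00000001` prefix in the error message); `parse_lsn` and `oldest_needed_segment` are illustrative names, not PostgreSQL functions.

```python
# Assumption: default 16 MiB WAL segments (not stated in the post).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(lsn: str) -> int:
    """Convert 'hi/lo' hex notation into a single 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def oldest_needed_segment(restart_lsn: str, timeline: int = 1) -> str:
    """Name of the WAL file that contains restart_lsn on the given timeline."""
    segno = parse_lsn(restart_lsn) // WAL_SEGMENT_SIZE
    return (
        f"{timeline:08X}"
        f"{segno // SEGMENTS_PER_LOG:08X}"
        f"{segno % SEGMENTS_PER_LOG:08X}"
    )


# Values copied from the first row of pg_replication_slots above.
restart_lsn = "10E82/9404CA50"
confirmed_flush_lsn = "10E82/9E3AFE18"

gap = parse_lsn(confirmed_flush_lsn) - parse_lsn(restart_lsn)
print(f"restart_lsn trails confirmed_flush_lsn by {gap / 1024 / 1024:.1f} MiB")
print("oldest segment this slot still needs:", oldest_needed_segment(restart_lsn))
```

Comparing that file name with the one in the error message is one way to check whether a requested segment is older than what a slot currently holds back.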
Target DB A
targetdbA=# select * from pg_stat_subscription;
subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
------------+-------------------------+--------+-------+----------------+-------------------------------+------------------------------+----------------+-------------------------------
2378695757 | sub_xxxx_xxx_targetdbA | 231891 | | 10E82/ACE84D98 | 2021-06-25 13:15:51.866896+09 | 2021-06-25 13:15:52.39434+09 | 10E82/AB6BA340 | 2021-06-25 13:15:48.857865+09
(1 row)
Target DB B
targetdbB=# select * from pg_stat_subscription;
subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn | latest_end_time
------------+---------------------+-------+-------+----------------+-------------------------------+-------------------------------+----------------+-------------------------------
3479436453 | sub_xxxx_xxx_targetdbB | 99318 | | 10E83/1D9A8F88 | 2021-06-25 13:20:15.130238+09 | 2021-06-25 13:20:17.868138+09 | 10E83/1D0F2D10 | 2021-06-25 13:20:13.305298+09
(1 row)
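Edit 2: This is only my guess, and I am not sure whether it is the real cause. I noticed that the walsender on node A (the source DB) keeps getting killed with the following error.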
terminating walsender process due to replication timeout
This happens several times a day, especially when the data volume on the source node is high.
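To check whether the timeouts really cluster around busy periods, the csvlog could be tallied per hour. Here is a minimal sketch, assuming the log is in the csvlog format shown earlier with the timestamp in the first column; the sample rows are made up for illustration and `timeouts_per_hour` is a hypothetical helper.

```python
import csv
from collections import Counter

TIMEOUT_MESSAGE = "terminating walsender process due to replication timeout"


def timeouts_per_hour(csv_lines) -> Counter:
    """Count walsender timeout messages, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for row in csv.reader(csv_lines):
        if not row:
            continue
        # Search every field so the sketch does not depend on column positions.
        if any(TIMEOUT_MESSAGE in field for field in row):
            counts[row[0][:13]] += 1
    return counts


# Hypothetical sample rows, shaped like the csvlog line quoted in this post.
sample = [
    '2021-06-23 09:58:01.120 JST,"user","sourcedb",1001,"10.0.0.3:5000",a.1,3,"idle",'
    '2021-06-23 09:00:00 JST,15/0,0,LOG,00000,"' + TIMEOUT_MESSAGE + '"',
    '2021-06-23 09:59:30.500 JST,"user","sourcedb",1002,"10.0.0.3:5001",a.2,3,"idle",'
    '2021-06-23 09:58:05 JST,16/0,0,LOG,00000,"' + TIMEOUT_MESSAGE + '"',
    '2021-06-23 10:10:14.937 JST,"user","sourcedb",1003,"10.0.0.3:5002",a.3,4,"idle",'
    '2021-06-23 10:10:14 JST,15/0,0,ERROR,58P01,'
    '"requested WAL segment 0000000100010E0600000030 has already been removed"',
    '2021-06-23 10:12:02.000 JST,"user","sourcedb",1004,"10.0.0.3:5003",a.4,3,"idle",'
    '2021-06-23 10:11:00 JST,17/0,0,LOG,00000,"' + TIMEOUT_MESSAGE + '"',
]

counts = timeouts_per_hour(sample)
for hour, n in sorted(counts.items()):
    print(hour, n)
```

For a real log, an open file handle (opened with `newline=""`) can be passed in place of the sample list.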
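Most of the time it restarts automatically and quickly, and the receiving node fetches the missing files from pg_wal or from the archived WAL folder. But on the day in question (the one in the log above), the process took longer to restart, so it may have missed the chance to fetch those files.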
My wal_sender_timeout is set to 1 minute (the conf file above shows 5 seconds, but it was updated later). The target nodes have the following settings:
postgres=# show wal_receiver_status_interval ;
wal_receiver_status_interval
------------------------------
2s
(1 row)
postgres=# show wal_retrieve_retry_interval ;
wal_retrieve_retry_interval
-----------------------------
5s
(1 row)
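As a quick sanity check on these values, the settings can be normalised to milliseconds and compared. This is only a sketch of the arithmetic: it assumes the unit suffixes ms, s, min, h and d, and that a bare number uses the parameter's default unit (milliseconds for wal_sender_timeout, seconds for wal_receiver_status_interval). `to_ms` is a hypothetical helper, and the comparison says nothing about why the walsender actually times out.

```python
import re

# Time units accepted in the settings discussed here, in milliseconds.
UNIT_MS = {"ms": 1, "s": 1000, "min": 60_000, "h": 3_600_000, "d": 86_400_000}


def to_ms(value: str, default_unit: str) -> int:
    """Convert a setting such as '2s', '1min' or '5000' into milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse duration: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_MS[unit or default_unit]


status_interval = to_ms("2s", "s")      # wal_receiver_status_interval on the targets
old_timeout = to_ms("5000", "ms")       # wal_sender_timeout as shown in the conf file
new_timeout = to_ms("1min", "ms")       # wal_sender_timeout after the later update

for label, timeout in (("old", old_timeout), ("new", new_timeout)):
    reports = timeout // status_interval
    print(f"{label} timeout {timeout} ms: room for {reports} status reports")
```

With the original 5 second timeout there is room for only about two status reports per timeout window, while the 1 minute setting leaves room for about thirty.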
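Why does the walsender keep dropping? Is it because responses/acknowledgements from the receiving (target) node get lost when the data volume is high?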