Replication shinanigans

In Debian, package upgrades trigger restart of the related service.
After I’ve restarted all my postgresql nodes (1 primary + 2 hot standby) at the same time, I’ve realized one of the secondaries were not able to process WAL files any further.

Checking the logs gave me an idea:

2019-07-09 13:20:32 UTC [25065-1] postgres@[local] FATAL:  the database system is starting up

gzip: stdin: unexpected end of file
ERROR: Remote 'barman get-wal' command has failed!
2019-07-09 13:20:32 UTC [25066-1] LOG:  started streaming WAL from primary at B160/F000000 on timeline 1
2019-07-09 13:20:32 UTC [25066-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000B1600000000F has already been removed
2019-07-09 13:20:33 UTC [25071-1] loadbalancer@ FATAL:  the database system is starting up

Sweet! While primary is responding with “I don’t have the WAL files you’re requesting anymore!”, it’s also failing to fetch the files from the configured archive (barman) for some reason.

After fixing the permission issue about barman, secondary server started to fetch WAL files from it:

2019-07-09 13:26:46 UTC [29161-4] LOG:  restored log file "000000010000B1600000000F" from archive

Yet, the delay between primary didn’t close. Something weird was going on…
Let’s check how many WAL files we’re behind. On the primary server:

postgres=# show max_wal_size ;
| max_wal_size |
| 50GB         |
(1 row)

postgres=# show wal_keep_segments ;
| wal_keep_segments |
| 10                |
(1 row)

So… We should have quite a bit of WAL files on this server. Let’s look at the size and count:

primary ~ ~> sudo du -sh /var/lib/postgresql/11/main/pg_wal
15G     /var/lib/postgresql/11/main/pg_wal
primary ~ ~> sudo ls /var/lib/postgresql/11/main/pg_wal | wc -l

Yeah.. We have a lot of those. So why is our secondary unable to close the gap?

The issue was WAL/second rate: I’ve discovered there is a new WAL file created on this server approximately each second.

A WAL file is 16Mb by default (you can change it on compile time) and it’s shipped to the archive once it’s filled (via configured archive_command).
That means, on this busy database, a 16Mb file produced and sent to the archive each second.
After ~20 minutes of standby downtime, primary dropped the WAL files from its pool (since they’re successfully archived) and secondary server was trying to fetch these files from barman. Which is bound to fail by definition, since this is done over network. Even though it’s a big server, downloading and processing 16Mb was taking about ~2 seconds.

So I’ve re-cloned the standby from primary (without using barman) and started it within 15 minutes :)

Even then, the gap wasn’t closing. WHAT. THE. F.

Turns out postgresql checks the archive server first (if it’s configured) and only asking to primary if this command fails.
There is no way to prevent that behavior yet (except disabling restore_command completely or failing it quickly). So I’ve disabled the restore command and primary happily provided requested WAL files within its reach.

They weren’t anticipating irrationally busy workloads like this I assume. We’ve notified postgres mailing list, but I don’t think this will be a quick fix :)