Thursday 3 October 2024

Failover Slots in PostgreSQL-17: Ensuring High Availability with Logical Replication

With the introduction of failover slots in PostgreSQL-17, logical replication can now be effectively utilized in high-availability environments. This feature ensures that logical replication continues seamlessly even if the publisher node fails and its physical standby is promoted as the new publisher.

To maintain continuous replication after a failover, you need to configure the following parameters and options for both logical and physical replication setups:

1. Enable Failover Property on Slots:

•  Set the failover property for the slots corresponding to the subscriptions that need to continue receiving changes after a failover. This can be done using the failover option in CREATE SUBSCRIPTION or the failover parameter in the pg_create_logical_replication_slot() SQL API. These slots will be referred to as failover slots.

2. Synchronize Replication Slots:

•  Set the sync_replication_slots parameter to on on the standby. This ensures that the failover slots are periodically synchronized by the slotsync worker.

3. Configure Physical Replication Slot:

•  Ensure a physical replication slot exists between the primary and the standby. Set primary_slot_name on the standby and ensure the slot with that name exists on the primary. Also, enable hot_standby_feedback on the standby to prevent row removal needed by the standby.

4. Prevent Subscriber Overrun:

•  On the primary, set the synchronized_standby_slots parameter to include the physical replication slot. This prevents the subscriber from consuming changes faster than the hot standby can handle.

5. Update Connection Info:

•  Modify primary_conninfo on the standby to include dbname=postgres. This is necessary for the slot sync worker to connect to the primary.

Example Setup

Here's an example demonstrating logical replication continuity after failover using three nodes: Node-A (primary server/publisher), Node-B (physical standby for Node-A), and Node-C (subscriber for Node-A).

Node-A (Publisher)
Initialize the publisher database cluster:
initdb -D <data_pub_path>

Start the server:
pg_ctl start -D <data_pub_path>

Set wal_level to logical.

Create a table and publication:
CREATE TABLE mytbl(c1 INT PRIMARY KEY);
CREATE PUBLICATION mypub FOR TABLE mytbl;

Node-C (Subscriber)
Initialize the subscriber database cluster:
initdb -D <data_sub_path>

Start the server:
pg_ctl start -o "-p 5444" -D <data_sub_path>

Create the table and a failover-enabled subscription:
CREATE TABLE mytbl(c1 INT PRIMARY KEY);
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres' PUBLICATION mypub WITH (failover = true);

Verify Replication

Node-A (Publisher):
SELECT slot_name, failover FROM pg_replication_slots;
 slot_name | failover
-----------+----------
 mysub     | t

INSERT INTO mytbl VALUES (generate_series(1, 2));

On Node-C (Subscriber):
SELECT * FROM mytbl;
 c1
----
  1
  2

Setting Up Physical Standby for Failover
Node-B (Physical Standby)
Create a base backup:
pg_basebackup -D <data_standby_path> -R -X stream -P

Start the standby server:
pg_ctl start -o "-p 5555" -D <data_standby_path>

Node-A (Publisher/Primary)
Create a physical replication slot:
SELECT pg_create_physical_replication_slot('sb1_slot');

Node-B (Physical Standby)
Set the following parameters:
ALTER SYSTEM SET sync_replication_slots = on;
ALTER SYSTEM SET primary_slot_name = 'sb1_slot';
ALTER SYSTEM SET hot_standby_feedback = on;
 
Restart the standby server and ensure the failover slot is synced:
SELECT slot_name, failover, synced FROM pg_replication_slots;
select slot_name, failover, synced from pg_replication_slots;
 slot_name | failover | synced
-----------+----------+--------
 mysub     | t        | t 

Testing Failover
Stop the primary server (Node-A):
pg_ctl stop -D <data_pub_path>

Promote the standby server (Node-B):
pg_ctl promote -D <data_standby_path>

Update Subscription on Node-C (Subscriber)
Disable the subscription:
ALTER SUBSCRIPTION mysub DISABLE;

Change the connection to the new primary:
ALTER SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5555';
 
Enable the subscription:
ALTER SUBSCRIPTION mysub ENABLE;

Verify Continuity

On Node-B (New Primary):
INSERT INTO mytbl VALUES (generate_series(3, 4));

On Node-C (Subscriber):
SELECT * FROM mytbl;
 c1
----
  1
  2
  3
  4

Under the Hood
Slot synchronization on a physical standby works by fetching logical failover slot information from the primary server, creating the slots on the standby, and synchronizing them periodically. This can be done automatically by enabling the slot sync worker or manually using the pg_sync_replication_slots() function.

If the WAL corresponding to the remote's restart_lsn is unavailable or the remote's catalog_xmin precedes the "oldest xid," the local standby slot cannot be created, and the slot will be marked as RS_TEMPORARY. Once the primary server catches up, the slot will be marked as RS_PERSISTENT.

The slot sync worker waits before the next synchronization, with the duration varying based on whether any slots were updated during the last cycle. Standby synchronized slots will be dropped if they no longer need synchronization.