Monday 25 September 2023

Evolution of Logical Replication

This blog post is about how Logical Replication has evolved over the years in PostgreSQL, what is in the latest release (16), and what is being worked on for future releases. Logical replication is a method of replicating data objects and their changes based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. To know more about Logical Replication in PostgreSQL, read pgsql-docs.

The foundation of Logical Replication was laid in PostgreSQL-9.4, where we introduced Logical Decoding (which allows database changes to be streamed out in a customizable format), Replication Slots (which allow resources like WAL files to be retained on the primary until they are no longer needed by standby servers), and Replica Identity (a table-level option to control operations for logical replication).

With PostgreSQL-9.5, we introduced features like tracking replication progress via origins, and commit timestamps, which will allow conflict resolution in the future. Replication origins allow replication to restart from the point where it was before the restart. They also help us avoid bi-directional loops in the replication setup, a feature introduced in the latest PostgreSQL 16 release.

With PostgreSQL-9.6, we added support for generic WAL messages for logical decoding. This feature allows extensions to insert data into the WAL stream that can be read by logical-decoding plugins.
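For instance, any session (or an extension) can write such a message with pg_logical_emit_message(); the prefix and payload below are arbitrary illustrative values:

SELECT pg_logical_emit_message(true, 'my_prefix', 'payload visible to logical-decoding plugins');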

With PostgreSQL-10.0, we introduced native Logical Replication using the publish/subscribe model. This allows more flexibility than physical replication does, including replication between different major versions of PostgreSQL and selective replication. In this release, we allowed Insert/Update/Delete operations on tables to be replicated. Going forward from here, we have enhanced this feature with each release.

With PostgreSQL-11.0, we allowed replication of the Truncate operation. Logical decoding started to use a generational memory allocator, which is optimized for serial allocation/deallocation and reduces the memory used for decoding.

With PostgreSQL-12.0, we allowed logical replication slots to be copied. This functionality is helpful when investigating logical replication issues, and when you want to consume changes with a different output plugin.
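For illustration, an existing slot can be copied with pg_copy_logical_replication_slot(); the slot names and output plugin below are illustrative:

SELECT pg_copy_logical_replication_slot('slot_1', 'slot_1_copy', false, 'test_decoding');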

With PostgreSQL-13.0, we introduced replication of partitioned tables, which allows all of a partitioned table's partitions to be replicated; prior to this, individual partitions needed to be published. We also allowed the amount of memory used for logical decoding to be controlled via the GUC logical_decoding_work_mem. If you want to learn more about this parameter, read blog. We also allowed the WAL storage for replication slots to be limited by the GUC max_slot_wal_keep_size.
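As a rough illustration, both parameters can be set like any other GUC (the values below are arbitrary examples, not recommendations):

ALTER SYSTEM SET logical_decoding_work_mem = '128MB';
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();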

With PostgreSQL-14.0, we introduced streaming of in-progress transactions, which reduced the apply lag for large transactions. We also allowed decoding of prepared transactions, which means output plugins can consume changes at prepare time instead of waiting until commit, helping to reduce the lag. We also reduced the CPU usage and improved the decoding performance of transactions containing a lot of DDL. There is a long list of features introduced in 14.0; you can read blog for more details.
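For in-core logical replication, streaming of in-progress transactions is enabled per subscription; for example (connection string and object names are illustrative):

CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres' PUBLICATION mypub WITH (streaming = on);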

With PostgreSQL-15.0, we allowed replication of prepared transactions for in-core logical replication, which, apart from reducing apply lag, forms the basis for building conflict-free logical replication. Users can now use row filters and column lists to send selective table data. We also allowed all the tables in a schema to be published. You can read blog to learn more about the logical replication features introduced in 15.0.
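For example, a publication can combine a column list with a row filter, or publish an entire schema (the table, column, and schema names are illustrative, and order_id is assumed to be the primary key):

CREATE PUBLICATION pub_orders FOR TABLE orders (order_id, status) WHERE (order_id > 1000);
CREATE PUBLICATION pub_sales FOR TABLES IN SCHEMA sales;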

Now let's discuss the features introduced in the latest PostgreSQL-16.0 release.

Prevent loops in bi-directional replication

Today, we can set up bi-directional replication, but such a setup has problems: conflicts can occur, and it can lead to bi-directional loops. This feature intends to solve the second problem. Let me share a simple example to show how loops can occur.

Publisher
CREATE TABLE mytbl(c1 int primary key);
CREATE PUBLICATION mypub FOR TABLE mytbl;

Subscriber
CREATE TABLE mytbl(c1 int primary key);
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres' PUBLICATION mypub;
CREATE PUBLICATION mypub FOR TABLE mytbl;

Publisher
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5444' PUBLICATION mypub;
INSERT INTO mytbl VALUES(1);

Publisher's server LOG:
ERROR:  duplicate key value violates unique constraint "mytbl_pkey"
DETAIL:  Key (c1)=(1) already exists.

We see this error because the subscriber node, after applying the change, sent it back to the publisher. Now, if there were no primary key defined on mytbl, we would have inserted the same row on the publisher again and sent it to the subscriber again. So, this would have led to an infinite loop between the publisher and the subscriber.

We have introduced origin filtering, which avoids such a problem. In the above example, users need to execute the following statement on the publisher to avoid the loop.

ALTER SUBSCRIPTION mysub SET(origin=none);

The following is the syntax to use this feature:

CREATE SUBSCRIPTION sub1 CONNECTION ... PUBLICATION pub1 WITH (origin = none);

The valid values are 'none' and 'any', with the latter as the default. Setting the origin to 'none' means that the subscription will request the publisher to only send changes that don't have an origin. Setting the origin to 'any' means that the publisher sends changes regardless of their origin. For more information on this feature, you can read blogs [1][2].

Allowed logical decoding to be performed from the standby server

This requires wal_level = logical on both primary and standby. Let us see with an example:

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

The above is to show that the following statements are executed on the standby.

postgres=# select * from pg_create_logical_replication_slot('slot_1', 'test_decoding', false, false);
 slot_name |    lsn
-----------+-----------
 slot_1    | 0/50001A0
(1 row)
postgres=# SELECT * FROM pg_logical_slot_get_changes('slot_1', NULL, NULL);
    lsn    | xid |                  data
-----------+-----+----------------------------------------
 0/5000250 | 734 | BEGIN 734
 0/5000250 | 734 | table public.t1: INSERT: c1[integer]:3
 0/5000348 | 734 | COMMIT 734
(3 rows)

This ability can be used for workload distribution, by allowing subscribers to subscribe from the standby when the primary is busy. We need to ensure that we can't decode from slots when the required data has been removed on the primary. So, we invalidate logical slots on the standby (a) when the required rows are removed on the primary, and (b) when the wal_level on the primary server is reduced below logical. We can check the 'conflicting' field in pg_replication_slots to know if a slot has been invalidated due to conflict.
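For example, the following query, run on the standby, shows whether any slot has been invalidated due to such a conflict:

SELECT slot_name, conflicting FROM pg_replication_slots;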

Perform operations with table owner's permission

The apply process can be configured to perform operations with the table owner's privileges instead of the subscription owner's privileges. This is configured by setting the option 'run_as_owner' to 'false'; the following syntax can be used:

CREATE SUBSCRIPTION mysub CONNECTION … PUBLICATION mypub WITH (run_as_owner = false);

The subscription owner needs to be able to SET ROLE to each role that owns a replicated table. If the table owner doesn't have permission to SET ROLE to the subscription owner, SECURITY_RESTRICTED_OPERATION is imposed.

If the subscription has been configured with run_as_owner=true, then no user switching will occur. This also means that any user who owns a table into which replication is happening can execute arbitrary code with the privileges of the subscription owner.

The default value of 'run_as_owner' is false, which is generally more secure.

Non-superusers can create subscriptions

For that, non-superusers must have been granted the pg_create_subscription role and are required to specify a password for authentication. They must additionally have CREATE permission on the database in which the subscription is to be created.

A very basic example to show the above steps:

postgres=# create user u1;
CREATE ROLE
postgres=# Grant pg_create_subscription to u1;
GRANT ROLE
postgres=# Grant Create on database postgres to u1;
GRANT
postgres=> set session authorization u1;
SET
postgres=> create subscription sub1 connection 'dbname=postgres password=p1' publication pub1 with (connect=false);
WARNING:  subscription was created, but is not connected
HINT:  To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.
CREATE SUBSCRIPTION

Note that superusers can set password_required = false for subscriptions owned by non-superusers.
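For example, a superuser could relax this requirement for an existing subscription owned by a non-superuser (the subscription name below is illustrative):

ALTER SUBSCRIPTION sub1 SET (password_required = false);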

Large transactions can be applied in parallel

CREATE SUBSCRIPTION mysub CONNECTION …  PUBLICATION mypub WITH (streaming = parallel);

Performance improvement in the range of 25-40% has been observed (for further details, check here).

Each large transaction is assigned to one of the available workers, which reduces the lag by applying changes as they arrive instead of waiting until the whole transaction is received by the subscriber. The worker remains assigned until the transaction is completed. This preserves commit ordering and avoids file I/O in most cases, although we still need to spill to a file if there is no worker available.

It is important to maintain commit order to avoid failures due to: (a) transaction dependencies - say, if we insert a row in the first transaction and update it in a second transaction on the publisher, then allowing the subscriber to apply both in parallel can lead to failure of the update; (b) deadlocks - allowing transactions that update the same set of rows/tables in the opposite order to be applied in parallel can lead to deadlocks.

max_parallel_apply_workers_per_subscription sets the maximum number of parallel apply workers per subscription.
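This is set on the subscriber; for example (the value below is illustrative; the default is 2):

ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 4;
SELECT pg_reload_conf();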

Logical replication can copy tables in binary format

CREATE SUBSCRIPTION mysub CONNECTION … PUBLICATION mypub WITH (binary = true);

Copying tables in binary format may reduce the time spent, depending on column types. Prior to v16, this option only applied to replicating data changes in binary format; the initial table copy was always performed in text format. A binary copy is supported only when both publisher and subscriber are v16 or later.

Indexes other than PK and REPLICA IDENTITY can be used on the subscriber

Using REPLICA IDENTITY FULL on the publisher can lead to a full table scan per tuple change on the subscriber when no REPLICA IDENTITY or PK index is available. This feature allows other suitable indexes on the subscriber to be considered.

The index that can be used must be a btree index, not a partial index, and the leftmost field must be a column (not an expression) that references the remote relation column. These limitations help to keep the index scan similar to PK/RI index scans.

The performance improvement is proportional to the amount of data in the table.
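A rough sketch of such a setup (table, column, and index names are illustrative): the publisher uses REPLICA IDENTITY FULL, and the subscriber creates an ordinary btree index that the apply worker can then use for UPDATE/DELETE lookups:

-- On the publisher (table without a usable PK/replica identity index):
ALTER TABLE mytab REPLICA IDENTITY FULL;
-- On the subscriber (same table definition):
CREATE INDEX mytab_c1_idx ON mytab (c1);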

The last topic I wanted to cover in this blog is the set of features in this area that are being worked on or discussed in the community for future releases.

What's next:

1. Upgrade of logical replication nodes. This will allow a seamless upgrade of logical replication nodes. One of the biggest challenges as of now is that we don't migrate slots across versions, due to which users need to rebuild most of the setup again. See one of the examples in blog.

2. Synchronization of replication slots to allow failover. This will synchronize logical slots from the publisher to its physical standbys. With this feature, if the publisher goes down, we can promote its physical standby, and the subscribers can connect to it and continue working.

3. Replication of sequences.

4. DDL Replication. This is a relatively large feature, but it is one of the most requested features, as it would help allow seamless major version upgrades.