PostgreSQL and Databases in general: 2021

Wednesday, 15 September 2021

Logical Replication Improvements In PostgreSQL-14

In the upcoming release of PostgreSQL-14, we will see multiple enhancements in Logical Replication which I hope will further increase its usage. This blog is primarily to summarize and briefly explain all the enhancements in Logical Replication.

Decoding of large transactions:

Allow streaming large in-progress transactions to subscribers. Before PostgreSQL-14, the transactions were streamed only at commit time which leads to a large apply lag for large transactions. With this feature, we will see apply lag to be reduced, and in certain scenarios that will lead to a big performance win. I have explained this feature in detail in my previous blog.

Performance of logical decoding:

Reduced the CPU usage and improve decoding performance of transactions having a lot of DDLs. It has been observed that decoding of a transaction containing truncation of a table with 1000 partitions would be finished in 1s whereas before this work it used to take 4-5 minutes. Before explaining, how we have achieved this performance gain, let me briefly tell what an invalidation message is in PostgreSQL as that is important to understand this optimization. These are messages to flush invisible system cache entries in each backend session. We normally execute these at the command end in the backend which generated them and send them at the transaction end via a shared queue to other backends for processing. These are normally generated for insert/delete/update operations on system catalogs which happens for DDL operations.

While decoding we use to execute all the invalidations of an entire transaction at each command end as we had no way of knowing which invalidations happened before that command. Due to this, transactions involving large amounts of DDLs use to take more time and also lead to high CPU usage. But now we know specific invalidations at each command end so we execute only required invalidations. This work has been accomplished by commit d7eb52d718.

Initial table sync:

The initial table synchronization involves copying the initial snapshot of the table by the table sync worker and then the table is brought up to a synchronized state with the main apply worker. This whole work use to be done in a single transaction using a temporary replication slot which has major drawbacks: (a) The slot will hold the WAL till the entire sync is complete. (b) Any error during the sync phase will rollback the entire copy which is painful for large copies. (c) There is a risk of exceeding the CID limit.

We did below improvements to overcome the drawbacks:
Allowed multiple transactions in tablesync phase.
Used permanent slots and origins to track the progress of tablesync.

This work has been explained in detail in the blog. This work has been accomplished by commit ce0fdbfe97.

Logical decoding of two-phase commits:

This will allow us to decode the transactions at prepare time and send the same to the output plugin instead of doing it at commit time. This will allow the plugins to decipher the transaction at prepare time and route it to another node if required. This has two advantages (a) allows two-phase distributed transactions across multiple nodes via logical replication, (b) reduces the apply-lag by sending and replaying the transaction on another node at prepare time.

This work has been explained in detail in the blog. This work has been accomplished by commits [1][2][3].

Monitor logical decoding:

Replication slots are used to keep state about replication streams originating from this cluster. Their primary purpose is to prevent the premature removal of WAL. In the context of logical replication, a slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database.

We have added a system view pg_stat_replication_slots to report replication slot activity. This can be used to monitor the amount of data streamed to output plugin or subscriber, spilled to disk. Additionally user can monitor total amount of transaction data decoded for sending transactions to the decoding output plugin while decoding changes from WAL for this slot. Note that this includes data that is streamed and/or spilled. The function pg_stat_reset_replication_slot() resets slot statistics.

Example:

CREATE TABLE stats_test(data text);

SET logical_decoding_work_mem to '64kB';

SELECT 'init' FROM pg_create_logical_replication_slot('slot_stats', 'test_decoding');

INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);

SELECT count(*) FROM pg_logical_slot_peek_changes('slot_stats', NULL, NULL, 'skip-empty-xacts', '1');

SELECT slot_name, spill_txns, spill_count, spill_bytes, total_txns, total_bytes FROM pg_stat_replication_slots;

------------+------------+-------------+-------------+------------+-------------

slot_stats | 1 | 12 | 763893 | 1 | 763893
(1 row)

DROP TABLE stats_test;

SELECT pg_drop_replication_slot('slot_stats');

Allow publications to be easily added and removed:

Currently, if the user needs to add/remove additional publications to a subscription, she needs to mention all the existing publications along with it. Consider a case where a subscription is subscribed to two publications and we want to add an additional publication to it then the user needs to mention all the three (two previous and one new) while doing Alter Subscription. The same is explained with an example below:

Initial Subscription
CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION mypub1, mypub2;

Add a new Publication mypub3.
ALTER SUBSCRIPTION mysub SET PUBLICATION mypub1, mypub2, mypub3;

This could be inconvenient for users especially if there are many existing publications to which a subscription is subscribed. We have added a new way to make this easier by supporting ADD/DROP individual publications. See docs for the syntax. With the new way, in the above case, to add a new publication, the user needs to perform
ALTER SUBSCRIPTION mysub ADD mypub3;

This work has been accomplished by commit 82ed7748b7.

Binary transfer mode:

This feature provides an option during Create/Alter Subscription to allow data from publishers to be sent in binary format. The default value of this option is false. Even when this option is enabled, only data types that have binary send and receive functions will be transferred in binary. When doing cross-version replication, if the subscriber lacks a binary receive function for the type, the data transfer will fail, and this option can't be used. This mode is generally faster. Example to enable binary mode:

CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION mypub1 WITH (binary = true);

This work has been accomplished by commit 9de77b5453.

Allow to get messages via pgoutput:

Provide a “messages” option to the pgoutput plugin. This allows logical decoding messages (i.e. generated via pg_logical_emit_message) to be sent to the slot consumer. This is useful for pgoutput plugin users that use it for Change Data Capture. An example of the same is given below:

SELECT pg_create_logical_replication_slot('pgout_slot','pgoutput');

SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')

SELECT get_byte(data, 1), encode(substr(data, 24, 23), 'escape') FROM pg_logical_slot_peek_binary_changes('pgout_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pgout_slot', 'messages', 'true') OFFSET 1 LIMIT 1;

get_byte | encode
----------+-------------------------
1 | a transactional message
(1 row)

SELECT pg_drop_replication_slot('pgout_slot');

While getting changes, the publication_names is not required for logical decoding messages but is specified just so that function doesn't give an error. You can refer to logical replication message formats in the PostgreSQL docs. This work has been accomplished by commit ac4645c015.

Saturday, 17 July 2021

Logical Replication Of In-Progress Transactions

Logical Replication was introduced in PostgreSQL-10 and since then it is being improved with each version. Logical Replication is a method to replicate the data selectively unlike physical replication where the data of the entire cluster is copied. This can be used to build a multi-master or bi-directional replication solution. One of the main differences as compared with physical replication was that it allows replicating the transaction only at commit time. This leads to apply lag for large transactions where we need to wait to transfer the data till the transaction is finished. In the upcoming PostgreSQL-14 release, we are introducing a mechanism to stream the large in-progress transactions. We have seen the replication performance improved by 2 or more times due to this for large transactions especially due to early filtering. See the performance test results reported on hackers and in another blog on the same topic. This will reduce the apply lag to a good degree.

The first thing we need for this feature was to decide when to start streaming the WAL content. One could think if we have such a technology why not stream each change of transaction separately as and when we retrieve it from WAL but that would actually lead to sending much more data across the network because we need to send some additional transaction information with each change so that the apply-side can recognize the transaction to which the change belongs. To address this, in PostgreSQL-13, we have introduced a new GUC parameter logical_decoding_work_mem which allows users to specify the maximum amount of memory to be used by logical decoding, before which some of the decoded changes are either written to local disk or stream to the subscriber. The parameter is also used to control the memory used by logical decoding as explained in the blog.

The next thing that prevents incremental decoding was the delay in finding the association of subtransaction and top-level XID. During logical decoding, we accumulate all changes along with its (sub)transaction. Now, while sending the changes to the output plugin or stream to the other node, we need to combine all the changes that happened in the transaction which requires us to find the association of each top-level transaction with its subtransactions. Before PostgreSQL-14, we build this association at XLOG_XACT_ASSIGNMENT WAL record which we normally log after 64 subtransactions or at commit time because these are the only two times when we get such an association in the WAL. To find this association as it happened, we now also write the assignment info into WAL immediately, as part of the first WAL record for each subtransaction. This is done only when wal_level=logical to minimize the overhead.

Yet, another thing that is required for incremental decoding was to process invalidations at each command end. The basic idea of invalidations is that they make the caches (like relation cache) up-to-date to allow the next command to use up-to-date schema. This was required to correctly decode WAL incrementally as while decoding we will use the relation attributes from the caches. For this, when wal_level=logical, we write invalidations at the command end into WAL so that decoding can use this information. The invalidations are decoded and accumulated in top-transaction, and then executed during replay. This obviates the need to decode the invalidations as part of a commit record.

In previous paragraphs, the enhancements required in the server infrastructure to allow incremental decoding are explained. The next step was to provide APIs (stream methods) for out-of-core logical replication to stream large in-progress transactions. We added seven methods to the output plugin API to allow this. Those are: (stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb and stream_change_cb) and two optional callbacks (stream_message_cb and stream_truncate_cb). For details about these APIs, refer to PostgreSQL docs.

When streaming an in-progress transaction, the changes (and messages) are streamed in blocks demarcated by stream_start_cb and stream_stop_cb callbacks. Once all the decoded changes are transmitted, the transaction can be committed using the stream_commit_cb callback (or possibly aborted using the stream_abort_cb callback). One example sequence of streaming transaction may look like the following:

/* Change logical_decoding_work_mem to 64kB in the session */
postgres=# show logical_decoding_work_mem;
logical_decoding_work_mem
---------------------------
64kB
(1 row)
postgres=# CREATE TABLE stream_test(data text);
CREATE TABLE

postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
?column?
----------
init
(1 row)
postgres=# INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 500) g(i);
INSERT 0 500
postgres=# SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '1', 'skip-empty-xacts', '1', 'stream-changes', '1');
data
--------------------------------------------------
opening a streamed block for transaction TXN 741
streaming change for TXN 741
streaming change for TXN 741
streaming change for TXN 741
...
...
streaming change for TXN 741
streaming change for TXN 741
streaming change for TXN 741
closing a streamed block for transaction TXN 741
opening a streamed block for transaction TXN 741
streaming change for TXN 741
streaming change for TXN 741
streaming change for TXN 741
...
...
streaming change for TXN 741
streaming change for TXN 741
closing a streamed block for transaction TXN 741
committing streamed transaction TXN 741
(505 rows)

The actual sequence of callback calls may be more complicated depending on the server operations. There may be blocks for multiple streamed transactions, some of the transactions may get aborted, etc.

Note that streaming is triggered when the total amount of changes decoded from the WAL (for all in-progress transactions) exceeds the limit defined by the logical_decoding_work_mem setting. At that point, the largest top-level transaction (measured by the amount of memory currently used for decoded changes) is selected and streamed. However, in some cases we still have to spill to disk even if streaming is enabled because we exceed the memory threshold but still have not decoded the complete tuple e.g., only decoded toast table insert but not the main table insert or decoded speculative insert but not the corresponding confirm record. However, as soon as we get the complete tuple we stream the transaction including the serialized changes.

While streaming in-progress transactions, the concurrent aborts may cause failures when the output plugin (or decoding of WAL records) consults catalogs (both system and user-defined). Let me explain this with an example, suppose there is one catalog tuple with (xmin: 500, xmax: 0). Now, the transaction 501 updates the catalog tuple and after that we will have two tuples (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0). Now, if 501 is aborted and some other transaction say 502 updates the same catalog tuple then the first tuple will be changed to (xmin: 500, xmax: 502). So, the problem is that when we try to decode the tuple inserted/updated in 501 after the catalog update, we will see the catalog tuple with (xmin: 500, xmax: 502) as visible because it will consider that the tuple is deleted by xid 502 which is not visible to our snapshot. And when we will try to decode with that catalog tuple, it can lead to a wrong result or a crash. So, it is necessary to detect concurrent aborts to allow streaming of in-progress transactions. For detecting the concurrent abort, during catalog scan we can check the status of the xid and if it is aborted we will report a specific error so that we can stop streaming current transaction and discard the already streamed changes on such an error. We might have already streamed some of the changes for the aborted (sub)transaction, but that is fine because when we decode the abort we will stream the abort message to truncate the changes in the subscriber.

To add support for streaming of in-progress transactions into the built-in logical replication, we need to primarily do four things:

(a) Extend the logical replication protocol to identify in-progress transactions, and allow adding additional bits of information (e.g. XID of subtransactions). Refer to PostgreSQL docs for the protocol details.

(b) Modify the output plugin (pgoutput) to implement the new stream API callbacks, by leveraging the extended replication protocol.

(c) Modify the replication apply worker, to properly handle streamed in-progress transaction by spilling the data to disk and then replaying them on commit.

(d) Provide a new option for streaming while creating a subscription.

The below example demonstrates how to set up the streaming via built-in logical replication:

Publisher node:

Set logical_decoding_work_mem = '64kB';
# Set up publication with some initial data

CREATE TABLE test_tab (a int primary key, b varchar);
INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar');
CREATE PUBLICATION tap_pub FOR TABLE test_tab;

Subscriber node:

CREATE TABLE test_tab (a int primary key, b varchar);
CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION tap_pub WITH (streaming = on);

Publisher Node:

# Ensure the corresponding replication slot is created on publisher node
select slot_name, plugin, slot_type from pg_replication_slots;
slot_name | plugin | slot_type
-----------+----------+-----------
tap_sub | pgoutput | logical
(1 row)

# Confirm there is no streamed bytes yet
postgres=# SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;
slot_name | stream_txns | stream_count | stream_bytes
-----------+-------------+--------------+--------------
tap_sub | 0 | 0 | 0
(1 row)

# Insert, update and delete enough rows to exceed the logical_decoding_work_mem (64kB) limit.
BEGIN;
INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
DELETE FROM test_tab WHERE mod(a,3) = 0;

# Confirm that streaming happened
SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;
slot_name | stream_txns | stream_count | stream_bytes
-----------+-------------+--------------+--------------
tap_sub | 1 | 22 | 1444410
(1 row)

Subscriber Node:
# The streamed data is still not visible.
select * from test_tab;
a | b
---+-----
1 | foo
2 | bar
(2 rows)

Publisher Node:
# Commit the large transactions
Commit;

Subscriber Node:
# The data must be visible on the subscriber
select count(*) from test_tab;
count
-------
3334
(1 row)

This feature was proposed in 2017 and committed in 2020 as part of various commits 0bead9af48, c55040ccd0, 45fdc9738b, 7259736a6e, and 464824323e. It took a long time to complete this feature because of the various infrastructure pieces required to achieve this. I would really like to thank all the people involved in this feature especially Tomas Vondra who has initially proposed it and then Dilip Kumar who along with me had completed various remaining parts and made it a reality. Then also to other people like Neha Sharma, Mahendra Singh Thalor, Ajin Cherian, and Kuntal Ghosh who helped throughout the project to do reviews and various tests. Also, special thanks to Andres Freund and other community members who have suggested solutions to some of the key problems of this feature. Last but not least, thanks to EDB and Fujitsu's management who encouraged me and some of the other members to work on this feature.