PostgreSQL and Databases in general | Amit Kapila (http://www.blogger.com/profile/01948926447381079550)

Evolution of Logical Replication (2023-09-25)<p>This blog post is about how Logical Replication has evolved over the years in PostgreSQL, what's in the latest <a href="https://www.postgresql.org/docs/16/release-16.html" target="_blank">release-16</a>, and what's being worked upon for future releases. Logical replication is a method of replicating data objects and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. To know more about Logical Replication in PostgreSQL, read the <a href="https://www.postgresql.org/docs/current/logical-replication.html" target="_blank">pgsql-docs</a>.</p><p>The foundation of Logical Replication was laid in PostgreSQL-9.4, where we introduced Logical Decoding (which allows database changes to be streamed out in a customizable format), Replication Slots (which allow preservation of resources like WAL files on the primary until they are no longer needed by standby servers), and Replica Identity (a table-level option to control operations for logical replication).</p><p>With PostgreSQL-9.5, we introduced features like tracking replication progress via origins, and commit timestamps that will allow conflict resolution in the future. Replication origins allow replication to restart from the point where it was before the restart. 
The replication origins also help us avoid bi-directional loops in the replication setup, a feature introduced in the latest PostgreSQL 16 release.</p><p>With PostgreSQL-9.6, we added support for generic WAL messages for logical decoding. This feature allows extensions to insert data into the WAL stream that can be read by logical-decoding plugins.</p><p>With PostgreSQL-10.0, we introduced native Logical Replication using the publish/subscribe model. This allows more flexibility than physical replication does, including replication between different major versions of PostgreSQL and selective replication. In this release, we allowed Insert/Update/Delete operations on tables to be replicated. Going forward from here, we have enhanced this feature with each release.</p><p>With PostgreSQL-11.0, we allowed replication of the Truncate operation. Logical decoding started to use a generational memory allocator, which is optimized for serial allocation/deallocation and reduces the memory used for decoding.</p><p>With PostgreSQL-12.0, we allowed logical replication slots to be copied. This functionality is helpful when investigating logical replication issues and when one wants to select a different output plugin to consume changes.</p><p>With PostgreSQL-13.0, we introduced replication of partitioned tables, which allows all of a partitioned table's partitions to be replicated; prior to this, individual partitions had to be replicated separately. Then we added the ability to control the amount of memory used for logical decoding via the GUC <a href="https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-LOGICAL-DECODING-WORK-MEM" target="_blank">logical_decoding_work_mem</a>. If you want to learn more about this parameter, read this <a href="https://www.enterprisedb.com/postgres-tutorials/postgres-13-logicaldecodingworkmem-and-how-it-saves-your-server-going-out-memory" target="_blank">blog</a>. 
Then we also allowed the WAL retained for replication slots to be limited by the GUC <a href="https://www.postgresql.org/docs/13/runtime-config-replication.html#GUC-MAX-SLOT-WAL-KEEP-SIZE" target="_blank">max_slot_wal_keep_size</a>.</p><p>With PostgreSQL-14.0, we introduced <a href="http://amitkapila16.blogspot.com/2021/07/logical-replication-of-in-progress.html" target="_blank">streaming of in-progress transactions</a>, which reduced the apply lag for large transactions. Then we allowed decoding of prepared transactions, which means output plugins can consume changes at prepare time instead of waiting till commit, helping reduce the lag. We also reduced the CPU usage and improved the decoding performance of transactions having a lot of DDLs. There is a long list of features introduced in 14.0; you can read this <a href="http://amitkapila16.blogspot.com/2021/09/logical-replication-improvements-in.html" target="_blank">blog</a> for more details.</p><p>With PostgreSQL-15.0, we allowed replication of prepared transactions for in-core logical replication, which, apart from reducing apply lag, forms the basis for building conflict-free logical replication. Users can now use row filters and column lists to send selective table data. We also allowed all the tables in a schema to be published. You can read this <a href="http://amitkapila16.blogspot.com/2022/11/logical-replication-improvements-in.html" target="_blank">blog</a> to learn more about the logical replication features introduced in 15.0.</p><p>Now let's discuss the features introduced in the latest PostgreSQL-16.0 release.</p><p><i><u><b>Prevent loops in bi-directional replication</b></u></i></p><p>Today, users can set up bi-directional replication, but such a setup has problems: conflicts can occur, and it can lead to bi-directional loops. This feature intends to solve the second problem. 
Let me share a simple example to show how loops can occur.</p><div style="text-align: left;">Publisher<br />CREATE TABLE mytbl(c1 int primary key);<br />CREATE PUBLICATION mypub FOR TABLE mytbl;</div><div style="text-align: left;"><br />Subscriber<br />CREATE TABLE mytbl(c1 int primary key);<br />CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres' PUBLICATION mypub;<br />CREATE PUBLICATION mypub FOR TABLE mytbl;<br /><br /></div><div style="text-align: left;">Publisher<br />CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5444' PUBLICATION mypub;<br />INSERT INTO mytbl VALUES(1);<br /><br /></div><div style="text-align: left;">Publisher's server LOG:<br />ERROR: duplicate key value violates unique constraint "mytbl_pkey"<br />DETAIL: Key (c1)=(1) already exists.</div><p>We see this error because the subscriber node, after applying the change, sent it back to the publisher. Now, if there were no primary key defined on mytbl, we would have again inserted the same row on the publisher and sent it again to the subscriber. So, this would have led to an infinite loop between publisher and subscriber.</p><p>We have introduced origin filtering, which avoids such a problem. Users need to execute the following statement on the publisher in the above example to avoid the loop:</p><p>ALTER SUBSCRIPTION mysub SET (origin = none);</p><p>The following is the syntax to use this feature:</p><p>CREATE SUBSCRIPTION sub1 CONNECTION ... PUBLICATION pub1 WITH (origin = none);</p><p>The valid values are 'none' and 'any', with the latter as the default. Setting the origin to 'none' means that the subscription will request the publisher to only send changes that don't have an origin. Setting the origin to 'any' means that the publisher sends changes regardless of their origin. 
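The loop-breaking effect of origin filtering can be sketched in a few lines of Python. This is an illustrative simulation of the idea, not PostgreSQL code: every change a node applies from elsewhere is tagged with its origin, and a publisher serving a subscription with origin = none simply skips changes that carry an origin tag.

```python
# Illustrative sketch (not PostgreSQL internals): changes applied from a
# remote node carry an origin tag; origin='none' filters them out on send.

def changes_to_send(log, origin_filter):
    """Return the changes a publisher would send to a subscriber.
    'none' sends only locally originated changes; 'any' (the default)
    sends everything."""
    if origin_filter == "any":
        return list(log)
    return [c for c in log if c["origin"] is None]

# Node A does a local insert; node B applies it, recording A as the origin.
node_a_log = [{"op": "INSERT", "key": 1, "origin": None}]
node_b_log = [{"op": "INSERT", "key": 1, "origin": "node_a"}]

# With origin=any, B sends the row straight back to A: a loop.
assert len(changes_to_send(node_b_log, "any")) == 1

# With origin=none, B sends nothing back, because the change was not
# generated locally on B: the loop is broken.
assert changes_to_send(node_b_log, "none") == []
```

Note that B's own local writes (origin None in this sketch) would still be sent, which is exactly what a bi-directional setup needs.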
For more information on this feature, you can read blogs [<a href="https://www.postgresql.fastware.com/blog/bi-directional-replication-using-origin-filtering-in-postgresql" target="_blank">1</a>][<a href="https://www.crunchydata.com/blog/active-active-postgres-16" target="_blank">2</a>].</p><p><b><i><u>Allowed logical decoding to be performed from the standby server</u></i></b></p><p> This requires wal_level = logical on both primary and standby. Let us see an example:</p><div style="text-align: left;">postgres=# select pg_is_in_recovery();<br /> pg_is_in_recovery<br />-------------------<br /> t<br />(1 row)</div><div style="text-align: left;">The above is to show that the following statements are executed on the standby.</div><div style="text-align: left;"><br />postgres=# select * from pg_create_logical_replication_slot('slot_1', 'test_decoding', false, false);<br /> slot_name | lsn<br />-----------+-----------<br /> slot_1 | 0/50001A0<br />(1 row)</div><div style="text-align: left;">postgres=# SELECT * FROM pg_logical_slot_get_changes('slot_1', NULL, NULL);<br /> lsn | xid | data<br />-----------+-----+----------------------------------------<br /> 0/5000250 | 734 | BEGIN 734<br /> 0/5000250 | 734 | table public.t1: INSERT: c1[integer]:3<br /> 0/5000348 | 734 | COMMIT 734<br />(3 rows)</div><p>This ability can be used for workload distribution, by allowing subscribers to subscribe from the standby when the primary is busy. We need to ensure that we can't decode from slots when the required data has been removed on the primary. So, we invalidate logical slots on the standby (a) when the required rows are removed on the primary, and (b) when the wal_level on the primary server is reduced to below logical. 
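The two invalidation rules can be condensed into a small decision function. The sketch below is a hypothetical helper for illustration only (the names and the xmin-based check are assumptions, not the server's actual bookkeeping); it just encodes the two conditions under which a standby slot becomes unusable.

```python
# Illustrative sketch of the two invalidation rules described above
# (hypothetical helper, not PostgreSQL code): a logical slot on a standby
# conflicts when the primary removes rows the slot still needs, or when
# the primary's wal_level drops below logical.

WAL_LEVELS = ["minimal", "replica", "logical"]

def slot_conflicting(slot_needs_xmin, primary_removed_upto_xmin, primary_wal_level):
    # (a) required rows were removed on the primary
    if primary_removed_upto_xmin >= slot_needs_xmin:
        return True
    # (b) wal_level on the primary was reduced to below logical
    if WAL_LEVELS.index(primary_wal_level) < WAL_LEVELS.index("logical"):
        return True
    return False

assert slot_conflicting(700, 650, "logical") is False  # slot still usable
assert slot_conflicting(700, 734, "logical") is True   # needed rows removed
assert slot_conflicting(700, 650, "replica") is True   # wal_level lowered
```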
We can check the 'conflicting' field in pg_replication_slots to know if the slot has been invalidated due to a conflict.</p><p><b><i><u>Perform operations with table owner's permission</u></i></b></p><p>The apply process can be configured to perform operations with the table owner's privileges instead of the subscription owner's privileges. This is configured by setting the option 'run_as_owner' to 'false', with the following syntax:</p><p>CREATE SUBSCRIPTION mysub CONNECTION … PUBLICATION mypub WITH (run_as_owner = false);</p><p>The subscription owner needs to be able to SET ROLE to each role that owns a replicated table. If the table owner doesn't have permission to SET ROLE to the subscription owner, SECURITY_RESTRICTED_OPERATION is imposed.</p><p>If the subscription has been configured with run_as_owner=true, then no user switching will occur. This also means that any user who owns a table into which replication is happening can execute arbitrary code with the privileges of the subscription owner.</p><p>The default value of 'run_as_owner' is false, which is generally more secure.</p><p><b><i><u>Non-superusers can create subscriptions</u></i></b></p><p>For that, non-superusers must have been granted the pg_create_subscription role, and are required to specify a password for authentication. 
Non-superusers must additionally have CREATE permissions on the database in which the subscription is to be created.</p><p>A very basic example to show the above steps:</p><div style="text-align: left;">postgres=# create user u1;<br />CREATE ROLE<br />postgres=# Grant pg_create_subscription to u1;<br />GRANT ROLE<br />postgres=# Grant Create on database postgres to u1;<br />GRANT</div><div style="text-align: left;"><div>postgres=> set session authorization u1;</div><div>SET</div>postgres=> create subscription sub1 connection 'dbname=postgres password=p1' publication pub1 with (connect=false);<br />WARNING: subscription was created, but is not connected<br />HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription.</div><div style="text-align: left;">CREATE SUBSCRIPTION</div><p>Note that superusers can set password_required = false for non-superusers that own the subscription.</p><p><b><i><u>Large transactions can be applied in parallel</u></i></b></p><p>CREATE SUBSCRIPTION mysub CONNECTION … PUBLICATION mypub WITH (streaming = parallel);</p><p>Performance improvements in the range of 25-40% have been observed (for further details, check <a href="https://www.postgresql.org/message-id/CAJpy0uBm0%2ByZs%2B7emKCp2%2BRdvA3Gy_SW0aLfntfHvcEiWq_5Ew%40mail.gmail.com" target="_blank">here</a>).</p><p>Each large transaction is assigned to one of the available workers, which reduces the apply lag because changes are applied as they arrive instead of waiting till the whole transaction is received by the subscriber. The worker remains assigned until the transaction is completed. 
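The assignment scheme just described can be sketched as a tiny scheduler. This is a hypothetical simulation for illustration, not the server's implementation: a streamed transaction keeps one worker from its first chunk until completion, and when the pool is exhausted, chunks fall back to being spilled to a file (here, `max_workers` plays the role of max_parallel_apply_workers_per_subscription).

```python
# Illustrative sketch of the parallel-apply scheduling described above
# (hypothetical scheduler, not PostgreSQL code).

class ParallelApplyPool:
    def __init__(self, max_workers):
        self.free = list(range(max_workers))  # idle worker ids
        self.assigned = {}                    # xid -> worker id (sticky)
        self.spilled = set()                  # xids serialized to files

    def handle_chunk(self, xid):
        """Route one streamed chunk of transaction `xid`."""
        if xid in self.assigned:              # worker stays assigned
            return f"worker-{self.assigned[xid]}"
        if self.free:
            self.assigned[xid] = self.free.pop()
            return f"worker-{self.assigned[xid]}"
        self.spilled.add(xid)                 # no worker available
        return "spill-to-file"

    def finish(self, xid):
        """Transaction completed; release its worker, if any."""
        if xid in self.assigned:
            self.free.append(self.assigned.pop(xid))

pool = ParallelApplyPool(max_workers=2)
assert pool.handle_chunk(101) == "worker-1"       # first free worker
assert pool.handle_chunk(102) == "worker-0"
assert pool.handle_chunk(103) == "spill-to-file"  # pool exhausted
pool.finish(101)
assert pool.handle_chunk(104) == "worker-1"       # freed worker is reused
```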
This preserves commit ordering and avoids file I/O in most cases, although we still need to spill to a file if there is no worker available.</p><p>It is important to maintain commit order to avoid failures due to: (a) transaction dependencies - say we insert a row in the first transaction and update it in the second transaction on the publisher; allowing the subscriber to apply both in parallel can then make the update fail; (b) deadlocks - allowing transactions that update the same set of rows/tables in the opposite order to be applied in parallel can lead to deadlocks.</p><p>max_parallel_apply_workers_per_subscription sets the maximum number of parallel apply workers per subscription.</p><p><b><u><i>Logical replication can copy tables in binary format</i></u></b></p><p>CREATE SUBSCRIPTION mysub CONNECTION … PUBLICATION mypub WITH (binary = true);</p><p>Copying tables in binary format may reduce the time spent, depending on column types. Prior to v16, this option applied the binary format only to replicating changes, not to the initial table copy. A binary copy is supported only when both publisher and subscriber are v16 or later.</p><p><b><u><i>Indexes other than PK and REPLICA IDENTITY can be used on the subscriber</i></u></b></p><p>Using REPLICA IDENTITY FULL on the publisher can lead to a full table scan per tuple change on the subscriber when no REPLICA IDENTITY or PK index is available. This feature allows us to consider the other indexes present.</p><p>The index that can be used must be a btree index, not a partial index, and the leftmost field must be a column (not an expression) that references the remote relation column. 
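The eligibility rules for a candidate subscriber-side index can be sketched as a simple predicate. The structures below are hypothetical stand-ins for illustration (not PostgreSQL's catalog representation); only the three rules themselves come from the text.

```python
# Illustrative sketch of the index-eligibility rules above (hypothetical
# data structures, not PostgreSQL catalogs): the candidate must be a btree,
# must not be partial, and its leftmost key must be a plain column that
# references a column of the remote relation.

def usable_for_replica_identity_full(index, remote_columns):
    if index["access_method"] != "btree":
        return False
    if index["is_partial"]:
        return False
    leftmost = index["keys"][0]
    if leftmost["is_expression"]:
        return False
    return leftmost["column"] in remote_columns

remote_cols = {"id", "val"}
good = {"access_method": "btree", "is_partial": False,
        "keys": [{"is_expression": False, "column": "id"}]}
hash_idx = dict(good, access_method="hash")
expr_idx = {"access_method": "btree", "is_partial": False,
            "keys": [{"is_expression": True, "column": None}]}

assert usable_for_replica_identity_full(good, remote_cols)
assert not usable_for_replica_identity_full(hash_idx, remote_cols)
assert not usable_for_replica_identity_full(expr_idx, remote_cols)
```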
These limitations help to keep the index scan similar to PK/RI index scans.</p><p>The performance improvement is proportional to the amount of data in the table.</p><p>The last topic I wanted to cover in this blog is the features in this area that are being worked upon or are being discussed in the community for future releases.</p><p><b><u><i>What's next:</i></u></b></p><p>1. <b><i><u>Upgrade of logical replication nodes.</u></i></b> This will allow a seamless upgrade of logical replication nodes. One of the biggest challenges as of now is that we don't migrate slots across versions, due to which users need to rebuild most of the setup again. See one of the examples in this <a href="https://elephanttamer.net/?p=58" target="_blank">blog</a>.</p><p>2. <b style="font-style: italic; text-decoration-line: underline;">Synchronization of replication slots to allow failover.</b> This will perform synchronization of logical slots from the publisher to physical standbys. After this feature, if the publisher goes down, we can promote its physical standby, and then the subscribers can connect to it and continue working.</p><p>3. <b><u><i>Replication of sequences.</i></u></b></p><p>4. <b><u><i>DDL Replication.</i></u></b> This is a relatively large feature but is one of the most requested features to allow seamless major version upgrades.</p>

Logical Replication Improvements in PostgreSQL-15 (2022-11-07)<p>There are various areas in PostgreSQL like Partitioning, Logical Replication, Parallel Query, Vacuum, etc. which improve with each new version. 
In this blog, I'll summarize the various enhancements in <a href="https://www.postgresql.org/docs/devel/logical-replication.html" target="_blank">Logical Replication</a> that users could see in the recently released <a href="https://www.postgresql.org/docs/release/15.0/" target="_blank">PostgreSQL 15</a>. You can read about the enhancements in this area in the previous release in one of my previous <a href="http://amitkapila16.blogspot.com/2021/09/logical-replication-improvements-in.html" target="_blank">blogs</a>.</p><p><u><b>Allow replication of prepared transactions:</b></u></p><p>In the last release, we allowed <a href="https://www.postgresql.fastware.com/blog/logical-decoding-of-two-phase-commits" target="_blank">logical decoding of prepared transactions</a>, and with this release, we added support for replicating prepared transactions in built-in logical replication. Previously, we sent the changes of a prepared transaction only once COMMIT PREPARED had been done. Users can enable replication at PREPARE time with the following syntax:</p><p><span style="color: #ffa400;">CREATE PUBLICATION mypub FOR ALL TABLES;</span></p><p><span style="color: #ffa400;">CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres' PUBLICATION mypub WITH (two_phase = true);</span></p><p>The key advantages of this feature are:</p><p>(a) Reduces the lag to replicate data by replicating it at PREPARE time instead of waiting till the COMMIT PREPARED</p><p>(b) This provides the base for building conflict-free logical replication because if the prepare fails on subscriber nodes, one can roll it back on the publisher node as well.</p><p>The key implementation points:</p><p>(a) The replication of prepared transactions is enabled once the initial sync for all the tables is finished.</p><p>(b) To avoid conflicts in the prepared transaction during APPLY, we use the prepare identifier as pg_gid_<subscriber-id>_<transaction-id>.</p><p>(c) It is not allowed to change this option with ALTER
SUBSCRIPTION command.</p><p>(d) ALTER SUBSCRIPTION REFRESH PUBLICATION is allowed with copy_data=false once two_phase is enabled for a subscription.</p><p>For a detailed description of this feature, see the <a href="https://www.postgresql.fastware.com/blog/two-phase-commits-for-logical-replication-publications-subscriptions">blog</a>.</p><p><u><b>Allow replication of all tables in the schema:</b></u></p><p>Previously, one needed to specify all the tables of a particular schema while creating a publication to publish all tables of that schema. If the user later created more tables in that schema, they also needed to be added to the publication separately. This was inconvenient for users. This feature makes it much easier for users by allowing them to specify just the schema name in case they want all tables of the schema to be published. The syntax to specify a schema name (TABLES IN SCHEMA) is as follows:</p><p><span style="color: #ffa400;">CREATE PUBLICATION mypub FOR TABLES IN SCHEMA mysch;</span></p><p><span style="color: #ffa400;">CREATE PUBLICATION mypub FOR TABLE mytab, TABLES IN SCHEMA mysch;</span></p><p>Note that it is allowed to specify schemas along with individual tables from other schemas. 
Users can add the schemas to existing publications with the following syntax:</p><p><span style="color: #ffa400;">ALTER PUBLICATION mypub ADD TABLES IN SCHEMA mysch;</span></p><p>Note that adding schemas to a publication that is already subscribed to by some subscribers will require an ALTER SUBSCRIPTION … REFRESH PUBLICATION action on the subscriber side in order to become effective.</p><p>For a detailed description of this feature, see the <a href="https://www.postgresql.fastware.com/blog/logical-replication-of-all-tables-in-schema-in-postgresql-15" target="_blank">blog</a>.</p><p><u><b>Allow specifying row filters for logical replication of tables:</b></u></p><p>This feature allows specifying an additional WHERE clause after each table in the publication definition. Rows that don't satisfy this WHERE clause will be filtered out. This allows a set of tables to be partially replicated. The row filter is per table. The WHERE clause must be enclosed in parentheses. Users can define row filters with the following command:</p><p><span style="color: #ffa400;">CREATE PUBLICATION mypub FOR TABLE mytab1 WHERE (c1 > 10 and c2 < 20), mytab2 WHERE (c3 LIKE 'bob');</span></p><p>Users are allowed to specify row filters for existing tables in a publication with the command:</p><p><span style="color: #ffa400;">ALTER PUBLICATION mypub SET TABLE mytab1 WHERE (c1 > 10 and c2 < 20), mytab2 WHERE (c3 LIKE 'bob');</span></p><p>This can help distribute data among nodes, improve performance by sending data selectively, and hide some sensitive data.</p><p>Key points to note about this feature:</p><p>(a) The row filter WHERE clause for a table added to a publication that publishes UPDATES and/or DELETES must contain only columns that are covered by REPLICA IDENTITY.</p><p>(b) The row filter WHERE clause for a table added to a publication that publishes INSERT can use any column.</p><p>(c) Row filters are ignored for TRUNCATE TABLE commands.</p><p>(d) If the row filter evaluates to
NULL, it is regarded as "false" aka the corresponding row won't be replicated.</p><p>(e) The WHERE clause only allows simple expressions that don't have user-defined functions, user-defined operators, user-defined types, user-defined collations, non-immutable built-in functions, or references to system columns.</p><p>(f) During initial table synchronization, only data that satisfies the row filters is copied to the subscriber.</p><p>(g) For partitioned tables, the publication parameter publish_via_partition_root determines if it uses the partition's row filter (if the parameter is false, the default) or the root partitioned table's row filter.</p><p>For a detailed description of this feature, see <a href="https://www.postgresql.org/docs/devel/logical-replication-row-filter.html" target="_blank">docs</a> and <a href="https://www.postgresql.fastware.com/blog/introducing-publication-row-filters" target="_blank">blog</a>.</p><p><u><b>Allow specifying column lists for logical replication of tables:</b></u></p><p>This feature allows specifying an optional column list when adding a table to logical replication. Columns not included in this list are not sent to the subscriber, allowing the schema on the subscriber to be a subset of the publisher schema. The choice of columns can be based on behavioral or performance reasons. 
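The publisher-side effect of a column list can be sketched as a simple projection of each row change before it is sent. This is an illustrative simulation (hypothetical helper, not PostgreSQL code) of the behavior just described:

```python
# Illustrative sketch (not PostgreSQL code): with a column list, only the
# listed columns of each change are sent, so the subscriber's table can be
# a subset of the publisher's schema.

def project_change(row, column_list):
    """Keep only the published columns of a row change."""
    if column_list is None:          # no column list: send every column
        return dict(row)
    return {col: row[col] for col in column_list}

row = {"c1": 1, "c2": "secret", "c3": 3.14}
assert project_change(row, ["c1", "c3"]) == {"c1": 1, "c3": 3.14}
assert project_change(row, None) == row
```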
Users can define column lists with the following syntax:</p><p><span style="color: #ffa400;">CREATE PUBLICATION mypub FOR TABLE mytab1 (c1, c2), mytab2 (c3);</span></p><p>Users are allowed to specify column lists for existing tables in a publication with the command:</p><p><span style="color: #ffa400;">ALTER PUBLICATION mypub SET TABLE mytab1 (c1, c2);</span></p><p>Key points to note about this feature:</p><p>(a) If a publication publishes UPDATES and/or DELETES, any column list must include the table's replica identity columns.</p><p>(b) If a publication publishes only INSERT operations, then the column list may omit replica identity columns.</p><p>(c) Column lists are ignored for TRUNCATE TABLE commands.</p><p>(d) A column list can contain only simple column references.</p><p>(e) A column list can't be specified if the publication also publishes FOR TABLES IN SCHEMA.</p><p>(f) During initial data synchronization, only the published columns are copied.</p><p>(g) For partitioned tables, the publication parameter publish_via_partition_root determines which column list is used. If publish_via_partition_root is true, the root partitioned table's column list is used. Otherwise, if publish_via_partition_root is false (the default), each partition's column list is used.</p><p>For a detailed description of this feature, see <a href="https://www.postgresql.org/docs/devel/logical-replication-col-lists.html" target="_blank">docs</a> and <a href="https://www.postgresql.fastware.com/blog/column-lists-in-logical-replication-publications" target="_blank">blog</a>.</p><p><u><b>Allows logical replication to run as the owner of the subscription:</b></u></p><p>Previously, the subscription's APPLY process would run with the privileges of a superuser, but with PostgreSQL 15, it runs with the privileges of the subscription owner. 
So, this would prevent logical replication workers from performing insert, update, delete, truncate, or copy commands on tables unless the subscription owner has permission to do so. Only superusers, roles with bypassrls, and table owners are allowed to replicate into tables with row-level security policies.</p><p>The purpose of this work is to allow subscriptions to be managed by non-superusers and to protect servers with subscriptions from malicious activity on the publisher side.</p><p><u><b>Conflict Resolution:</b></u></p><p>Conflicts can happen during the apply of transactions on the subscriber for various reasons, like a PRIMARY KEY violation or the schemas being different. By default, PostgreSQL will keep retrying the operation on an error. Before PostgreSQL 15, users had the following options: (a) They can manually remove the conflicting data to allow replication to proceed. (b) Use pg_replication_origin_advance() to advance the LSN to a location beyond the failed transaction so that on restart replication starts from a point after the conflicting transaction.</p><p>It is quite inconvenient for users to use either of these methods because for option (a) users are forced to remove/change data on subscribers even though they want the corresponding data from the publisher to be ignored. To use option (b), users need to find the LSN of the failing transaction, probably by using pg_waldump or some other tool on the publisher side, and also the origin information is not apparent, as it was generated internally for the purpose of replication. 
While using pg_replication_origin_advance(), if users mistakenly set a wrong LSN (either of a future commit or of some operation in the middle of a transaction), the system can omit data that it was not supposed to, leading to an inconsistent replica.</p><p>The other problem is that the system will keep retrying to apply the transaction even when it can't succeed without the user's intervention, and users don't have any way to stop it apart from manually disabling the subscription by using ALTER SUBSCRIPTION mysub DISABLE;</p><p>In PostgreSQL 15, we tried to make the use of pg_replication_origin_advance() easier by providing the required information and by providing a similar but more robust way. The other feature it provides is to allow subscriptions to be disabled on error.</p><p>We introduced a new subscription option 'disable_on_error', which allows the subscription to be automatically disabled if any errors are detected by subscription workers during data replication from the publisher. Users can specify this option either during CREATE SUBSCRIPTION or with the ALTER SUBSCRIPTION command.</p><p><span style="color: #ffa400;">CREATE SUBSCRIPTION mysub CONNECTION '…' PUBLICATION mypub WITH (disable_on_error = true);</span></p><p><span style="color: #ffa400;">ALTER SUBSCRIPTION mysub SET (disable_on_error = true);</span></p><p>Then, we extended the error context information of subscription worker errors by adding: (a) Finish LSN. It will indicate commit_lsn for committed transactions, and prepare_lsn for prepared transactions. (b) Replication origin name. 
This will contain the name of the replication origin that keeps track of replication progress and is created automatically with the subscription definition.</p><p>The extended error context information can make the use of pg_replication_origin_advance() easier for users.</p><p>Then, we also introduced a more robust way to skip conflicting transactions by using the command: ALTER SUBSCRIPTION mysub SKIP (lsn = '0/1566D10'); This is more robust because it prevents users from setting a wrong LSN by performing checks against the specified LSN. We ensure that the specified LSN is the same as the first transaction's finish LSN that is sent by the publisher after the restart. The first transaction successfully applied to the subscriber will clear the specified LSN. We also ensure that the specified LSN is greater than the origin's current LSN.</p><p>For a detailed description of this feature, see blogs [<a href="https://www.postgresql.fastware.com/blog/how-to-handle-logical-replication-conflicts-in-postgresql" target="_blank">1</a>] and [<a href="https://www.postgresql.fastware.com/blog/addressing-replication-conflicts-using-alter-subscription-skip" target="_blank">2</a>].</p><p><u><b>pg_stat_subscription_stats:</b></u></p><p>A new view that shows stats about errors that occurred during the application of logical replication changes or during initial table synchronization. 
See <a href="https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION-STATS" target="_blank">docs</a> for more information on this.</p><p><u><b>Communication improvements between publisher and subscriber:</b></u></p><p>PostgreSQL has made enhancements in communication to (a) prevent sending the transaction BEGIN/END messages when all the transaction data is filtered, and (b) prevent replication from restarting due to timeouts while processing large transactions where most or all the data is filtered.</p><p>Before PostgreSQL 15, we used to send BEGIN/END messages for empty transactions (where all changes are skipped/filtered), which wasted a lot of CPU cycles and network bandwidth to build and transmit such messages. To avoid sending messages for empty transactions, we started sending the BEGIN message only with the first change transmitted from publisher to subscriber, and we send the END (COMMIT) message only when a BEGIN was sent. To avoid any delays in synchronous replication, we do send a keepalive message after skipping an empty transaction and process its feedback.</p><p>While processing long transactions where most of the changes are filtered out (say, because the particular operations are not published), the publisher doesn't send any communication to the subscriber, which times out after a certain threshold time, leading to a restart of replication. To fix this, we started periodically sending keepalive messages in such cases.</p><p>For a detailed description of this work, see the <a href="https://www.postgresql.fastware.com/blog/how-postgresql-15-improved-communication-in-logical-replication" target="_blank">blog</a>.</p><p>I believe this is a good mix of improvements for logical replication in PostgreSQL 15 which will help users. 
Your feedback here or on the PostgreSQL mailing lists is welcome!</p>

Logical Replication Improvements In PostgreSQL-14 (2021-09-15)<div style="text-align: left;"><span style="color: #444444;">In the upcoming release of PostgreSQL-14, we will see multiple enhancements in <a href="https://www.postgresql.org/docs/devel/logical-replication.html">Logical Replication</a> which I hope will further increase its usage. This blog is primarily to summarize and briefly explain all the enhancements in Logical Replication.</span></div><div style="text-align: left;"><span style="color: #444444;"><br /></span></div><div style="text-align: left;"><span style="color: #444444;"><span><b><i><u>Decoding of large transactions:<br /></u></i></b></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>Allow streaming large in-progress transactions to subscribers. Before PostgreSQL-14, transactions were streamed only at commit time, which led to a large apply lag for large transactions. With this feature, we will see the apply lag reduced, and in certain scenarios that will lead to a big performance win. I have explained this feature in detail in my previous <a href="https://amitkapila16.blogspot.com/2021/07/logical-replication-of-in-progress.html">blog</a>.<br /></span><span><b><i><u><br /></u></i></b></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span><b><i><u>Performance of logical decoding:<br /></u></i></b></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>Reduced the CPU usage and improved the decoding performance of transactions having a lot of DDLs. 
It has been observed that decoding of a transaction containing truncation of a table with 1000 partitions now finishes in about 1 second, whereas before this work it used to take 4-5 minutes. Before explaining how we achieved this performance gain, let me briefly describe what an invalidation message is in PostgreSQL, as that is important for understanding this optimization. These are messages to flush stale system cache entries in each backend session. We normally execute these at the command end in the backend which generated them and send them </span><span>at the transaction end </span><span>via a shared queue to other backends for processing. These are normally generated for insert/delete/update operations on system catalogs, which happen during DDL operations.</span></span></div><div style="text-align: left;"><span style="color: #444444;"><span><br /></span><span>While decoding, we used to execute all the invalidations of the entire transaction at each command end, as we had no way of knowing which invalidations happened before that command. Due to this, transactions involving large amounts of DDL used to take more time and also led to high CPU usage. But now we know the specific invalidations at each command end, so we execute only the required ones. This work has been accomplished by commit </span><span><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc">d7eb52d718</a>.</span></span></div><div style="text-align: left;"><span style="color: #444444;"><b><i><u><br /></u></i></b></span></div><div style="text-align: left;"><span style="color: #444444;"><span><b><i><u>Initial table sync:<br /></u></i></b></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>The initial table synchronization involves copying the initial snapshot of the table by the table sync worker, after which the table is brought up to a synchronized state with the main apply worker. 
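</span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>For example, the synchronization state of each table can be checked on the subscriber via the pg_subscription_rel catalog, where srsubstate is 'i' (initialize), 'd' (data is being copied), 's' (synchronized), or 'r' (ready):<br /></span></span></div><div style="text-align: left;"><span style="color: #ffa400;">SELECT srrelid::regclass AS tbl, srsubstate FROM pg_subscription_rel;</span></div><div style="text-align: left;"><span style="color: #444444;"><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>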
This whole work used to be done in a single transaction using a temporary replication slot, which had major drawbacks: (a) the slot would hold the WAL until the entire sync was complete; </span><span>(b) any error during the sync phase would roll back the entire copy, which is painful for large copies; (c) there was a risk of exceeding the CID limit.<br /></span><span><br /></span></div><div style="text-align: left;"><span style="color: #444444;"><span>We made the following improvements to overcome these drawbacks:<br /></span><span>Allowed multiple transactions in the tablesync phase.<br /></span><span>Used permanent slots and origins to track the progress of tablesync.<br /></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>This work has been explained in detail in the <a href="https://www.postgresql.fastware.com/blog/logical-replication-tablesync-workers">blog</a>. This work has been accomplished by commit </span><span><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ce0fdbfe9722867b7fad4d3ede9b6a6bfc51fb4e">ce0fdbfe97</a>.</span><br /><span><b><i><u><br /></u></i></b></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span><b><i><u>Logical decoding of two-phase commits:<br /></u></i></b></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>This allows us to decode transactions at prepare time and send them to the output plugin, instead of doing so only at commit time. This lets plugins process the transaction at prepare time and route it to another node if required. 
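</span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>As a sketch of how this can be tried with the test_decoding plugin (the table and GID names are illustrative, max_prepared_transactions must be set to a non-zero value, and the fourth argument of pg_create_logical_replication_slot enables two-phase decoding):<br /></span></span></div><div style="text-align: left;"><span style="color: #ffa400;">SELECT 'init' FROM pg_create_logical_replication_slot('twophase_slot', 'test_decoding', false, true);<br />CREATE TABLE t2pc (a int);<br />BEGIN;<br />INSERT INTO t2pc VALUES (1);<br />PREPARE TRANSACTION 'demo_gid';<br />-- the prepared transaction's changes can already be consumed here<br />SELECT data FROM pg_logical_slot_peek_changes('twophase_slot', NULL, NULL);<br />COMMIT PREPARED 'demo_gid';</span></div><div style="text-align: left;"><span style="color: #444444;"><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span>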
This has two advantages: (a) it allows two-phase distributed transactions across multiple nodes via logical replication; (b) it reduces the apply lag by sending and replaying the transaction on another node at prepare time.<br /></span><span><br /></span></div><div style="text-align: left;"><span style="color: #444444;"><span>This work has been explained in detail in the <a href="https://www.postgresql.fastware.com/blog/logical-decoding-of-two-phase-commits">blog</a></span><span>. This work has been accomplished by commits [<a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0aa8a01d04c8fe200b7a106878eebc3d0af9105c">1</a>][<a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a271a1b50e9bec07e2ef3a05e38e7285113e4ce6">2</a>][<a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=19890a064ebf53dedcefed0d8339ed3d449b06e6">3</a>].<br /></span><span><b><i><u><br /></u></i></b></span></span></div><div style="text-align: left;"><span style="color: #444444;"><span><b><i><u>Monitor logical decoding:<br /></u></i></b></span><span><br /></span></span></div><div style="text-align: left;"><span style="color: #444444;">Replication slots are used to keep state about replication streams originating from this cluster. Their primary purpose is to prevent the premature removal of WAL. In the context of logical replication, a slot represents a stream of changes that can be replayed to a client in the order they were made on the origin server. Each slot streams a sequence of changes from a single database.</span></div><div style="text-align: left;"><span style="color: #444444;"><span><br /></span><span>We have added a system view <a href="https://www.postgresql.org/docs/14/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-SLOTS-VIEW">pg_stat_replication_slots</a> to report replication slot activity. This can be used to monitor the amount of data streamed to the output plugin or subscriber, or spilled to disk. 
Additionally, users can monitor the total amount of transaction data decoded for sending transactions to the decoding output plugin while decoding changes from WAL for this slot. Note that this includes data that is streamed and/or spilled. The function pg_stat_reset_replication_slot() resets slot statistics.</span></span></div><p style="text-align: left;"><span style="color: #444444;">Example:</span></p><span style="color: #444444;"></span><span style="color: #ffa400;">CREATE TABLE stats_test(data text);</span> <blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;">SET logical_decoding_work_mem to '64kB';</span><span style="color: #ffa400;"><br /></span><blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;">SELECT 'init' FROM pg_create_logical_replication_slot('slot_stats', 'test_decoding');</span><span style="color: #ffa400;"><br /></span><blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;">INSERT INTO stats_test SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000) g(i);</span><span style="color: #ffa400;"><br /></span><blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;">SELECT count(*) FROM pg_logical_slot_peek_changes('slot_stats', NULL, NULL, 'skip-empty-xacts', '1');</span><div><span style="color: #ffa400;"><br /></span><span style="color: #ffa400;">SELECT slot_name, spill_txns, spill_count, spill_bytes, total_txns, total_bytes FROM pg_stat_replication_slots;</span> <blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;"> slot_name | spill_txns | spill_count | spill_bytes | total_txns | total_bytes</span><br /><blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;">------------+------------+-------------+-------------+------------+-------------</span><br /><blockquote style="text-align: left;"></blockquote><span style="color: #ffa400;"> slot_stats | 1 | 12 | 763893 | 1 | 763893</span><br /><span 
style="color: #ffa400;">(1 row)</span> <div><br /><span style="color: #ffa400;">DROP TABLE stats_test;</span> <br /><span style="color: #ffa400;"></span><div style="text-align: left;"><span style="color: #ffa400;"><br /></span></div><div style="text-align: left;"><span style="color: #ffa400;">SELECT pg_drop_replication_slot('slot_stats');</span> </div><span><div style="color: #666666;"><br /></div></span><div><span style="color: #444444;"><b><i><u>Allow publications to be easily added and removed:</u></i></b><br /><br /></span></div><div><span style="color: #444444;">Previously, if the user needed to add or remove publications for a subscription, she had to mention all the existing publications along with the change. Consider a case where a subscription is subscribed to two publications and we want to add an additional one: the user needs to mention all three (the two previous and the one new) while doing ALTER SUBSCRIPTION. This is explained with an example below:<br /><br /></span></div><div><span style="color: #444444;">Initial Subscription</span><br /><span style="color: #ffa400;">CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION mypub1, mypub2;</span><br /><br /></div><div><span style="color: #444444;">Add a new Publication mypub3.<br /></span><span style="color: #ffa400;">ALTER SUBSCRIPTION mysub SET PUBLICATION mypub1, mypub2, mypub3;</span></div><div><span style="color: #444444;"><br /></span></div><div><span style="color: #444444;">This could be inconvenient for users, especially if there are many existing publications to which a subscription is subscribed. We have added a new way to make this easier by supporting ADD/DROP of individual publications. See <a href="https://www.postgresql.org/docs/devel/sql-altersubscription.html">docs</a> for the syntax. 
With the new way, in the above case, to add a new publication the user only needs to perform<br /></span><span style="color: #ffa400;">ALTER SUBSCRIPTION mysub ADD PUBLICATION mypub3;</span></div><div><span style="color: #444444;"><br /></span></div><div><span style="color: #444444;"><span style="color: black;"><span style="color: #444444;">This work has been accomplished by commit </span></span><span style="color: #444444;"><span><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=82ed7748b710e3ddce3f7ebc74af80fe4869492f">82ed7748b7</a>.</span></span></span></div><div><span style="color: #444444;"><br /><b><i><u>Binary transfer mode:</u></i></b><br /><br /></span></div><div><span style="color: #444444;">This feature provides an option during CREATE/ALTER SUBSCRIPTION to allow data from publishers to be sent in binary format. The default value of this option is false. Even when this option is enabled, only data types that have binary send and receive functions will be transferred in binary. When doing cross-version replication, if the subscriber lacks a binary receive function for a type, the data transfer will fail, and this option can't be used. This mode is generally faster. Example to enable binary mode:</span><br /><span style="color: #666666;"><br /></span></div><div><span style="color: #ffa400;">CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION mypub1 WITH (binary = true);</span><br /><span style="color: #666666;"><br /></span></div><div><span style="color: #444444;">This work has been accomplished by commit <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9de77b5453130242654ff0b30a551c9c862ed661">9de77b5453</a>.<br /><b><i><u><br /></u></i></b></span></div><div><span style="color: #444444;"><b><i><u>Allow getting messages via pgoutput:</u></i></b><br /><br /></span></div><div><span style="color: #444444;">Provide a “messages” option to the pgoutput plugin. 
This allows logical decoding messages (i.e., those generated via pg_logical_emit_message) to be sent to the slot consumer. This is useful for pgoutput plugin users who use it for Change Data Capture. An example is given below:</span></div><div><span style="color: #ffa400;"><br /></span></div><div><span style="color: #ffa400;">SELECT pg_create_logical_replication_slot('pgout_slot','pgoutput');</span></div><div style="text-align: left;"><span style="color: #ffa400;"><br /></span></div><div style="text-align: left;"><span style="color: #ffa400;">SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message');<br /><br /></span></div><div style="text-align: left;"><span style="color: #ffa400;">SELECT get_byte(data, 1), encode(substr(data, 24, 23), 'escape') FROM pg_logical_slot_peek_binary_changes('pgout_slot', NULL, NULL, 'proto_version', '1', 'publication_names', 'pgout_slot', 'messages', 'true') OFFSET 1 LIMIT 1;<br /><br /></span></div><div style="text-align: left;"><span style="color: #ffa400;"> get_byte | encode<br />----------+-------------------------<br /> 1 | a transactional message<br />(1 row)<br /><br /></span></div><div><span style="color: #ffa400;">SELECT pg_drop_replication_slot('pgout_slot');<br /></span><div style="text-align: left;"><span style="color: #444444;">While getting changes, the publication_names option is not required for logical decoding messages but is specified just so that the function doesn't give an error. You can refer to the logical replication message formats in the <a href="https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html">PostgreSQL docs</a>. 
This work has been accomplished by commit <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ac4645c0157fc5fcef0af8ff571512aa284a2cec">ac4645c015</a>.</span></div></div></div></div>Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com5tag:blogger.com,1999:blog-8673752770146776575.post-81641688199006970032021-07-17T05:16:00.004-07:002021-07-21T02:35:42.878-07:00Logical Replication Of In-Progress Transactions<p><span style="color: #444444; font-family: arial;">Logical Replication was introduced in PostgreSQL-10, and it has been improved with each version since. Logical Replication is a method to replicate data selectively, unlike physical replication, where the data of the entire cluster is copied. It can be used to build a multi-master or bi-directional replication solution. One of the main differences compared with physical replication was that it allowed replicating a transaction only at commit time. This leads to an apply lag for large transactions, where we need to wait to transfer the data until the transaction is finished. In the upcoming PostgreSQL-14 release, we are introducing a mechanism to stream large in-progress transactions. We have seen replication performance improve by 2x or more for large transactions, especially due to early filtering. 
See the performance </span><span style="color: #444444; font-family: arial;">test </span><span style="color: #444444; font-family: arial;">results </span><span style="color: #444444; font-family: arial;">reported </span><span style="color: #444444; font-family: arial;">on <a href="https://www.postgresql.org/message-id/CAFPTHDZv4XiK2VJG_fpjuvq7rv2-o4185uzJJvmzgAW%3D64FY8Q%40mail.gmail.com">hackers</a> </span><span style="color: #444444; font-family: arial;">and in another </span><a href="https://www.enterprisedb.com/blog/logical-decoding-large-progress-transactions-postgresql" style="font-family: arial;">blog</a><span style="color: #444444; font-family: arial;"> on the same topic. This will reduce the apply lag to a good degree.</span></p><p><span style="color: #444444; font-family: arial;">The first thing we needed for this feature was to decide when to start streaming the WAL content. One could ask: if we have such a technology, why not stream each change of a transaction separately as and when we retrieve it from the WAL? That would actually lead to sending much more data across the network, because we would need to send some additional transaction information with each change so that the apply side can recognize the transaction to which the change belongs. To address this, in PostgreSQL-13, we introduced a new GUC parameter <a href="https://www.postgresql.org/docs/devel/runtime-config-resource.html">logical_decoding_work_mem</a> which allows users to specify the maximum amount of memory to be used by logical decoding, beyond which some of the decoded changes are either written to local disk or streamed to the subscriber. 
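</span></p><p><span style="color: #444444; font-family: arial;">For example, the limit can be changed for a single session or for the whole cluster (the values below are illustrative):</span></p><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;">-- per session<br /></span><span style="font-family: arial;">SET logical_decoding_work_mem = '64MB';<br /></span><span style="font-family: arial;">-- cluster-wide, applied without a restart<br /></span><span style="font-family: arial;">ALTER SYSTEM SET logical_decoding_work_mem = '128MB';<br /></span><span style="font-family: arial;">SELECT pg_reload_conf();</span></span></div><p><span style="color: #444444; font-family: arial;">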
The parameter is also used to control the memory used by logical decoding as explained in the <a href="https://www.enterprisedb.com/postgres-tutorials/postgres-13-logicaldecodingworkmem-and-how-it-saves-your-server-going-out-memory">blog</a>.</span></p><p><span style="color: #444444; font-family: arial;">The next thing that prevented incremental decoding was the delay in finding the association of a subtransaction with its top-level XID. During logical decoding, we accumulate all changes along with their (sub)transaction. Now, while sending the changes to the output plugin or streaming them to the other node, we need to combine all the changes that happened in the transaction, which requires us to find the association of each top-level transaction with its subtransactions. Before PostgreSQL-14, we built this association at the XLOG_XACT_ASSIGNMENT WAL record, which we normally log after 64 subtransactions or at commit time, because these are the only two times when we get such an association in the WAL. To find this association as it happens, we now also write the assignment info into the WAL immediately, as part of the first WAL record for each subtransaction. This is done only when wal_level=logical, to minimize the overhead.</span></p><p><span style="color: #444444; font-family: arial;">Yet another thing required for incremental decoding was to process invalidations at each command end. The basic idea of invalidations is that they bring the caches (like the relation cache) up to date so that the next command uses the up-to-date schema. This is required to correctly decode WAL incrementally, as while decoding we use the relation attributes from the caches. For this, when wal_level=logical, we write invalidations into the WAL at the command end so that decoding can use this information. The invalidations are decoded and accumulated in the top-level transaction, and then executed during replay. 
This obviates the need to decode the invalidations as part of a commit record.</span></p><p><span style="color: #444444; font-family: arial;">The previous paragraphs explained the enhancements required in the server infrastructure to allow incremental decoding. The next step was to provide APIs (stream methods) for out-of-core logical replication to stream large in-progress transactions. We added seven methods to the output plugin API to allow this: five required callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb, and stream_change_cb) and two optional callbacks (stream_message_cb and stream_truncate_cb). For details about these APIs, refer to the <a href="https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin.html#LOGICALDECODING-OUTPUT-PLUGIN-CALLBACKS">PostgreSQL docs</a>.</span></p><p><span style="color: #444444; font-family: arial;">When streaming an in-progress transaction, the changes (and messages) are streamed in blocks demarcated by the stream_start_cb and stream_stop_cb callbacks. Once all the decoded changes are transmitted, the transaction can be committed using the stream_commit_cb callback (or possibly aborted using the stream_abort_cb callback). 
One example sequence of streaming transaction may look like the following:</span></p><div style="text-align: left;"><span style="background-color: white;"><span style="color: #f1c232;"><span style="font-family: arial;">/* Change logical_decoding_work_mem to 64kB in the session */<br /></span><span style="font-family: arial;">postgres=# show logical_decoding_work_mem;<br /></span><span style="font-family: arial;"> logical_decoding_work_mem<br /></span><span style="font-family: arial;">---------------------------<br /></span><span style="font-family: arial;"> 64kB<br /></span><span style="font-family: arial;">(1 row)<br /></span><span style="font-family: arial;">postgres=# CREATE TABLE stream_test(data text);<br /></span><span style="font-family: arial;">CREATE TABLE</span></span></span></div><div style="text-align: left;"><span style="background-color: white;"><span style="color: #f1c232;"><span style="font-family: arial;">postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');<br /></span><span style="font-family: arial;"> ?column?<br /></span><span style="font-family: arial;">----------<br /></span><span style="font-family: arial;"> init<br /></span><span style="font-family: arial;">(1 row)<br /></span><span style="font-family: arial;">postgres=# INSERT INTO stream_test SELECT repeat('a', 6000) || g.i FROM generate_series(1, 500) g(i);<br /></span><span style="font-family: arial;">INSERT 0 500<br /></span><span style="font-family: arial;">postgres=# SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL, 'include-xids', '1', 'skip-empty-xacts', '1', 'stream-changes', '1');<br /></span><span style="font-family: arial;"> data<br /></span><span style="font-family: arial;">--------------------------------------------------<br /></span><span style="font-family: arial;"> opening a streamed block for transaction TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br 
/></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;">...<br /></span><span style="font-family: arial;">...<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> closing a streamed block for transaction TXN 741<br /></span><span style="font-family: arial;"> opening a streamed block for transaction TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;">...<br /></span><span style="font-family: arial;">...<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> streaming change for TXN 741<br /></span><span style="font-family: arial;"> closing a streamed block for transaction TXN 741<br /></span><span style="font-family: arial;"> committing streamed transaction TXN 741<br /></span><span style="font-family: arial;">(505 rows)</span></span></span></div><p><span style="color: #444444; font-family: arial;">The actual sequence of callback calls may be more complicated depending on the server operations. 
There may be blocks for multiple streamed transactions, some of the transactions may get aborted, etc.</span></p><p><span style="color: #444444; font-family: arial;">Note that streaming is triggered when the total amount of changes decoded from the WAL (for all in-progress transactions) exceeds the limit defined by the <a href="https://www.postgresql.org/docs/devel/runtime-config-resource.html">logical_decoding_work_mem</a> setting. At that point, the largest top-level transaction (measured by the amount of memory currently used for decoded changes) is selected and streamed. However, in some cases we still have to spill to disk even if streaming is enabled, because we have exceeded the memory threshold but have not yet decoded the complete tuple; e.g., we have decoded only the toast-table insert but not the main-table insert, or decoded a speculative insert but not the corresponding confirm record. As soon as we get the complete tuple, we stream the transaction, including the serialized changes.</span></p><p><span style="color: #444444; font-family: arial;">While streaming in-progress transactions, concurrent aborts may cause failures when the output plugin (or decoding of WAL records) consults catalogs (both system and user-defined). Let me explain this with an example: suppose there is one catalog tuple with (xmin: 500, xmax: 0). Now, transaction 501 updates the catalog tuple, after which we have two tuples: (xmin: 500, xmax: 501) and (xmin: 501, xmax: 0). Now, if 501 is aborted and some other transaction, say 502, updates the same catalog tuple, then the first tuple is changed to (xmin: 500, xmax: 502). So, the problem is that when we try to decode a tuple inserted/updated by 501 after the catalog update, we will see the catalog tuple with (xmin: 500, xmax: 502) as visible, because the deleting xid 502 is not visible to our snapshot. 
And when we try to decode with that catalog tuple, it can lead to a wrong result or a crash. So, it is necessary to detect concurrent aborts to allow streaming of in-progress transactions. To detect a concurrent abort, during a catalog scan we check the status of the xid, and if it is aborted we report a specific error so that we can stop streaming the current transaction and discard the already streamed changes on such an error. We might have already streamed some of the changes for the aborted (sub)transaction, but that is fine, because when we decode the abort we stream an abort message to truncate the changes on the subscriber.</span></p><p><span style="color: #444444; font-family: arial;">To add support for streaming of in-progress transactions to the built-in logical replication, we primarily needed to do four things:</span></p><p><span style="color: #444444; font-family: arial;">(a) Extend the logical replication protocol to identify in-progress transactions, and allow adding additional bits of information (e.g., the XID of subtransactions). Refer to the <a href="https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html">PostgreSQL docs</a> for the protocol details. 
</span></p><p><span style="color: #444444; font-family: arial;">(b) Modify the output plugin (pgoutput) to implement the new stream API callbacks, by leveraging the extended replication protocol.</span></p><p><span style="color: #444444; font-family: arial;">(c) Modify the replication apply worker to properly handle streamed in-progress transactions by spilling the data to disk and then replaying it on commit.</span></p><p><span style="color: #444444; font-family: arial;">(d) Provide a new option for streaming while creating a subscription.</span></p><p><span style="color: #444444; font-family: arial;">The example below demonstrates how to set up streaming via built-in logical replication:</span></p><p><span style="color: #f1c232; font-family: arial;"><b>Publisher node:</b></span></p><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;">Set logical_decoding_work_mem = '64kB';<br /></span><span style="font-family: arial;"># Set up publication with some initial data<br /></span><span style="font-family: arial;"><br /></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;">CREATE TABLE test_tab (a int primary key, b varchar);<br /></span><span style="font-family: arial;">INSERT INTO test_tab VALUES (1, 'foo'), (2, 'bar');<br /></span><span style="font-family: arial;">CREATE PUBLICATION tap_pub FOR TABLE test_tab;</span></span></div><p><span style="color: #f1c232; font-family: arial;"><b>Subscriber node:</b></span></p><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;">CREATE TABLE test_tab (a int primary key, b varchar);<br /></span><span style="font-family: arial;">CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION tap_pub WITH (streaming = on);</span></span></div><p><span style="color: #f1c232; font-family: arial;"><b>Publisher Node:</b></span></p><div style="text-align: 
left;"><span style="color: #f1c232;"><span style="font-family: arial;"># Ensure the corresponding replication slot is created on publisher node<br /></span><span style="font-family: arial;">select slot_name, plugin, slot_type from pg_replication_slots;<br /></span><span style="font-family: arial;"> slot_name | plugin | slot_type<br /></span><span style="font-family: arial;">-----------+----------+-----------<br /></span><span style="font-family: arial;"> tap_sub | pgoutput | logical<br /></span><span style="font-family: arial;">(1 row)<br /></span><span style="font-family: arial;"><br /></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"># Confirm there is no streamed bytes yet<br /></span><span style="font-family: arial;">postgres=# SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;<br /></span><span style="font-family: arial;"> slot_name | stream_txns | stream_count | stream_bytes<br /></span><span style="font-family: arial;">-----------+-------------+--------------+--------------<br /></span><span style="font-family: arial;"> tap_sub | 0 | 0 | 0<br /></span><span style="font-family: arial;">(1 row)</span><span style="font-family: arial;"><br /></span><span style="font-family: arial;"><br /></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"># Insert, update and delete enough rows to exceed the logical_decoding_work_mem (64kB) limit.<br /></span><span style="font-family: arial;">BEGIN;<br /></span><span style="font-family: arial;">INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);<br /></span><span style="font-family: arial;">UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;<br /></span><span style="font-family: arial;">DELETE FROM test_tab WHERE mod(a,3) = 0;</span><span style="font-family: arial;"><br /></span><span style="font-family: arial;"><br 
/></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"># Confirm that streaming happened<br /></span><span style="font-family: arial;">SELECT slot_name, stream_txns, stream_count, stream_bytes FROM pg_stat_replication_slots;<br /></span><span style="font-family: arial;"> slot_name | stream_txns | stream_count | stream_bytes<br /></span><span style="font-family: arial;">-----------+-------------+--------------+--------------<br /></span><span style="font-family: arial;"> tap_sub | 1 | 22 | 1444410<br /></span><span style="font-family: arial;">(1 row)</span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"><br /></span><span style="font-family: arial;"><b>Subscriber Node:<br /></b></span><span style="font-family: arial;"># The streamed data is still not visible.<br /></span><span style="font-family: arial;">select * from test_tab;<br /></span><span style="font-family: arial;"> a | b<br /></span><span style="font-family: arial;">---+-----<br /></span><span style="font-family: arial;"> 1 | foo<br /></span><span style="font-family: arial;"> 2 | bar<br /></span><span style="font-family: arial;">(2 rows)</span><span style="font-family: arial;"><br /></span><span style="font-family: arial;"><b><br /></b></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"><b>Publisher Node:<br /></b></span><span style="font-family: arial;"># Commit the large transactions<br /></span><span style="font-family: arial;">Commit;<br /></span><span style="font-family: arial;"><b><br /></b></span></span></div><div style="text-align: left;"><span style="color: #f1c232;"><span style="font-family: arial;"><b>Subscriber Node:<br /></b></span><span style="font-family: arial;"># The data must be visible on the subscriber<br /></span><span style="font-family: arial;">select count(*) from test_tab;<br /></span><span 
style="font-family: arial;"> count<br /></span><span style="font-family: arial;">-------<br /></span><span style="font-family: arial;"> 3334<br /></span><span style="font-family: arial;">(1 row)</span></span></div><p style="text-align: left;"><span style="color: #444444; font-family: arial;">This feature was proposed in <a href="https://www.postgresql.org/message-id/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com">2017</a> and committed in 2020 as part of various commits <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0bead9af48">0bead9af48</a>, <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0">c55040ccd0</a>, <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=45fdc9738b">45fdc9738b</a>, <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7259736a6e5b7c7588fff9578370736a6648acbb">7259736a6e</a>, and <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=464824323e57dc4b397e8b05854d779908b55304">464824323e</a></span><span style="color: #444444; font-family: arial;">. It took a long time to complete this feature because of the various infrastructure pieces required to achieve this. I would really like to thank all the people involved in this feature especially </span><a href="https://www.linkedin.com/in/tomasvondra/?originalSubdomain=cz" style="font-family: arial;">Tomas Vondra</a><span style="color: #444444; font-family: arial;"> who has initially proposed it and then </span><a href="https://www.linkedin.com/in/dilip-kumar-6b75863a/" style="font-family: arial;">Dilip Kumar</a><span style="color: #444444; font-family: arial;"> who along with me had completed various remaining parts and made it a reality. 
Then also to other people like </span><a href="https://www.linkedin.com/in/neha-sharma-9b9b1327/" style="font-family: arial;">Neha Sharma</a><span style="color: #444444; font-family: arial;">, </span><a href="https://www.linkedin.com/in/mahendra-singh-thalor-02786b87/" style="font-family: arial;">Mahendra Singh Thalor</a><span style="color: #444444; font-family: arial;">, </span><a href="https://www.linkedin.com/in/ajin-cherian-8827b241/" style="font-family: arial;">Ajin Cherian</a><span style="color: #444444; font-family: arial;">, and </span><a href="https://www.linkedin.com/in/kuntal-ghosh-4822073a/" style="font-family: arial;">Kuntal Ghosh</a><span style="color: #444444; font-family: arial;"> who helped throughout the project to do reviews and various tests. Also, special thanks to </span><a href="https://www.linkedin.com/in/andres-freund/" style="font-family: arial;">Andres Freund</a><span style="color: #444444; font-family: arial;"> and other community members who have suggested solutions to some of the key problems of this feature. Last but not least, thanks to </span><a href="https://www.enterprisedb.com/" style="font-family: arial;">EDB</a><span style="color: #444444; font-family: arial;"> and </span><a href="https://www.postgresql.fastware.com/" style="font-family: arial;">Fujitsu's</a><span style="color: #444444; font-family: arial;"> management who encouraged me and some of the other members to work on this feature.</span></p>Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com0tag:blogger.com,1999:blog-8673752770146776575.post-74956440059535818962020-05-10T21:20:00.000-07:002020-05-10T21:20:10.332-07:00Improved (auto)vacuum in PostgreSQL 13<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<a href="https://www.postgresql.org/docs/devel/sql-vacuum.html" target="_blank">Vacuum</a> is one of the sub-systems in PostgreSQL that gets improved with each release. I have checked the past five <a href="https://www.postgresql.org/docs/release/" target="_blank">releases</a> and each has quite a few improvements for vacuum. Following the trend, there are a number of improvements to vacuum in the upcoming PostgreSQL release (v13), which are covered in this blog.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-1</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Vacuum will now be allowed to process indexes in parallel, which means it can leverage multiple CPUs to perform index cleanup. Internally, this uses PostgreSQL's background worker infrastructure to accomplish the work. A new option, PARALLEL, controls the parallelism used by vacuum: users can specify the number of workers used to perform the VACUUM command, which is limited by the minimum of (a) the number of indexes on the table and (b) <a href="https://www.postgresql.org/docs/devel/runtime-config-resource.html#GUC-MAX-PARALLEL-WORKERS-MAINTENANCE" target="_blank">max_parallel_maintenance_workers</a>. Parallelism for the VACUUM command is enabled by default, which means it will be used even if the user didn't specify the PARALLEL option, with the number of workers equal to the number of indexes on the table being vacuumed. We can disable parallelism for vacuum by specifying zero as the number of workers with the PARALLEL option. An index can participate in parallel vacuum only if its size is greater than <a href="https://www.postgresql.org/docs/devel/runtime-config-query.html#GUC-MIN-PARALLEL-INDEX-SCAN-SIZE" target="_blank">min_parallel_index_scan_size</a>.</div>
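As a quick way to see which indexes are eligible, one can compare index sizes against that threshold (512kB by default). The sketch below assumes the example table pvac created later in this post and default settings:

```sql
-- List index sizes for the example table "pvac"; indexes at or below
-- min_parallel_index_scan_size (default 512kB) are processed only by the leader.
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_index
WHERE indrelid = 'pvac'::regclass;
```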
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The PARALLEL option can't be used with the FULL option of the VACUUM command. This feature isn't available via autovacuum; users need to run the VACUUM command to get its benefit.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There are a number of other blogs [<a href="https://blog.dbi-services.com/postgresql-13-parallel-vacuum-for-indexes/" target="_blank">1</a>][<a href="https://www.highgo.ca/2020/02/28/parallel-vacuum-in-upcoming-postgresql-13/" target="_blank">2</a>] written on this topic which show the benefit of this feature. A recent <a href="https://www.enterprisedb.com/postgres-tutorials/what-parallel-vacuum-postgresql-13" target="_blank">blog published by EnterpriseDB</a> shows that vacuum can be approximately 4 times faster when using 7 workers, especially when the relation is in dire need of vacuuming; read that blog for more information about the test.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here is a simple test to show the usage of this feature.</div>
<div style="text-align: justify;">
postgres=# create table pvac(c1 int, c2 text DEFAULT md5(random()::text), c3 text DEFAULT md5(random()::text));</div>
<div style="text-align: justify;">
CREATE TABLE</div>
<div style="text-align: justify;">
postgres=# create index pvac_1 on pvac(c1);</div>
<div style="text-align: justify;">
CREATE INDEX</div>
<div style="text-align: justify;">
postgres=# create index pvac_2 on pvac(c2);</div>
<div style="text-align: justify;">
CREATE INDEX</div>
<div style="text-align: justify;">
postgres=# create index pvac_3 on pvac(c3);</div>
<div style="text-align: justify;">
CREATE INDEX</div>
<div style="text-align: justify;">
postgres=# insert into pvac select i FROM generate_series(1,100000) as i;</div>
<div style="text-align: justify;">
INSERT 0 100000</div>
<div style="text-align: justify;">
postgres=# update pvac set c1=c1;</div>
<div style="text-align: justify;">
UPDATE 100000</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# vacuum (parallel 4, verbose) pvac;</div>
<div style="text-align: justify;">
INFO: vacuuming "public.pvac"</div>
<div style="text-align: justify;">
INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2)</div>
<div style="text-align: justify;">
INFO: scanned index "pvac_1" to remove 100000 row versions</div>
<div style="text-align: justify;">
..</div>
<div style="text-align: justify;">
..</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In the above test, 2 parallel workers were used even though we specified 4. The reason is that the number of workers can't be more than the number of indexes (3 here), and it is further limited by max_parallel_maintenance_workers, as the command below shows.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# show max_parallel_maintenance_workers;</div>
<div style="text-align: justify;">
max_parallel_maintenance_workers </div>
<div style="text-align: justify;">
----------------------------------</div>
<div style="text-align: justify;">
2</div>
<div style="text-align: justify;">
(1 row)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The other important thing we have ensured in this feature is that parallelism won't use more memory or I/O bandwidth than a non-parallel vacuum.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-2</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Allow inserts to trigger autovacuum activity. This feature will be really helpful for insert-only tables, where an anti-wraparound vacuum could otherwise be the first vacuum the table ever receives, and such a run would take a really long time. Vacuuming earlier allows heap pages to be set all-visible, which lets index-only scans skip heap fetches and reduces the work necessary when the table needs to be frozen. This is controlled by two new GUCs and reloptions: <a href="https://www.postgresql.org/docs/devel/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-INSERT-THRESHOLD" target="_blank">autovacuum_vacuum_insert_threshold</a> and <a href="https://www.postgresql.org/docs/devel/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-INSERT-SCALE-FACTOR" target="_blank">autovacuum_vacuum_insert_scale_factor</a>.</div>
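As a rough sketch of how these parameters interact: autovacuum triggers an insert-driven vacuum approximately when the number of tuples inserted since the last vacuum exceeds autovacuum_vacuum_insert_threshold + autovacuum_vacuum_insert_scale_factor * reltuples, with defaults of 1000 and 0.2 respectively. Both can also be set per table (the table name measurements below is hypothetical):

```sql
-- Hypothetical insert-mostly table: trigger insert vacuums sooner than
-- the defaults (threshold 1000 inserted tuples, scale factor 0.2) would.
ALTER TABLE measurements SET (
    autovacuum_vacuum_insert_threshold = 5000,
    autovacuum_vacuum_insert_scale_factor = 0.05
);
```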
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Let me demonstrate the benefit of this feature with the help of an example. Start the server with autovacuum_vacuum_insert_threshold = -1, which disables insert-triggered vacuums (one can edit the postgresql.conf file to change this parameter or use the ALTER SYSTEM command). Connecting with psql, we can execute the commands below to see the behavior.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# show autovacuum_vacuum_insert_threshold;</div>
<div style="text-align: justify;">
autovacuum_vacuum_insert_threshold </div>
<div style="text-align: justify;">
------------------------------------</div>
<div style="text-align: justify;">
-1</div>
<div style="text-align: justify;">
(1 row)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# create table vac_ins(c1 int, c2 char(500));</div>
<div style="text-align: justify;">
CREATE TABLE</div>
<div style="text-align: justify;">
postgres=# create index idx_vac_ins on vac_ins(c1);</div>
<div style="text-align: justify;">
CREATE INDEX</div>
<div style="text-align: justify;">
postgres=# insert into vac_ins values(generate_series(1,20000),'aaaaaa');</div>
<div style="text-align: justify;">
INSERT 0 20000</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After a few seconds, you will notice the message below in the server log, which shows that autovacuum has performed only an analyze on the table.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
LOG: automatic analyze of table "postgres.public.vac_ins" system usage: CPU: user: 0.03 s, system: 0.11 s, elapsed: 0.15 s</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After that, run the command below:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# explain (analyze) select c1 from vac_ins where c1 < 100;</div>
<div style="text-align: justify;">
QUERY PLAN </div>
<div style="text-align: justify;">
-----------------------------------------------------------------------------------------------------------------------------</div>
<div style="text-align: justify;">
Index Only Scan using idx_vac_ins on vac_ins (cost=0.29..16.02 rows=99 width=4) (actual time=0.019..0.092 rows=99 loops=1)</div>
<div style="text-align: justify;">
Index Cond: (c1 < 100)</div>
<div style="text-align: justify;">
Heap Fetches: 99</div>
<div style="text-align: justify;">
Planning Time: 0.269 ms</div>
<div style="text-align: justify;">
Execution Time: 0.129 ms</div>
<div style="text-align: justify;">
(5 rows)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here, "Heap Fetches: 99" shows that the above query needs to visit the heap to fetch the required information even though it is present in the index and the scan type used is Index Only Scan.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# truncate vac_ins;</div>
<div style="text-align: justify;">
TRUNCATE TABLE</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now, restart the server with the default value of autovacuum_vacuum_insert_threshold and execute the commands below from psql to see how the new feature helps:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# show autovacuum_vacuum_insert_threshold;</div>
<div style="text-align: justify;">
autovacuum_vacuum_insert_threshold </div>
<div style="text-align: justify;">
------------------------------------</div>
<div style="text-align: justify;">
1000</div>
<div style="text-align: justify;">
(1 row)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# insert into vac_ins values(generate_series(1,20000),'aaaaaa');</div>
<div style="text-align: justify;">
INSERT 0 20000</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After a few seconds, you will notice the messages below in the server log, which indicate that autovacuum has performed both a vacuum and an analyze on the table.</div>
<div style="text-align: justify;">
LOG: automatic vacuum of table "postgres.public.vac_ins": index scans: 0</div>
<div style="text-align: justify;">
<span style="white-space: pre;"> </span>pages: 0 removed, 1334 remain, 0 skipped due to pins, 0 skipped frozen</div>
<div style="text-align: justify;">
...</div>
<div style="text-align: justify;">
LOG: automatic analyze of table "postgres.public.vac_ins" system usage: CPU: user: 0.01 s, system: 0.15 s, elapsed: 0.21 s</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After that, run the command below:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# explain (analyze) select c1 from vac_ins where c1 < 100;</div>
<div style="text-align: justify;">
QUERY PLAN </div>
<div style="text-align: justify;">
----------------------------------------------------------------------------------------------------------------------------</div>
<div style="text-align: justify;">
Index Only Scan using idx_vac_ins on vac_ins (cost=0.29..6.02 rows=99 width=4) (actual time=0.016..0.039 rows=99 loops=1)</div>
<div style="text-align: justify;">
Index Cond: (c1 < 100)</div>
<div style="text-align: justify;">
Heap Fetches: 0</div>
<div style="text-align: justify;">
Planning Time: 0.166 ms</div>
<div style="text-align: justify;">
Execution Time: 0.074 ms</div>
<div style="text-align: justify;">
(5 rows)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here, "Heap Fetches: 0" shows that the above query doesn't need to visit the heap to fetch the required information. We can see that the execution time is reduced significantly in this case.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-3</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Allow an (auto)vacuum to display additional information about the heap or index in case of an error. This feature could help users when the database has some corruption. For example, if one of the indexes on a relation has corrupted data (due to bad hardware or some bug), it will let the user know which index is affected, and the user can take appropriate action, like REINDEX, or drop and recreate that particular index to overcome the problem. In the case of the heap, it displays the block number for which the error occurred, which makes it much easier for users and developers to detect the problem. In the worst case, if a particular block in a table is corrupted, users can remove all the rows of that block and the table can be used again.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
vacuum pvac;</div>
<div style="text-align: justify;">
ERROR: error induced to demonstrate use of information</div>
<div style="text-align: justify;">
CONTEXT: while scanning block 2469 of relation "public.pvac"</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Next, I used "SELECT * FROM heap_page_items(get_raw_page('pvac', 2469));" to find the information of all line pointers in the page and then removed them using the query below.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# delete from pvac where ctid Between '(2469,1)' and '(2469,11)';</div>
<div style="text-align: justify;">
DELETE 11</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Then, I ran vacuum again:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
postgres=# vacuum pvac;</div>
<div style="text-align: justify;">
ERROR: error induced to demonstrate use of information</div>
<div style="text-align: justify;">
CONTEXT: while scanning block 2468 of relation "public.pvac"</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now, you can see the block number has changed from 2469 to 2468, which means the vacuum could proceed. In this case, I manually induced the error by changing the code, so it occurs for every block; in reality, it would occur only for some particular block(s), and once the user gets rid of those block(s), the table can be reused. I don't want to say that is an ideal situation, but at least it allows users to proceed, and it can help developers narrow down the bug if there is one in the code.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-4</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Autovacuum will now log WAL usage statistics along with other information. The WAL usage includes the total number of records, the number of full-page images, and the total number of bytes. The buffer usage and WAL usage stats combined give us the approximate I/O usage of a particular autovacuum run. One example run's output is shown below; refer to the "WAL usage" line for the newly added stats.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
LOG: automatic vacuum of table "postgres.public.pvac": index scans: 1</div>
<div style="text-align: justify;">
pages: 0 removed, 2470 remain, 0 skipped due to pins, 0 skipped frozen</div>
<div style="text-align: justify;">
tuples: 100000 removed, 100000 remain, 0 are dead but not yet removable, oldest xmin: 529</div>
<div style="text-align: justify;">
buffer usage: 9276 hits, 4 misses, 3 dirtied</div>
<div style="text-align: justify;">
avg read rate: 0.062 MB/s, avg write rate: 0.047 MB/s</div>
<div style="text-align: justify;">
system usage: CPU: user: 0.40 s, system: 0.00 s, elapsed: 0.50 s</div>
<div style="text-align: justify;">
WAL usage: 7909 records, 2 full page images, 2276323 bytes</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-5</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Make vacuum buffer counters 64 bits wide to avoid overflow of buffer usage stats. Without this change, in extreme cases where tables are large enough for 4 billion buffer accesses to be a possibility, the stats displayed are meaningless. See the example below of 'buffer usage' stats reported on pgsql-hackers:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
LOG: automatic vacuum of table "somtab.sf.foobar": index scans: 17</div>
<div style="text-align: justify;">
pages: 0 removed, 207650641 remain, 0 skipped due to pins, 13419403 skipped frozen</div>
<div style="text-align: justify;">
tuples: 141265419 removed, 3186614627 remain, 87783760 are dead but not yet removable</div>
<div style="text-align: justify;">
buffer usage: -2022059267 hits, -17141881 misses, 1252507767 dirtied</div>
<div style="text-align: justify;">
avg read rate: -0.043 MB/s, avg write rate: 3.146 MB/s</div>
<div style="text-align: justify;">
system usage: CPU 107819.92s/2932957.75u sec elapsed 3110498.10 sec</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Improvement-6</b></div>
<div style="text-align: justify;">
<b>---------------------</b></div>
<div style="text-align: justify;">
Add a wait event 'VacuumDelay' to report on cost-based vacuum delay. This will help us monitor [auto]vacuum throttling. It is a small but quite important feature: until now there has been no way to monitor this throttling, and it is not uncommon to see it in user environments.</div>
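With the new wait event, a throttled [auto]vacuum can be spotted from another session. A minimal monitoring query might look like the sketch below (the wait event is reported under the 'Timeout' wait event type):

```sql
-- Sessions currently sleeping in a cost-based vacuum delay
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event = 'VacuumDelay';
```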
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com0tag:blogger.com,1999:blog-8673752770146776575.post-35580929295961082862020-02-19T03:07:00.000-08:002020-02-23T04:03:53.265-08:00Parallelism, what next?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
This blog post is about the journey of parallelism in PostgreSQL so far and what is in store for the future. Since PostgreSQL 9.6, where parallel query first arrived, each release has improved it. Below is a brief overview of the parallel query features added in each release.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
PG9.6 added parallel execution of sequential scans, joins, and aggregates.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
PG10 added (a) support for parallel B-tree index scans, (b) support for parallel bitmap heap scans, (c) the ability to perform merge joins in parallel, (d) the ability to run non-correlated subqueries in parallel, (e) an improved ability for parallel workers to return pre-sorted data, and (f) increased parallel query usage in procedural language functions.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
PG11 added (a) parallel building of btree indexes, (b) hash joins performed in parallel using a shared hash table, (c) parallelization of the commands CREATE TABLE ... AS, SELECT INTO, and CREATE MATERIALIZED VIEW, (d) allowing UNION to run each SELECT in parallel if the individual SELECTs cannot be parallelized, (e) allowing partition scans to use parallel workers more efficiently, (f) allowing LIMIT to be passed to parallel workers, which lets workers reduce returned results and use targeted index scans, and (g) parallelization of single-evaluation queries, e.g. WHERE-clause aggregate queries and functions in the target list.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
PG12 allowed parallelized queries in SERIALIZABLE isolation mode.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>The progress for PG13 with respect to parallelism</b>. Some of the important advancements are:</div>
<div style="text-align: justify;">
(a) Parallel vacuum - This feature allows vacuum to leverage multiple CPUs to process indexes. It enables index vacuuming and index cleanup to be performed with background workers. It adds a PARALLEL option to the VACUUM command, where the user can specify the number of workers used to perform the command, limited by the number of indexes on the table. Specifying zero as the number of workers disables parallelism. For more information, see <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=40d964ec997f64227bc0ff5e058dc4a5770a70a9" target="_blank">commit</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
(b) Improve EXPLAIN's handling of per-worker details. This displays worker information in a much better way. A few visible side-effects, as mentioned in the <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=10013684970453a0ddc86050bba813c611114321" target="_blank">commit</a>, are:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
* In text format, instead of something like</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Sort Method: external merge Disk: 4920kB</div>
<div style="text-align: justify;">
Worker 0: Sort Method: external merge Disk: 5880kB</div>
<div style="text-align: justify;">
Worker 1: Sort Method: external merge Disk: 5920kB</div>
<div style="text-align: justify;">
Buffers: shared hit=682 read=10188, temp read=1415 written=2101</div>
<div style="text-align: justify;">
Worker 0: actual time=130.058..130.324 rows=1324 loops=1</div>
<div style="text-align: justify;">
Buffers: shared hit=337 read=3489, temp read=505 written=739</div>
<div style="text-align: justify;">
Worker 1: actual time=130.273..130.512 rows=1297 loops=1</div>
<div style="text-align: justify;">
Buffers: shared hit=345 read=3507, temp read=505 written=744</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
you get</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Sort Method: external merge Disk: 4920kB</div>
<div style="text-align: justify;">
Buffers: shared hit=682 read=10188, temp read=1415 written=2101</div>
<div style="text-align: justify;">
Worker 0: actual time=130.058..130.324 rows=1324 loops=1</div>
<div style="text-align: justify;">
Sort Method: external merge Disk: 5880kB</div>
<div style="text-align: justify;">
Buffers: shared hit=337 read=3489, temp read=505 written=739</div>
<div style="text-align: justify;">
Worker 1: actual time=130.273..130.512 rows=1297 loops=1</div>
<div style="text-align: justify;">
Sort Method: external merge Disk: 5920kB</div>
<div style="text-align: justify;">
Buffers: shared hit=345 read=3507, temp read=505 written=744</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
(c) Avoid unnecessary shm writes in Parallel Hash Join. This improves the performance of Parallel Hash Join by a significant amount on large systems running many-core joins. Though this work has been back-patched to v11 where Parallel Hash Join was introduced, I mentioned it here as it is done during PG13 development. For more information, see <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3e4818e9dd5be294d97ca67012528cb1c0b0ccaa" target="_blank">commit</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>What is being discussed for the future:</b></div>
<div style="text-align: justify;">
(a) Parallel grouping sets - PostgreSQL already supports parallel aggregation by aggregating in two stages. First, each process participating in the parallel portion of the query performs an aggregation step, producing a partial result for each group of which that process is aware. Second, the partial results are transferred to the leader via the Gather node. Finally, the leader re-aggregates the results across all workers in order to produce the final result.</div>
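The two-stage scheme is visible in EXPLAIN output for an ordinary parallel aggregate; the sketch below uses the items_sold table from the grouping-sets example that follows, and the plan shape shown in the comments is illustrative only (the actual plan depends on table size, statistics, and settings):

```sql
EXPLAIN (COSTS OFF)
SELECT brand, sum(sales) FROM items_sold GROUP BY brand;
-- An illustrative two-stage plan shape:
--  Finalize GroupAggregate          -- leader re-aggregates partial results
--    Group Key: brand
--    ->  Gather Merge               -- partial results transferred to leader
--          Workers Planned: 2
--          ->  Partial GroupAggregate   -- each worker aggregates its groups
--                Group Key: brand
--                ->  Sort
--                      Sort Key: brand
--                      ->  Parallel Seq Scan on items_sold
```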
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Following this, there has been a <a href="https://www.postgresql.org/message-id/flat/CAN_9JTx-k6cQdrLAFQrAa+Ozx7cET3M3rFe+u1YgaHjYpz8X0g@mail.gmail.com" target="_blank">discussion in the community</a> about parallelizing queries containing grouping sets in much the same way as we do parallel aggregation.</div>
<div style="text-align: justify;">
Basically, the aim is to parallelize queries like SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());</div>
<div style="text-align: justify;">
This feature has been proposed for PG13, but is not yet committed.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
(b) Parallel copy - We also want to parallelize the Copy command, in particular the "Copy <table_name> from .. ;" command. This will help improve bulk load operations in PostgreSQL. Currently, we do a lot of work during the Copy command. We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple. We first form the tuple (in the form of a value/null array) from that line, check whether it qualifies the WHERE condition and, if it does, perform constraint checks and a few other checks, and then finally store it in a local tuple array. Once we reach 1000 tuples or consume 64KB (whichever occurs first), we insert them together and then, for each tuple, insert into the index(es) and execute after-row triggers. The aim of this work is to parallelize as much of the work done during the copy as possible. There is an ongoing <a href="https://www.postgresql.org/message-id/CAA4eK1%2BkpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV%3Dg%40mail.gmail.com" target="_blank">discussion in the community </a>on this topic.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There is a small caveat here: to achieve parallel copy, we first need to improve the relation extension lock, as parallel workers currently block each other while extending the relation. There is already a discussion on this <a href="https://www.postgresql.org/message-id/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B%2BSs%3DGn1LA%40mail.gmail.com" target="_blank">topic in the community</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
(c) Parallel file_fdw - The <a href="https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com" target="_blank">proposed work</a> in this area allows file_fdw to divide its scan up for parallel workers, much like a parallel seq scan.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There are more areas where parallelism can be used, like parallel DML (inserts, updates, deletes). During a discussion with <a href="https://twitter.com/mengtangmu?lang=en" target="_blank">Thomas Munro</a>, it came up that it would be beneficial if we could parallelize index creation and index scans for index types other than btree, especially gin and gist. Note that we already support <a href="https://amitkapila16.blogspot.com/2018/05/parallel-index-scans-in-postgresql.html" target="_blank">parallel index scans</a> and <a href="https://www.depesz.com/2018/02/12/waiting-for-postgresql-11-support-parallel-btree-index-builds/" target="_blank">parallel index creation</a> for btree. I won't go into the details of these operations, as we haven't seen any proposals for them so far. Similarly, we can improve a few things in our current parallel infrastructure: (a) as of now, parallel workers are created and destroyed for each query; instead, we could have a pool of parallel query workers to avoid the overhead of starting them for each query, and (b) as of now, each worker can use up to work_mem of memory, which might increase the overall memory usage of a query. We might want to improve this, but currently, there is no proposal for it.</div>
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com16tag:blogger.com,1999:blog-8673752770146776575.post-32387418941376201492018-05-25T20:34:00.000-07:002018-05-29T18:50:33.532-07:00Parallel Index Scans In PostgreSQL<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (<a href="http://amitkapila16.blogspot.in/2015/11/parallel-sequential-scans-in-play.html">Parallel Sequential Scans</a>). Each of the past three releases (including <a href="https://www.postgresql.org/docs/11/static/release-11.html">PG-11</a>, which is in beta) has had parallel query as a major feature, which in itself shows how useful the feature is and how much work is being done on it. You can read more about parallel query in the <a href="https://www.postgresql.org/docs/11/static/parallel-query.html">PostgreSQL docs</a> or in a <a href="http://rhaas.blogspot.in/2017/03/parallel-query-v2.html">blog post</a> on this topic by my colleague <a href="https://www.linkedin.com/in/robertmhaas/">Robert Haas</a>. The intent of this blog post is to talk about <a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5262f7a4fc44f651241d2ff1fa688dd664a34874" target="_blank">parallel index scans</a>, which were released in PostgreSQL 10. Currently, parallel scans are supported only for btree indexes.<br />
<br />
To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor 20 (approximately a 20GB database). Q6 is a forecasting revenue change query: it quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. This type of "what if" query can be used to look for ways to increase revenue.<br />
<br />
explain analyze<br />
<div>
select sum(l_extendedprice * l_discount) as revenue<br />
from lineitem<br />
<div>
where l_shipdate >= date '1994-01-01' and</div>
<div>
l_shipdate < date '1994-01-01' + interval '1' year and</div>
<div>
l_discount between 0.02 - 0.01 and 0.02 + 0.01 and</div>
<div>
l_quantity < 24<br />
LIMIT 1;<br />
<br />
Non-parallel version of plan<br />
-------------------------------------<br />
Limit<br />
-> Aggregate<br />
-> Index Scan using idx_lineitem_shipdate on lineitem<br />
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01<br />
<div>
00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND</div>
<div>
(l_discount <= 0.03) AND (l_quantity < '24'::numeric))</div>
<div>
Planning Time: 0.406 ms<br />
Execution Time: 35073.886 ms<br />
<br />
Parallel version of plan<br />
-------------------------------<br />
Limit<br />
-> Finalize Aggregate<br />
-> Gather</div>
<div>
Workers Planned: 2</div>
<div>
Workers Launched: 2<br />
-> Partial Aggregate<br />
-> Parallel Index Scan using idx_lineitem_shipdate on lineitem<br />
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01 </div>
<div>
00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND</div>
<div>
(l_discount <= 0.03) AND (l_quantity < '24'::numeric))</div>
<div>
Planning Time: 0.420 ms<br />
Execution Time: 15545.794 ms<br />
<br />
We can see that the execution time is reduced by more than half for the parallel plan with two parallel workers. This query filters many rows, and the work (CPU time) to do so is divided among the workers (and the leader), leading to the reduced time.<br />
<br />
To further see the impact of the number of workers, we have used a somewhat bigger dataset (scale_factor = 50). The setup has been done using a <a href="https://github.com/tvondra/pg_tpch">TPC-H like benchmark for PostgreSQL</a>. We have also created a few additional indexes on columns (l_shipmode, l_shipdate, o_orderdate, o_comment).<br />
<br />
<b>Non-default parameter settings:</b><br />
random_page_cost = seq_page_cost = 0.1<br />
effective_cache_size = 10GB<br />
shared_buffers = 8GB<br />
work_mem = 1GB<b style="font-weight: normal;"><br /></b><br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><img height="340" src="https://lh3.googleusercontent.com/ndH5G5jDgu2Et-4XHNW4Pv2AVCDVE76LtRZqDiMRNwW7a4Leh52rqW_OFPL2ehfBUV8FBWR4qhtFqPfjbh2Znn84O34M3dY8NfrwSexTPI5k_vj9u9HMipdrDDS-OVlP_PYxaCgl" style="-webkit-transform: rotate(0.00rad); border: none; transform: rotate(0.00rad);" width="605" /></span></div>
<b style="font-weight: normal;"><br /></b>
<br />
<br />
The time reduces almost linearly up to 8 workers and then more slowly. A further increase in workers won't help unless the amount of data to scan increases.<br />
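If you want to experiment with different worker counts on your own data, the relevant knobs can be adjusted per session. A sketch, assuming the TPC-H lineitem table used above:

```sql
-- Allow up to 8 workers per Gather node in this session.
SET max_parallel_workers_per_gather = 8;

-- Or pin the planner's per-table worker estimate directly.
ALTER TABLE lineitem SET (parallel_workers = 8);

-- Re-run the query and check "Workers Planned/Launched" in the output.
EXPLAIN (ANALYZE)
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '1994-01-01'
  AND l_shipdate < date '1994-01-01' + interval '1' year
  AND l_discount BETWEEN 0.01 AND 0.03
  AND l_quantity < 24;
```

Note that the number of workers actually launched is also capped by max_parallel_workers and max_worker_processes.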
<div>
<br />
We have further evaluated the parallel index scan feature for all the queries in the TPC-H benchmark and found that it is used in a number of queries, with a positive impact (it reduced the execution time significantly). Below are results for TPC-H at scale factor 20 with two parallel workers. The X-axis indicates the query (1: Q-6, 2: Q14, 3: Q18).<br />
<br />
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><img height="340" src="https://lh4.googleusercontent.com/wFm9_N1ptn5FSiMUGlheNpH4mde4go3pwVzzI23-3nTn_YTyOkj_gpTrB96QjaR0q83udzWP0VuHfP2DZ1czNw4v4FeygBWJMbCPxFWrBpy_iuAVcUYMCIfOEiHnLKLxjJIFQRY5" style="-webkit-transform: rotate(0.00rad); border: none; transform: rotate(0.00rad);" width="605" /></span></div>
<b style="font-weight: normal;"><br /></b> <b>Under the Hood</b></div>
<div>
The basic idea is quite similar to parallel heap scans, where each worker (including the leader whenever possible) scans a block (all the tuples in a block) and then gets the next block to be scanned. The parallelism is implemented at the leaf level of the btree. The first worker to start a btree scan descends the tree until it reaches the leaf level, while the others wait until it gets there. Once the first worker has read a leaf block, it sets the next block to be read, wakes one of the workers waiting to scan blocks, and then proceeds to scan tuples from the block it has read. Henceforth, each worker, after reading a block, sets the next block to be read and wakes up the next waiting worker. This continues until no more pages are left to scan, at which point the parallel scan ends and all the workers are notified.<br />
<br />
A new GUC, min_parallel_index_scan_size, has been introduced; it indicates the minimum amount of index data that must be scanned in order for a parallel scan to be considered. Users can try changing the value of this parameter to see if a parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of the parallel plan accounts for the fact that the cost (CPU cost) to process the rows is divided equally among workers.<br />
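For example, a rough session sketch (the table is from the TPC-H setup used earlier in this post):

```sql
-- The default is 512kB; lowering it lets the planner consider a
-- parallel scan of smaller indexes (mainly useful for experimentation).
SET min_parallel_index_scan_size = '64kB';

-- Look for a "Parallel Index Scan" node in the resulting plan.
EXPLAIN
SELECT sum(l_extendedprice)
FROM lineitem
WHERE l_shipdate >= date '1994-01-01'
  AND l_shipdate < date '1995-01-01';
```

Setting the parameter to 0 makes every index eligible for a parallel scan, which can be handy when testing whether the parallel plan helps a particular query.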
<br />
In the end, I would like to thank the people (<a href="https://www.linkedin.com/in/rahila-syed-43057937/">Rahila Syed</a> and <a href="https://www.linkedin.com/in/robertmhaas/">Robert Haas</a>) who were involved in this work (along with me) and my employer <a href="https://www.enterprisedb.com/">EnterpriseDB</a> who has supported this work. I would also like to thank <a href="https://www.linkedin.com/in/rafia-sabih-b38721b9/">Rafia Sabih</a> who helped me in doing performance testing for this blog.</div>
</div>
</div>
</div>
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com4tag:blogger.com,1999:blog-8673752770146776575.post-43848239621488664512018-03-05T19:07:00.001-08:002018-03-06T05:19:59.666-08:00zheap: a storage engine to provide better control over bloat<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: #222222; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In the past few years, PostgreSQL has advanced a lot in terms of features, performance, and scalability for many-core systems. However, one of the problems that many enterprises still complain about is that its size increases over time, which is commonly referred to as bloat. PostgreSQL has a mechanism known as autovacuum, wherein a dedicated process (or set of processes) tries to remove the dead rows from the relation in an attempt to reclaim the space, but in many cases it can't reclaim the space completely. In particular, PostgreSQL always creates a new version of a tuple on an update, and that old version must eventually be removed by periodic vacuuming or by HOT-pruning; still, in many cases the space is never reclaimed completely. A similar problem occurs for tuples that are deleted. This leads to bloat in the database. My colleague Robert Haas has discussed some such cases in his blog </span><a href="http://rhaas.blogspot.in/2018/01/do-or-undo-there-is-no-vacuum.html" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">DO or UNDO - there is no VACUUM</span></a><span style="background-color: white; color: #222222; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">, where he explains how the PostgreSQL heap tends to bloat and mentions the solution (zheap: a new storage format for PostgreSQL) on which EnterpriseDB is working to avoid bloat whenever possible. The intent of this blog post is to elaborate on that work in some more detail and show some results.</span></div>
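To get a feel for how much dead-row cleanup autovacuum is facing on your own system, the cumulative statistics views can be queried, for example:

```sql
-- Tables with the most dead rows awaiting cleanup, and when
-- autovacuum last processed them.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;
```

A persistently high n_dead_tup on a heavily updated table, despite recent autovacuum runs, is a symptom of the bloat problem described above.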
<div style="text-align: left;">
<b id="docs-internal-guid-323f84fd-f933-9314-aaf5-1e0a777f1d8a" style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">This project has three major objectives:</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">1. Provide better control over bloat. zheap will prevent bloat (a) by allowing in-place updates in common </span><span style="background-color: white; font-family: "arial"; white-space: pre-wrap;">cases and (b) by reusing space as soon as a transaction that has performed a delete or </span><span style="background-color: white; font-family: "arial"; white-space: pre-wrap;">non-in-place update has committed. In short, with this new storage, whenever possible, we’ll avoid </span><span style="background-color: white; font-family: "arial"; white-space: pre-wrap;">creating bloat in the first place.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">2. Reduce write amplification both by avoiding rewrites of heap pages and by making it possible to do an update that touches indexed columns without updating every index.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">3. Reduce the tuple size by (a) shrinking the tuple header and (b) eliminating most alignment padding.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In this blog post, I will mainly focus on the first objective (Provide better control over bloat) and leave other things for future blog posts on this topic.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In-place updates will be supported except when (a) the new tuple is larger than the old tuple and the increase in size makes it impossible to fit the larger tuple onto the same page, or (b) some column is modified which is covered by an index that has not been modified to support “delete-marking”. Note that the work to support delete-marking in indexes is yet to start, and we intend to support it at least for btree indexes. For an in-place update, we have to write the old tuple to the undo log and the new tuple to the zheap, which helps concurrent readers read the old tuple from undo if the latest tuple is not yet visible to them.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Deletes write the complete tuple to the undo record, even though we could get away with writing just the TID, as we do for an insert operation. This allows us to reuse the space occupied by the deleted record as soon as the transaction that performed the operation commits. Basically, if the delete is not yet visible to some concurrent transaction, that transaction can read the tuple from undo, and in the zheap we can reclaim the space occupied by the record immediately (as soon as the transaction commits). </span>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Below are some graphs that compare the size of a heap and a zheap table when the table is constantly updated and there is a concurrent long-running transaction. To perform these tests, we have used pgbench to initialize the data (at scale factor 1000) and then used the simple-update test (which comprises one update, one select, and one insert) to perform updates. You can refer to the PostgreSQL manual for more on how to use </span><a href="https://www.postgresql.org/docs/devel/static/pgbench.html" style="text-decoration: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: underline; vertical-align: baseline; white-space: pre-wrap;">pgbench</span></a><span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">. These tests have been performed on an x86_64 machine with 2 sockets, 14 cores per socket, 2 threads per core, and 64GB of RAM. The non-default configuration for the tests is shared_buffers=32GB, min_wal_size=15GB, max_wal_size=20GB, checkpoint_timeout=1200, maintenance_work_mem=1GB, checkpoint_completion_target=0.9, synchronous_commit=off. The graphs below show the size of the table on which this test performed updates.</span></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br class="kix-line-break" /></span></div>
<div style="height: 0px;">
<br /></div>
<br />
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><img height="149" src="https://lh4.googleusercontent.com/1ETRS6EkB_9xfbCOakA1NA7HHKHcat8GgqLy_hhrBfVZq8gs8Onp3feHcZMFnqQ1_8MHDw6LQoGRsEVDJzXWlWUpTr4Ksn_6WRbKQRrn-bCa6CsoAHhTKL9F7gBAptrNA0__GrtK" style="border: none; transform: rotate(0rad);" title="Chart" width="400" /></span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">In the above test, we can see that the initial size of the table was 13GB in heap and 11GB in zheap. After running the test for 25 minutes (of which there was an open transaction for the first 15 minutes), the size in heap grows to 16GB in the 8-client test and to 20GB in the 64-client test, whereas for zheap the size remains at 11GB for both client counts at the end of the test. The initial size of zheap is smaller because the tuple header is smaller in zheap. Now, certainly for the first 15 minutes autovacuum can't reclaim any space due to the open transaction, but it can't reclaim it even after the open transaction has ended. On the other hand, the size of zheap remains constant, and all the undo data generated is removed within seconds of the transaction ending.</span></div>
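For reference, the table sizes reported in these graphs can be measured with the standard size functions, e.g. for the pgbench_accounts table used in these runs:

```sql
-- Table size including TOAST but excluding indexes:
SELECT pg_size_pretty(pg_table_size('pgbench_accounts'));

-- Total on-disk size including all indexes:
SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts'));
```

Sampling these values periodically during a run is enough to reproduce this kind of size-over-time comparison.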
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">Below are some more tests where the transaction has been kept open for a much longer duration.</span></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><img height="156" src="https://lh4.googleusercontent.com/BIygUbS4_Yg2DdtsoruG2GkXKvQzYX7-LXnOwTza1JI2kjGVG-m2_3s3WocAWsldtre-NqAvvxJAjtMid0HljdjJ-LVCnnLCB1-IWuxrB7bxc0F5YJLhhNeiw-9PATLXFSLJWd2h" style="border: none; transform: rotate(0rad);" title="Chart" width="400" /></span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">After running the test for 40 minutes (of which there was an open transaction for the first 30 minutes), the size in heap grows to 19GB in the 8-client test and to 26GB in the 64-client test, whereas for zheap the size remains at 11GB for both client counts at the end of the test, and all the undo generated during the test gets discarded within a few seconds after the open transaction has ended.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><img height="155" src="https://lh3.googleusercontent.com/CrDkir3CfNQszMMiXXwjmRAynj6O3k5Yt0pGt9bkJC5ThpDdNm8TAM53DyHyM90R7JxOMm9dnVbSo3kXeThVP5VlPWx8NkOATxuEbmIbUK_LGOKAtf-W17EK4kTVCebXX-fn6Rf0" style="border: none; transform: rotate(0rad);" title="Chart" width="400" /></span></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">After running the test for 55 minutes (of which there was an open transaction for the first 45 minutes), the size in heap grows to 22GB in the 8-client test and to 28GB in the 64-client test, whereas for zheap the size remains at 11GB for both client counts at the end of the test, and all the undo generated during the test gets discarded within a few seconds after the open transaction has ended.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">So from all three tests above, it is clear that the size of heap keeps growing as the duration of the concurrent long-running transaction increases. It was 13GB at the start of the test and grew to 20GB, then 26GB, then 28GB in the 64-client tests as the duration of the open transaction increased from 15 minutes to 30 minutes and then to 45 minutes. We have done a few more tests along these lines and found that as the duration of the open transaction increases, the size of heap keeps increasing, whereas zheap remains constant. For example, similar to the above, if we keep a transaction open for 60 minutes in a 70-minute test, the size of heap increases to 30GB. The increase in size also depends on the number of updates performed as part of the test.</span></div>
<div style="text-align: left;">
<b style="font-weight: normal;"><br /></b></div>
<div style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt; text-align: left;">
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">The above results show not only the impact on size; we also noticed that the TPS (transactions per second) with zheap is always better (by up to ~45%) in these tests. In similar tests on another high-end machine, we have seen much better results with zheap with respect to performance. I would like to defer the details of the raw performance of zheap vs. heap to another blog post, as this one has already become big. I would also like to mention that the above results don't mean that zheap will be better than heap in all cases; for example, rollbacks will be costlier in zheap. Just to be clear, this storage format is proposed as another format alongside the current heap, so that users can decide which storage they want to use for their use case.</span><br />
<span style="background-color: white; color: black; font-family: "arial"; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div>
<div style="text-align: left;">
<span id="docs-internal-guid-323f84fd-f93e-480b-c1eb-62e0ad848aaf"><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">The code for this project has been published and is proposed as a </span><a href="https://www.postgresql.org/message-id/CAA4eK1%2BYtM5vxzSM2NZm%2BpC37MCwyvtkmJrO_yRBQeZDp9Wa2w%40mail.gmail.com" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">feature for PG-12</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> to PostgreSQL community. Thanks to </span><a href="https://www.linkedin.com/in/kuntal-ghosh-4822073a/" style="text-decoration-line: none;"><span style="background-color: white; color: #1155cc; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;">Kuntal Ghosh</span></a><span style="background-color: white; font-family: "arial"; vertical-align: baseline; white-space: pre-wrap;"> for doing the performance tests mentioned in this blog post.</span></span></div>
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com7tag:blogger.com,1999:blog-8673752770146776575.post-52607968501490945732017-03-17T08:12:00.000-07:002017-03-27T05:45:45.235-07:00Hash indexes are faster than Btree indexes?<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
PostgreSQL has supported hash indexes for a long time, but they are not much used in production, mainly because they are not durable. Now, with the next version of PostgreSQL, they will be durable. The immediate question is how they perform as compared to btree indexes. A lot of work has been done in the coming version to make them faster. There are multiple ways in which we can compare the performance of hash and btree indexes, such as the time taken for creation of the index, or for search or insertion in the index. This blog will mainly focus on the search operation. By definition, hash indexes are O(1) and btree indexes are O(log n); however, with duplicates that is not exactly true.<br />
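As a quick refresher, a hash index is created with the USING clause and supports only equality lookups; before PostgreSQL 10, creating one also produced a warning that hash indexes are not WAL-logged. The table and column names below are illustrative:

```sql
CREATE TABLE accounts (id bigint PRIMARY KEY, email text);
CREATE INDEX accounts_email_hash ON accounts USING hash (email);

-- An equality predicate can use the hash index:
EXPLAIN SELECT * FROM accounts WHERE email = 'alice@example.com';

-- A range predicate cannot, so the planner must pick another plan:
EXPLAIN SELECT * FROM accounts WHERE email > 'a';
```

This equality-only restriction is why the comparisons below focus on point lookups, where hash indexes are expected to shine.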
<br />
To start with, let us see the impact of the work done to improve the performance of hash indexes. Below is the performance data for the pgbench read-only workload, comparing the performance of hash indexes between 9.6 and HEAD on an IBM POWER-8 machine with 24 cores, 192 hardware threads, and 492GB of RAM.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPHohSTGm3pOAa3N507PzkKSgLaaXrFTSEIjJycQ2CqvvCTlCpI0HaS9vy6iL9OaRRH9WoLaY0y6eYsp94MSgEWbMapBfplvOSnGDUmjXU1YKBdTENeo1FaA3CVIYYVhS_VOKaBUUN8zH/s1600/read_only_hash_index.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPHohSTGm3pOAa3N507PzkKSgLaaXrFTSEIjJycQ2CqvvCTlCpI0HaS9vy6iL9OaRRH9WoLaY0y6eYsp94MSgEWbMapBfplvOSnGDUmjXU1YKBdTENeo1FaA3CVIYYVhS_VOKaBUUN8zH/s400/read_only_hash_index.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span id="goog_919865474"></span><br />
The workload is such that all the data fits in shared buffers (the scale factor is 300 (~4.5GB) and shared_buffers is 8GB). As we can see from the above graph, performance has increased at all client counts, in the range of 7% to 81%, and the impact is more pronounced at higher client counts. The main work that led to this improvement is 6d46f478 (Improve hash index bucket split behavior.) and 293e24e5 (Cache hash index's metapage in rel->rd_amcache.).<br />
<br />
The first commit, 6d46f478, changed the heavyweight locks (locks that are taken on logical database objects to ensure the database's ACID properties) to lightweight locks (locks that protect shared data structures) for scanning the bucket pages. In general, acquiring a heavyweight lock is costlier than acquiring a lightweight lock. In addition to reducing the locking cost, this also avoids locking out scans and inserts for the lifetime of a bucket split.<br />
<br />
The second commit, 293e24e5, avoids a significant amount of contention on the metapage. Each search operation needs to access the metapage to find the bucket that contains the tuple being searched for, which leads to high contention around the metapage. Each access to the metapage also has to go through the buffer manager. This work avoids that contention by caching the metapage information in a backend-local cache, which bypasses all the buffer-manager-related work and hence the major contention in accessing the metapage.<br />
<br />
<br />
The next graph shows how the hash index performs as compared to the btree index. For this run, we changed the hash index to a btree index in the pgbench read-only tests.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtpaP-qu2UHlFamkS74yrsuUsD_H19hw7yQ0lzlkAe5dDboKl6yhww1dUzovnTnvy4uDB5zFM3Ck_1_BXRsNX9bVrHariYxu2qIY4KoCtUbOpHPh8UsNMi7GqrUXDpUBHDOjHeDfWEVKgD/s1600/read_only_hash_btree.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtpaP-qu2UHlFamkS74yrsuUsD_H19hw7yQ0lzlkAe5dDboKl6yhww1dUzovnTnvy4uDB5zFM3Ck_1_BXRsNX9bVrHariYxu2qIY4KoCtUbOpHPh8UsNMi7GqrUXDpUBHDOjHeDfWEVKgD/s400/read_only_hash_btree.png" width="400" /></a></div>
<br />
<br />
We can see here that the hash index performs better than the btree index, with a performance difference in the range of 10 to 22%. In some other workloads we have seen even better performance, such as with hash indexes on varchar columns, and in the <a href="https://www.postgresql.org/message-id/4575a870-1315-18ac-0516-c21a83a7afdf%40redhat.com" target="_blank">community</a> it has been reported that there is a performance improvement in the range of 40-60% when hash indexes are used for unique index columns.<br />
<br />
<br />
The important thing to note about the above data is that it covers only some specific workloads, mainly Selects, as that is the main area where performance improvement work has been done for PostgreSQL 10. Other interesting parameters to compare are the size of the index and updates on the index, which need more study and experiments.<br />
<br />
In the end, I would like to thank my colleagues who were directly involved in this work and my employer <a href="https://www.enterprisedb.com/" target="_blank">EnterpriseDB</a>, which has supported this work. Firstly, I would like to thank Robert Haas, who envisioned all this work and is the committer of this work, and Mithun C Y, who was the author of commit 293e24e5. I would also like to extend sincere thanks to all the community members who were involved in this work, especially Jeff Janes and Jesper Pedersen, who reviewed and tested it.</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com13tag:blogger.com,1999:blog-8673752770146776575.post-72585853935399155992016-03-11T07:56:00.000-08:002016-03-11T07:56:37.958-08:00Troubleshooting waits in PostgreSQL<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; color: #222222; font-family: Helvetica;">
When a PostgreSQL database becomes slow, especially on systems with high load, it can be difficult to find the exact reasons. One can use tools like perf, strace, dynamic tracing <span style="color: #222222; font-family: Helvetica;">(</span><a href="http://www.postgresql.org/docs/devel/static/dynamic-trace.html" style="color: #1155cc;" target="_blank">http://www.postgresql.org/<wbr></wbr>docs/devel/static/dynamic-<wbr></wbr>trace.html</a><span style="color: #222222; font-family: Helvetica;">), etc. to find the reasons for a slowdown, but most of the time they are quite inconvenient to use, which led to the development of a new feature to display wait event information in the pg_stat_activity view. Wait events capture information about where the system blocks or waits to perform some action, such as waiting for another backend process to release a heavyweight or lightweight lock, waiting to access a data buffer during a period when no other process can be examining that buffer, waiting to read or write data to disk, etc. As </span>part of the initial feature, we have covered some of the common wait event types that cause waits in the system; however, it is designed such that it can be extended to capture other types of wait events as well.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
I will briefly explain the wait event types covered as part of this feature and then explain with examples how one can use it to find stalls or waits in the system. The first wait event type is lightweight lock, which is used to protect a particular data structure in shared memory. The second wait event type is named lightweight lock tranche; this indicates that the server process is waiting for one of a group of related lightweight locks. The third wait event type is heavyweight lock, which is used primarily to protect SQL-visible objects such as tables. The fourth wait event type is BufferPin, where the server process waits to access a data buffer during a period when no other process can be examining that buffer. For a detailed explanation, refer to the PostgreSQL documentation at <a href="http://www.postgresql.org/docs/devel/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW" style="color: #1155cc;" target="_blank">http://www.postgresql.org/<wbr></wbr>docs/devel/static/monitoring-<wbr></wbr>stats.html#PG-STAT-ACTIVITY-<wbr></wbr>VIEW</a></div>
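<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
As a quick first look (a minimal sketch, assuming a server version that has the new wait_event columns), the following query summarizes what the system is waiting on right now:</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select wait_event_type, wait_event, count(*) from pg_stat_activity</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# where wait_event is NOT NULL group by wait_event_type, wait_event</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# order by count(*) desc;</div>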
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Now, let us try to understand, with the help of simple examples, how to find waits in the system using this powerful tool.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Create a table and insert data that will be used in the examples below:</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# create table wait_event_tbl(c1 int);</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
CREATE TABLE</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# insert into wait_event_tbl values(1);</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
INSERT 0 1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
wait event type - Lock (Heavyweight locks)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
------------------------------<wbr></wbr>-------------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Scenario - 1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Let us try to examine the waits in a scenario where one session has acquired an Access Exclusive lock on a table and another session wants to acquire an Access Share lock on the same table, and so is waiting for the first session to complete its transaction.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session -1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
6088</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# begin;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
BEGIN</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# Lock wait_event_tbl in Access Exclusive Mode;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
LOCK TABLE</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session-2</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# begin;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
BEGIN</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# Lock wait_event_tbl in Access Share Mode;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session-3</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, wait_event_type, wait_event from pg_stat_activity where wait_event is NOT NULL;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pid | wait_event_type | wait_event</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
------+-----------------+-----<wbr></wbr>-------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152 | Lock | relation</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Here, the above output shows that session-2 is waiting for a Lock on a relation. To know more about the relation, one can add the "query" column to the above statement.</div>
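<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
For instance (a sketch; the query column will show the text of the waiting statement, and the exact output varies):</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, wait_event_type, wait_event, query from pg_stat_activity where wait_event is NOT NULL;</div>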
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Scenario - 2</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Three sessions try to update the same row; the first one will be successful and the other two will be waiting.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session -1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
6088</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# begin;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
BEGIN</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# update wait_event_tbl set c1 = 2 where c1=1;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
UPDATE 1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 2</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# begin;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
BEGIN</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# update wait_event_tbl set c1 = 3 where c1 = 1;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 3</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
5404</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# begin;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
BEGIN</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# update wait_event_tbl set c1 = 4 where c1 = 1;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 4</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, wait_event_type, wait_event from pg_stat_activity where wait_event is NOT NULL;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pid | wait_event_type | wait_event</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
------+-----------------+-----<wbr></wbr>----------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152 | Lock | transactionid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
5404 | Lock | tuple</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(2 rows)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Here, the above output indicates that session-2 and session-3 are both waiting: session-2 for the transaction of session-1 to complete (wait_event "transactionid"), and session-3 for the tuple lock, as it is queued behind session-2 (wait_event "tuple").</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
To find detailed information about locks, you can join this information with pg_locks as described at <a href="https://wiki.postgresql.org/wiki/Lock_Monitoring" style="color: #1155cc;" target="_blank">https://wiki.postgresql.<wbr></wbr>org/wiki/Lock_Monitoring</a>, or in some other similar way.</div>
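<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
As a simplified sketch of such a join (the wiki page linked above has a more complete version), the following matches each waiting backend against a backend holding a conflicting lock on the same target:</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select blocked.pid as blocked_pid, blocking.pid as blocking_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# from pg_locks blocked join pg_locks blocking</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# on blocking.locktype = blocked.locktype</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# and blocking.database is not distinct from blocked.database</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# and blocking.relation is not distinct from blocked.relation</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# and blocking.transactionid is not distinct from blocked.transactionid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# and blocking.pid &lt;&gt; blocked.pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres-# where not blocked.granted and blocking.granted;</div>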
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
wait event type - LWLockNamed (Lightweight Locks)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
------------------------------<wbr></wbr>----------------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
A session executing an update statement and another session executing a select statement can block each other for a short time.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 1</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# update wait_event_tbl set c1 = 2;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 2</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pg_backend_pid();</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pg_backend_pid</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
----------------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
6088</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select * from wait_event_tbl;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
Session - 3</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, wait_event_type, wait_event from pg_stat_activity where wait_event is NOT NULL;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
pid | wait_event_type | wait_event</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
------+-----------------+-----<wbr></wbr>----------</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
1152 | LWLockNamed | ProcArrayLock</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
(1 row)</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
I created this scenario with the help of a debugger, but it is quite possible to see such wait events during high load on the system.</div>
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
One point to note for users who use the "waiting" column of pg_stat_activity to find blocking statements is that they will need to change their queries for the next version (presumably 9.6) of PostgreSQL, as the waiting column has been removed from pg_stat_activity. This is an intentional decision taken by the PostgreSQL community to ease the use and understanding of this feature, especially for future versions.</div>
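<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
For example, a monitoring query of the old form:</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, query from pg_stat_activity where waiting;</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
could be rewritten along these lines (note that the new form also reports waits, such as lightweight lock waits, that the old boolean column never covered):</div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
postgres=# select pid, query from pg_stat_activity where wait_event is NOT NULL;</div>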
<div style="background-color: white; color: #222222; font-family: Helvetica; min-height: 11px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: Helvetica;">
This feature has been committed to the PostgreSQL code. For details, you can refer to commit id - 53be0b1add7064ca5db3cd884302df<wbr></wbr>c3268d884e. It took us approximately 9 months to complete this feature. Thanks to all<span style="color: #222222; font-family: Helvetica;"> the PostgreSQL community members who gave their valuable feedback throughout the development of this feature, and special thanks to <a href="http://rhaas.blogspot.in/" target="_blank">Robert Haas</a> and Ildus Kurbangaliev for giving me tremendous support, both through reviews and </span>by helping write parts of the code. Thanks also to Alexander Korotkov for reviews and inputs on this feature, and last but not least, thanks to Thom Brown for inputs on the documentation.</div>
<br /></div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com4tag:blogger.com,1999:blog-8673752770146776575.post-19047035590953721222015-11-29T21:57:00.000-08:002015-11-29T21:57:43.825-08:00Parallel Sequential Scans in play<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
Parallelism is now a reality in PostgreSQL. With 9.6, I hope we will see many<br />
different forms of queries that can use parallelism to execute. For now, I will<br />
limit this discussion to what we can already do, which is Parallel Sequential<br />
Scans.<br />
<br />
Parallel Sequential Scans scan a relation in parallel with the help of<br />
background workers, which in turn improves the performance of such scans. I<br />
will discuss the scenarios where users can expect a performance boost<br />
due to this feature later in this blog, but first let us understand the basic feature<br />
and how it works. Three new GUC parameters have been added to tune the<br />
usage of this feature.<br />
<br />
max_parallel_degree - This is used to set the maximum number of workers that<br />
can be used for an individual parallel operation. It is quite possible that the<br />
requested number of workers is not available at execution time. Parallel workers<br />
are taken from the pool of processes established by max_worker_processes, which<br />
means that the value of max_parallel_degree should be less than max_worker_processes.<br />
It is usually not useful to set the value of this parameter higher than the number of<br />
CPUs on your system.<br />
<br />
parallel_tuple_cost - This is used by the planner to estimate the cost of transferring a<br />
tuple from a parallel worker process to the master backend. The default is 0.1. The more<br />
tuples that need to be passed from the worker processes to the master backend,<br />
the higher this cost component will be, and the higher the overall cost of the<br />
parallel sequential scan plan.<br />
<br />
parallel_setup_cost - This is used by the planner to estimate the cost of launching parallel<br />
worker processes and setting up the dynamic shared memory used for communication.<br />
The default is 1000.<br />
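<br />
As a rough, simplified illustration of how these two parameters combine (this ignores the<br />
per-page and per-row scan costs, which are divided among the workers): a plan that has to<br />
transfer 10,000 tuples through the Gather node would be charged roughly<br />
parallel_setup_cost + 10000 * parallel_tuple_cost, i.e. with the defaults:<br />
<pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> select 1000 + 10000 * 0.1 as parallel_overhead; -- 2000 cost units
</code></pre>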
<br />
Now let us see the simple example to demonstrate how parallel sequential scan works:<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> create table tbl_parallel_test(c1 int, c2 char(1000));
insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa');
Analyze tbl_parallel_test;
Explain analyze select * from tbl_parallel_test where c1 < 10000 and
c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..157858.09 rows=1 width=1008)
(actual time=378.414..378.414 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.075 ms
Execution time: 378.431 ms
(5 rows)
</code></pre>
<br />
Set the max parallel degree to enable the use of parallelism in queries.
<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> set max_parallel_degree = 6;
Explain analyze select * from tbl_parallel_test where c1 < 10000
and c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..29701.57 rows=1 width=1008)
(actual time=182.708..182.708 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..28701.47 rows=1 width=1008)
(actual time=179.496..1081.120 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.078 ms
Execution time: 200.610 ms
(7 rows)
</code></pre>
<br />
Here, we can see how changing max_parallel_degree allows the use of parallel workers<br />
to perform parallel sequential scans. Notice that the Gather node's startup cost of 1000 is<br />
exactly the parallel_setup_cost. We can also see that even though we set<br />
max_parallel_degree to 6, only 5 workers are used; the reason is that currently the<br />
number of parallel workers is chosen based on the size of the relation.<br />
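<br />
The relation size that drives this choice can be checked with pg_relation_size (the exact<br />
size thresholds for choosing the number of workers are internal and may change):<br />
<pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> select pg_size_pretty(pg_relation_size('tbl_parallel_test'));
</code></pre>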
<br />
Next, let us discuss the usage of functions in parallel queries. A new clause, PARALLEL,<br />
has been added to the CREATE FUNCTION statement. There are three valid values that<br />
can be used with this clause.<br />
<br />
1. PARALLEL UNSAFE - This indicates that the function can't be executed in parallel mode,<br />
and the presence of such a function in a SQL statement forces a serial execution plan.<br />
2. PARALLEL RESTRICTED - This indicates that the function can be executed in parallel mode,<br />
but the execution is restricted to the parallel group leader. As of now, if the qualification for any<br />
particular relation has anything that is parallel restricted, that relation won't be chosen for<br />
parallelism.<br />
3. PARALLEL SAFE - This indicates that the function is safe to run in parallel mode without<br />
restriction.<br />
<br />
The default for functions is PARALLEL UNSAFE.<br />
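<br />
A function's current marking can be inspected in the pg_proc system catalog, where the<br />
proparallel column stores 'u' (unsafe), 'r' (restricted) or 's' (safe); substitute the name<br />
of your own function below:<br />
<pre style="background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> select proname, proparallel from pg_proc where proname = 'my_function';
</code></pre>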
<br />
Now let us see the impact of using a PARALLEL SAFE and a PARALLEL UNSAFE function in<br />
queries. I will continue using the query from the previous example to explain the concept.<br />
<br />
Create a Parallel Safe function<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> create or replace function calc_factorial(a integer, fact_val integer)
returns integer
as $$
begin
perform (fact_val)!;
return a;
end;
$$ language plpgsql PARALLEL Safe;
</code></pre>
Use it in query<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..75154.99 rows=1 width=1008)
(actual time=120566.456..120566.456 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..74154.89 rows=1 width=1008)
(actual time=119635.421..359721.498 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 54.904 ms
Execution time: 120622.631 ms
(7 rows)
</code></pre>
<br />
Here we can see that a parallel plan is chosen and that the parallel safe function<br />
is pushed down to the workers for evaluation of the quals.<br />
<br />
Now let's mark the function as PARALLEL UNSAFE and see how the above<br />
query behaves.<br />
<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> Alter Function calc_factorial(integer, integer) PARALLEL Unsafe;
Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..407851.91 rows=1 width=1008)
(actual time=33166.138..33166.138 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 0.162 ms
Execution time: 33166.208 ms
(5 rows)
</code></pre>
<br />
So using parallel unsafe functions in queries leads to serial plans.<br />
<br />
Next, let us see the Performance characteristics of Parallelism:<br />
<br />
Non-default settings used to collect performance data:<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> shared_buffers=32GB; min_wal_size=5GB; max_wal_size=10GB
checkpoint_timeout =30min; max_connections=300;
max_worker_processes=100;
</code></pre>
<br />
Test setup<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> create table tbl_perf(c1 int, c2 char(1000));
insert into tbl_perf values(generate_series(1,30000000),'aaaaa');
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial($1,10) and
c2 like '%aa%'; </code></pre>
The function calc_factorial is the same as in the previous example, and the values passed to it are chosen so that the desired percentage of rows is selected. For example:<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNAc-M-B8CxbfTClfbKc3MBVqJ3FeSGhayZIxcW-4X4kRS40U6A8B-g_03xrhg1-SPMNvIJCBsrj85yBryS1iqVVpeR5l3ze6zr4GXe0A12Ip-L81n4fUwydt1JWJYx4vwUM7M-HvXGnHt/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> -- to select 1% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and
c2 like '%aa%';
-- to select 10% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(27000000,10) and
c2 like '%aa%';
-- to select 25% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(22500000,10) and
c2 like '%aa%';
</code></pre>
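The way those argument values map to selectivities is simple arithmetic: since c1 holds the values 1..30,000,000, a predicate c1 &gt; v selects (30,000,000 - v) rows. A small sketch (a hypothetical helper, not part of the original test scripts) reproduces the thresholds used above:

```python
def threshold_for_selectivity(total_rows, fraction):
    """For a column holding 1..total_rows, return the value v such that
    the predicate 'c1 > v' selects roughly the given fraction of rows.
    Illustrative helper, not part of the original benchmark scripts."""
    return int(total_rows * (1 - fraction))

# 1%, 10%, and 25% of 30 million rows:
# threshold_for_selectivity(30_000_000, 0.01) -> 29700000
# threshold_for_selectivity(30_000_000, 0.10) -> 27000000
# threshold_for_selectivity(30_000_000, 0.25) -> 22500000
```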
Performance Data -<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaNZyKDJRDB_7yHrQ3YvMf8WsJzr_y61JWB0jqbqBFLRQqCSbh7-iwQy_nqeARkiaomRBDqvyjKGIPgP_EOZqaYUUViippNpVdTnign-AQXMIR2Vi_HxV9EHyOnH0FR_kQb5vwW8IanxY3/s1600/parallel_seqscan.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaNZyKDJRDB_7yHrQ3YvMf8WsJzr_y61JWB0jqbqBFLRQqCSbh7-iwQy_nqeARkiaomRBDqvyjKGIPgP_EOZqaYUUViippNpVdTnign-AQXMIR2Vi_HxV9EHyOnH0FR_kQb5vwW8IanxY3/s400/parallel_seqscan.png" width="400" /></a></div>
<br />
1. With an increase in the degree of parallelism (more parallel workers), the time to complete the execution reduces.<br />
2. Along with the workers, the master backend also participates in the execution, due to which you can see a bigger reduction in time in some cases.<br />
3. After a certain point, increasing the max parallel degree won't help.<br />
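Observation 3 is essentially Amdahl's law at work: the serial parts of the query (plan startup, gathering tuples in the master backend) cap the achievable speedup no matter how many workers are added. A rough, illustrative calculation (this is not the planner's actual cost model):

```python
def speedup(serial_fraction, workers):
    """Amdahl's-law estimate of query speedup: only the parallelizable
    fraction of the work benefits from extra workers, so the gain
    flattens out as the worker count grows."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

# With 10% serial work: 8 workers give ~4.7x, while 32 workers give
# only ~7.8x -- each doubling of workers buys less and less.
```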
<br />
The cases we have seen in this blog are mostly cases where parallel query helps by using the workers; however, there exist some cases, such as when the qualification is very cheap, where it hurts or doesn't help even when employing a larger number of workers. More investigation is needed to make sure that the planner won't choose such plans for parallelism.</div>
<b>Improved Writes in PostgreSQL For 9.6 (Part - 1)</b> (2015-08-08)<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #232323; font-family: Arial; font-size: 13px;">
Lately, PostgreSQL has gained attention because of the numerous performance improvements being made in various areas (for 9.5, the major areas, as covered in my <a href="https://www.pgcon.org/2015/schedule/attachments/378_postgresql-95-scalability-perf-improvements.pdf" target="_blank">PGCon presentation</a>, are read operations, sorting, plpgsql, a new index type for data access, and compression of full_page_writes). However, there is still more to be done to make it better than other commercial RDBMSs, and one of the important areas for improvement is write operations, as shown in one of my previous posts (<a href="http://amitkapila16.blogspot.in/2015/04/write-scalability-in-postgresql.html" target="_blank">Write Scalability in PostgreSQL</a>). During my investigation of write operations, I found locking bottlenecks that limit performance; the lock that contends the most is ProcArrayLock, which is used during transaction commit and while taking snapshots.</div>
<div style="color: #232323; font-family: Arial; font-size: 13px; min-height: 15px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
Removing the contention around ProcArrayLock gives a very good boost in performance, especially at higher client counts, and this work has been done for PostgreSQL 9.6. To start with, let us first discuss the improvement in terms of TPS (transactions per second) after this work. I ran a pgbench read-write (sort of TPC-B) workload to compare the performance with and without this commit on an Intel machine having 8 sockets, 64 cores (128 hardware threads), and 500GB RAM; here is the performance data (running the same tests on an IBM POWER-8 machine also shows a similar gain):</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlN8sUcGcnY1KPrvr9mz1OKCoIysyXYAnvUhR-gAKUp_R_THvNvDybK-9vnDuWdfvY2w2sBRAkpWsCvUum9B66IDd5w1jlQJWVuRBlkxby_kYQgnEp9VJFoluP4Y7gHlcUgCxC2Cyezc4b/s1600/improved_writes_procarraylock.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlN8sUcGcnY1KPrvr9mz1OKCoIysyXYAnvUhR-gAKUp_R_THvNvDybK-9vnDuWdfvY2w2sBRAkpWsCvUum9B66IDd5w1jlQJWVuRBlkxby_kYQgnEp9VJFoluP4Y7gHlcUgCxC2Cyezc4b/s640/improved_writes_procarraylock.png" width="640" /></a></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
Non-default settings used in all the tests are:<br />
max_connections = 300<br />
shared_buffers = 8GB<br />
wal_buffers = 256MB<br />
min_wal_size=10GB<br />
max_wal_size=15GB<br />
checkpoint_timeout =35min<br />
maintenance_work_mem = 1GB<br />
checkpoint_completion_target = 0.9<br />
<div style="min-height: 15px;">
<br /></div>
The data is taken when all the data fits in shared_buffers, as this work mainly helps such cases. The performance increase is visible at somewhat higher client counts: at 64 clients we see a 30% improvement, and at 256 clients the improvement is 133%. At lower client counts (8 or 16 clients), there is not much difference (due to fluctuation I see a 1-2% difference, but I think this work doesn't help such cases).<br />
<div style="min-height: 15px;">
<br /></div>
Now coming to the work done to improve the performance: presently, for the correctness requirement of taking snapshots, PostgreSQL enforces strict serialization of commits and rollbacks with snapshot-taking: it doesn't allow any transaction to exit the set of running transactions while a snapshot is being taken. To achieve this, a backend taking a snapshot acquires ProcArrayLock in SHARED mode, and each exiting transaction acquires it in EXCLUSIVE mode. So in this protocol there are two different types of contention: one between a backend trying to acquire a snapshot and a backend trying to commit a transaction, and a second among backends that are trying to commit transactions at the same time. The idea used in this work is to allow only one backend at a time (call it the group leader) to take ProcArrayLock and complete the work for all the other transactions that are trying to commit at the same time. This minimizes ProcArrayLock acquisitions in EXCLUSIVE mode, which in turn greatly reduces the contention around it.<br />
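The group-leader protocol described above can be sketched in a few lines of Python (purely illustrative; the actual implementation is C code in PostgreSQL's procarray.c). The class names and structure here are my own, not PostgreSQL's:

```python
import threading

class GroupLeaderClear:
    """Illustrative group-leader pattern: the first committer to arrive
    becomes the leader, takes the heavyweight lock once, and clears every
    queued transaction on its behalf, instead of one EXCLUSIVE
    acquisition per transaction."""

    def __init__(self):
        self._proc_array_lock = threading.Lock()  # stands in for ProcArrayLock
        self._cv = threading.Condition()          # protects the commit queue
        self._pending = []
        self.cleared = set()
        self.exclusive_acquisitions = 0

    def commit(self, xid):
        with self._cv:
            self._pending.append(xid)
            if len(self._pending) > 1:
                # A leader is already active; wait for it to clear us.
                while xid not in self.cleared:
                    self._cv.wait()
                return
        # We are the group leader: one EXCLUSIVE acquisition for the batch.
        with self._proc_array_lock:
            with self._cv:
                batch, self._pending = self._pending, []
                self.exclusive_acquisitions += 1
                self.cleared.update(batch)
                self._cv.notify_all()
```

When many threads call commit() concurrently, the number of exclusive acquisitions can be far smaller than the number of committing transactions, which is exactly where the contention reduction comes from.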
<div style="min-height: 15px;">
<br /></div>
Apart from the benefit this patch brings, it also opens up the opportunity for more optimisations to reduce the contention on various other locks, such as CLogControlLock and WALWriteLock, in PostgreSQL, which I see as a huge benefit for write operations. I hope to see more improvements for write operations and to cover them in future blogs.<br />
<div style="min-height: 15px;">
<br /></div>
Last but not least, I would like to thank all who were involved in this work. Firstly, I would like to thank my employer EnterpriseDB and <a href="http://rhaas.blogspot.in/" target="_blank">Robert Haas</a>, who not only encouraged me to work in this area, but also helped at various stages of this patch's development. When I was in the middle of this work and wanted feedback and suggestions, a lot of people (during PGCon) shared their thoughts with me, and among those who really helped me move this work to a level where it could be presented to the PostgreSQL community are Robert Haas, <a href="https://www.linkedin.com/pub/andres-freund/a5/430/372" target="_blank">Andres Freund</a>, and <a href="https://www.linkedin.com/in/simonat2ndquadrantdotcom" target="_blank">Simon Riggs</a>. In the end, I would also like to thank Pavan Deolasee, who reviewed this patch.<br />
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
</div>
<b>Extend Tar Format in pg_basebackup</b> (2015-05-13)<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #232323; font-family: Arial; font-size: 12px; text-align: justify;">
As of now, one can't reliably use the tar format to take a backup on Windows, because it can't restore tablespace data, which is stored in the form of symbolic links under &lt;data_directory&gt;/pg_tblspc/. The reason is that native Windows utilities are not able to create symbolic links while extracting files from a tar archive. It might be possible to create symbolic links if Cygwin is installed on your system; however, we need this feature to work on native Windows as well.</div>
<div style="color: #232323; font-family: Arial; font-size: 12px; min-height: 14px; text-align: justify;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 12px; text-align: justify;">
In PostgreSQL 9.5, a new feature (commit id - <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=72d422a5227ef6f76f412486a395aba9f53bf3f0">72d422a5</a>) that extends the existing tar format made it possible to reliably take the backup in tar mode. From the user's perspective, nothing much has changed, except that the tar format mode (--format=tar) in the PostgreSQL 9.5 version of pg_basebackup will only work with server version 9.5 or later. This feature is mainly required for Windows, but for the sake of consistency it has been changed for all platforms; it should also enable long (length greater than 99 characters) symbolic-link targets for the tar format (I think those changes are still not done, but we can do them now that this feature is committed).</div>
<div style="color: #232323; font-family: Arial; font-size: 12px; min-height: 14px; text-align: justify;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 12px; text-align: justify;">
The basic idea behind the feature is that it forms a tablespace map of all the tablespace symbolic links present inside &lt;data_directory&gt;/pg_tblspc/, and stores it in the data directory for exclusive backups (i.e., backups taken via the pg_start_backup() and pg_stop_backup() functions) and in the backup archive for non-exclusive backups (i.e., backups taken by pg_basebackup).</div>
<div style="color: #232323; font-family: Arial; font-size: 12px; min-height: 14px; text-align: justify;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 12px; text-align: justify;">
The format of the tablespace_map file is:</div>
<div style="font-family: Arial; font-size: 12px; text-align: justify;">
16384 E:\WorkSpace\PostgreSQL\master\tbs</div>
<div style="font-family: Arial; font-size: 12px; text-align: justify;">
16388 E:\WorkSpace\PostgreSQL\master\tbs 2 3</div>
<div style="font-family: Arial; font-size: 12px; min-height: 14px; text-align: justify;">
<br /></div>
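Note that the link target may itself contain spaces, as the second line above shows, so anything reading this file has to split each line only at the first space. A minimal illustrative parser (my own sketch, not PostgreSQL's actual code):

```python
def parse_tablespace_map(text):
    """Parse lines of the form '<oid> <link target>'. The target path
    may itself contain spaces, so split only at the first space.
    Illustrative reader, not PostgreSQL's parser."""
    mapping = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        oid, _, target = line.partition(' ')
        mapping[int(oid)] = target
    return mapping

# Using the two sample lines from above:
# parse_tablespace_map("16388 E:\\WorkSpace\\PostgreSQL\\master\\tbs 2 3")
# maps 16388 to the full path including its embedded spaces.
```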
<div style="font-family: Arial; font-size: 12px; text-align: justify;">
The tablespace symbolic links are restored during archive recovery, and the tablespace_map file is renamed to tablespace_map.old at the end of recovery, similar to the backup_label file.</div>
</div>
<b>Write Scalability in PostgreSQL</b> (2015-04-17)<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="text-align: justify;">
<br /></div>
<br />
<div style="text-align: justify;">
<span style="color: #1a1a1a; font-family: ArialMT;">I ran some benchmark tests to see the write performance/scalability in PostgreSQL 9.5 and thought it would be good to share the results with others, so I am writing this blog post.</span></div>
<br />
<div style="color: #1a1a1a; font-family: ArialMT; text-align: justify;">
I ran pgbench tests (TPC-B (sort of) load) to compare the performance difference between different modes and scale factors in HEAD (e5f455f5) on an IBM POWER-8 machine having 24 cores, 192 hardware threads, and 492GB RAM, and here are the performance results:</div>
<div style="color: #1a1a1a; font-family: ArialMT; text-align: justify;">
<span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><br /></span></span></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAdQy1KzhGgEjUA5h76K842Cgcf4igmR6j98xrQHG6WMyIGWsfSVrSq9URWpeCLhDboO1bt8u0oTG4ojaDp8y46wCrM0uf14mxdeHi6Q0fPVHr86Z2JbVGNv5ktDz-7yaI9xu51uLrSPAY/s1600/writes_sync_off.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: justify;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAdQy1KzhGgEjUA5h76K842Cgcf4igmR6j98xrQHG6WMyIGWsfSVrSq9URWpeCLhDboO1bt8u0oTG4ojaDp8y46wCrM0uf14mxdeHi6Q0fPVHr86Z2JbVGNv5ktDz-7yaI9xu51uLrSPAY/s1600/writes_sync_off.png" height="240" width="320" /></a></div>
<span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><span style="color: #1a1a1a; font-family: ArialMT; font-size: medium;"><br /></span></span></span></div>
<div style="color: #1a1a1a; font-family: ArialMT; text-align: justify;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
</div>
<div>
<div class="separator" style="clear: both; color: #232323; font-family: Arial; font-size: 13px; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqM2uOe52aYd3gX2Ypqvjh-6nYtlGxMinUmP4_O2OFUSDUbeUyRrljM5Yh99Uzp5h1OA0AbW-3qh3lXP2hd3ne2B6bznFoM62BzMN5IGNCGrJ0cJom4ckKGvuZYmzkZgAvU-nwuIZDrQK2/s1600/writes_sync_on.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: justify;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqM2uOe52aYd3gX2Ypqvjh-6nYtlGxMinUmP4_O2OFUSDUbeUyRrljM5Yh99Uzp5h1OA0AbW-3qh3lXP2hd3ne2B6bznFoM62BzMN5IGNCGrJ0cJom4ckKGvuZYmzkZgAvU-nwuIZDrQK2/s1600/writes_sync_on.png" height="240" width="320" /></a></div>
<br />
<div style="text-align: justify;">
<span style="color: #232323; font-family: Arial; font-size: 13px;"><br /></span></div>
</div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<span style="text-align: justify;">Some of the non-default settings used in all the tests are:</span></div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
min_wal_size=15GB</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
max_wal_size=20GB</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
checkpoint_timeout =35min</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
maintenance_work_mem = 1GB</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
checkpoint_completion_target = 0.9</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
autovacuum=off</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
I have kept autovacuum off to reduce the fluctuation caused by it, and I am dropping and re-creating the database after each run. I have kept high values of min_wal_size and max_wal_size to reduce the effect of checkpoints; probably somewhat lower values could have served the purpose for this workload, but I haven't tried that.</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
The data is mainly taken for two modes (synchronous_commit = on | off) and at two different scale factors, to cover the case when all the data fits in shared buffers (scale_factor = 300) and the case when all the data can't fit in shared buffers but can fit in RAM (scale_factor = 3000).</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
First lets talk about synchronous_commit = off case, here when all the data fits in</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
with TPS being approximately 75 percent higher at 64 client-count as compare to 8</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
client count which doesn't look bad. When all the data doesn't fit in shared buffers,</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
but fit in RAM (scale_factor = 3000), we can see scalability upto 32 client-count with</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
TPS being 64 percent higher than at 8 client-count and then it falls there on.</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
One major difference in case of Writes when data doesn't fit in shared_buffers is</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
that backends performing transactions needs to write the dirty buffers themselves</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
when they are not able to find a clean buffer to read the page, this can hamper</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
the TPS.</div>
</div>
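To make this concrete, here is a toy Python model of the behaviour just described; it is purely illustrative (the names and structure are mine, not PostgreSQL's), but it shows how a backend running the clock sweep ends up paying for the write of a dirty victim buffer itself:

```python
# Illustrative model (not PostgreSQL source) of a backend hunting for a
# victim buffer: recently used buffers get a second chance, and a dirty
# victim must be written out by the backend before it can be reused.

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.dirty = False
        self.page = None

def get_victim_buffer(pool, writes_done):
    """Clock sweep: skip recently used buffers, flush a dirty victim."""
    i = 0
    while True:
        buf = pool[i % len(pool)]
        if buf.usage_count > 0:
            buf.usage_count -= 1          # second chance for recently used buffers
        else:
            if buf.dirty:
                writes_done.append(buf.page)  # the backend pays for this write
                buf.dirty = False
            return buf
        i += 1

pool = [Buffer() for _ in range(4)]
pool[0].usage_count = 1
pool[1].dirty, pool[1].page = True, "page-42"
writes = []
victim = get_victim_buffer(pool, writes)
assert victim is pool[1] and writes == ["page-42"]
```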
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
Now let's talk about synchronous_commit = on case, here when all the data fits in</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
with TPS being approximately 189 percent higher at 64 client-count as compare to</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
8 client count which sounds good. When all the data doesn't fit in shared buffers,</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
but fit in RAM (scale_factor = 3000), we can see a pretty flat graph with some</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
scalability upto 16 client-count with TPS being approximately 22 percent higher than</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
at 8 client-count and then it stays as it is.</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
Here one point to note is that when the data fits in shared_buffers (scale_factor = 300),</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
TPS at higher client-count (64) in synchronous_commit = on mode becomes equivalent to </div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
TPS in synchronous_commit = off which suggests that there is no major contention</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
due to WAL writing in such loads.</div>
</div>
<div style="color: #232323; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
In synchronous_commit = on case, when the data doesn't fit in shared_buffers </div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
(scale_factor = 3000), the TPS is quite low and one reason is that backends might</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
be performing writes themselves, but not sure if the performance is so low just</div>
</div>
<div style="color: #232323; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
due to that reason as I have tried with different values of Bgwriter related parameters</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
<span style="color: #232323;">(</span>bgwriter_delay, bgwriter_lru_maxpages, bgwriter_lru_multiplier), but there is no</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
much difference.</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; min-height: 15px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
As per my knowledge, the locks that can lead to contention for this workload</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
are:</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
a. ProcArrayLock (used for taking snapshots and at transaction commit)</div>
</div>
<div style="color: #1a1a1a; font-family: Arial; text-align: left;">
<div style="text-align: justify;">
b. WALWriteLock (used for performing WALWrites)</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
c. CLOGControlLock (used to read and write transaction status)</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
d. <span style="font-family: Helvetica;">WALInsertLocks</span> (used for writing data to WAL buffer)</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; min-height: 18px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
I think among these ProcArrayLock and WALWriteLock are the candidates</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
which can be the reason for contention, but I haven't done any deep analysis</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
to find out the same.</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; min-height: 18px; text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
Now it could be that the bottleneck is due to multiple locks as was the case</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
for read operations which I have explained in my previous <a href="http://amitkapila16.blogspot.in/2015/01/read-scalability-in-postgresql-95.html">Read Scalability</a></div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
blog or it could be due to one of these locks. I think all this needs further</div>
</div>
<div style="color: #1a1a1a; font-family: 'Arial Unicode MS'; text-align: left;">
<div style="text-align: justify;">
analysis and work. Thats all what I want to say for now.</div>
</div>
<div style="color: #1a1a1a; font-family: ArialMT;">
</div>
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com7tag:blogger.com,1999:blog-8673752770146776575.post-558184696220754462015-03-17T07:39:00.000-07:002015-03-17T07:40:11.039-07:00Different Approaches for MVCC used in well known Databases<div dir="ltr" style="text-align: left;" trbidi="on">
Database management systems use MVCC to avoid the problem of<br />
writers blocking readers and vice versa, by making use of multiple<br />
versions of data.<br />
<br />
There are essentially two approaches to multi-version concurrency.<br />
<br />
<b>Approaches for MVCC</b><br />
The first approach is to store multiple versions of records in the<br />
database, and garbage collect records when they are no longer<br />
required. This is the approach adopted by PostgreSQL and<br />
Firebird/Interbase. SQL Server also uses a somewhat similar approach,<br />
with the difference that old versions are stored in tempdb<br />
(a database separate from the main database).<br />
<br />
The second approach is to keep only the latest version of data in<br />
the database, but reconstruct older versions of data dynamically<br />
as required by using undo. This is the approach adopted by Oracle<br />
and MySQL/InnoDB.<br />
<br />
<br />
<b>MVCC in PostgreSQL</b><br />
In PostgreSQL, when a row is updated, a new version (called a tuple)<br />
of the row is created and inserted into the table. The previous version<br />
is provided a pointer to the new version. The previous version is<br />
marked “expired”, but remains in the database until it is garbage collected.<br />
<br />
In order to support multi-versioning, each tuple has additional data<br />
recorded with it:<br />
xmin - The ID of the transaction that inserted/updated the<br />
row and created this tuple.<br />
xmax - The transaction that deleted the row, or created a<br />
new version of this tuple. Initially this field is null.<br />
<br />
Transaction status is maintained in CLOG which resides in $Data/pg_clog.<br />
This table contains two bits of status information for each transaction;<br />
the possible states are in-progress, committed, or aborted.<br />
<br />
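As an illustration of the "two bits of status per transaction" layout, here is a small Python sketch of how a transaction ID maps to the page, byte, and bit offset holding its status. The arithmetic is mine (an assumed 8KB page size and field names invented for illustration), not PostgreSQL source:

```python
# Illustrative arithmetic only: with two status bits per transaction,
# one byte holds the status of four transactions, and an 8KB CLOG page
# holds the status of 32768 transactions.
BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // BITS_PER_XACT  # 4

def clog_location(xid, page_size=8192):
    """Return (page, byte_in_page, bit_shift) where xid's 2 status bits live."""
    xacts_per_page = page_size * XACTS_PER_BYTE
    page = xid // xacts_per_page
    byte_in_page = (xid % xacts_per_page) // XACTS_PER_BYTE
    bit_shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT
    return page, byte_in_page, bit_shift

assert clog_location(5) == (0, 1, 2)  # xid 5: page 0, byte 1, bit offset 2
```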
PostgreSQL does not undo changes to database rows when a transaction<br />
aborts - it simply marks the transaction as aborted in CLOG. A PostgreSQL<br />
table therefore may contain data from aborted transactions.<br />
<br />
A Vacuum cleaner process is provided to garbage collect expired/aborted<br />
versions of a row. The Vacuum Cleaner also deletes index entries<br />
associated with tuples that are garbage collected.<br />
<br />
A tuple is visible if its xmin is valid and xmax is not.<br />
“Valid” means “either committed or the current transaction”.<br />
To avoid consulting the CLOG table repeatedly, PostgreSQL maintains<br />
status flags in the tuple that indicate whether the tuple is “known committed”<br />
or “known aborted”.<br />
<br />
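The visibility rule above can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical names; the real checks also consult snapshots, hint bits, and subtransactions:

```python
# Simplified sketch of the rule: a tuple is visible if its xmin is valid
# and its xmax is not, where "valid" means committed or the current txn.
COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in_progress"

def xid_valid(xid, current_xid, clog):
    """An xid is 'valid' if it committed or belongs to the current transaction."""
    if xid is None:
        return False
    return xid == current_xid or clog.get(xid) == COMMITTED

def tuple_visible(xmin, xmax, current_xid, clog):
    return xid_valid(xmin, current_xid, clog) and \
        not xid_valid(xmax, current_xid, clog)

clog = {100: COMMITTED, 101: ABORTED, 102: IN_PROGRESS}
assert tuple_visible(100, None, 103, clog)      # committed insert, never deleted
assert not tuple_visible(101, None, 103, clog)  # inserted by aborted txn
assert not tuple_visible(100, 100, 103, clog)   # deleted by committed txn
assert tuple_visible(103, None, 103, clog)      # inserted by current txn
```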
<br />
<b>MVCC in Oracle</b><br />
Oracle maintains old versions in rollback segments (also known as<br />
'undo log'). A transaction ID is not a sequential number; instead, it is<br />
made of a set of numbers that points to the transaction entry (slot) in a<br />
Rollback segment header.<br />
<br />
Rollback segments have the property that new transactions can reuse<br />
storage and transaction slots used by older transactions that are<br />
committed or aborted.<br />
This automatic reuse facility enables Oracle to manage large numbers<br />
of transactions using a finite set of rollback segments.<br />
<br />
The header block of the rollback segment is used as a transaction table.<br />
Here the status of a transaction is maintained (called System Change Number,<br />
or SCN, in Oracle). Rather than storing a transaction ID with each row<br />
in the page, Oracle saves space by maintaining an array of unique transaction<br />
IDs separately within the page, and stores only the offset into this array with<br />
the row.<br />
<br />
Along with each transaction ID, Oracle stores a pointer to the last undo record<br />
created by the transaction for the page. Not only are table rows stored in this<br />
way, Oracle employs the same techniques when storing index rows. This is<br />
one of the major differences between PostgreSQL and Oracle.<br />
<br />
When an Oracle transaction starts, it makes a note of the current SCN. When<br />
reading a table or an index page, Oracle uses the SCN number to determine if<br />
the page contains the effects of transactions that should not be visible to the<br />
current transaction. Oracle checks the commit status of a transaction by<br />
looking up the associated Rollback segment header, but, to save time, the first<br />
time a transaction is looked up, its status is recorded in the page itself to avoid<br />
future lookups.<br />
<br />
If the page is found to contain the effects of invisible transactions, then Oracle<br />
recreates an older version of the page by undoing the effects of each such<br />
transaction. It scans the undo records associated with each transaction and<br />
applies them to the page until the effects of those transactions are removed.<br />
The new page created this way is then used to access the tuples within it.<br />
<br />
Record Header in Oracle<br />
A row header never grows; it is always a fixed size. For non-clustered tables,<br />
the row header is 3 bytes. One byte is used to store flags, one byte to<br />
indicate if the row is locked (for example because it's updated but not<br />
committed), and one byte for the column count.<br />
<br />
<br />
<b>MVCC in SQL Server</b><br />
Snapshot isolation and read committed using row versioning are enabled<br />
at the database level. Only databases that require this option must enable<br />
it and incur the overhead associated with it.<br />
<br />
Versioning effectively starts with a copy-on-write mechanism that is<br />
invoked when a row is modified or deleted. Row versioning–based<br />
transactions can effectively "view" the consistent version of the data<br />
from these previous row versions.<br />
<br />
Row versions are stored within the version store that is housed within the<br />
tempdb database. More specifically, when a record in a table or index is<br />
modified, the new record is stamped with the "sequence_number" of the<br />
transaction that is performing the modification.<br />
The old version of the record is copied to the version store, and the new record<br />
contains a pointer to the old record in the version store.<br />
If multiple long-running transactions exist and multiple "versions" are required,<br />
records in the version store might contain pointers to even earlier versions of<br />
the row.<br />
<br />
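A hedged sketch of how such a version chain might be walked: the newest record points back into the version store, and a snapshot reader follows the chain until it finds a version created before its snapshot. The field names here are invented for illustration; SQL Server's actual structures differ:

```python
# Toy model of a version-store chain: each version carries the sequence
# number of the transaction that created it and a pointer to the older
# version (which would live in tempdb's version store).

def visible_version(record, snapshot_seq):
    """Walk from the newest version back until one predates the snapshot."""
    version = record
    while version is not None and version["seq"] >= snapshot_seq:
        version = version["older"]
    return version

v1 = {"seq": 10, "value": "first", "older": None}
v2 = {"seq": 20, "value": "second", "older": v1}
v3 = {"seq": 30, "value": "third", "older": v2}

assert visible_version(v3, 25)["value"] == "second"
assert visible_version(v3, 35)["value"] == "third"
assert visible_version(v3, 5) is None   # no version old enough survives
```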
Version store cleanup in SQL Server<br />
SQL Server manages the version store size automatically, and maintains a<br />
cleanup thread to make sure it does not keep versioned rows around longer<br />
than needed. For queries running under Snapshot Isolation, the version<br />
store retains the row versions until the transaction that modified the data<br />
completes and the transactions containing any statements that reference the<br />
modified data complete. For SELECT statements running under<br />
Read Committed Snapshot Isolation, a particular row version is no longer<br />
required, and is removed, once the SELECT statement has executed.<br />
<br />
If tempdb actually runs out of free space, SQL Server calls the cleanup<br />
function and will increase the size of the files, assuming we configured the<br />
files for auto-grow. If the disk gets so full that the files cannot grow,<br />
SQL Server will stop generating versions. If that happens, any snapshot<br />
query that needs to read a version that was not generated due to space<br />
constraints will fail.<br />
<br />
Record Header in SQL Server<br />
4 bytes long<br />
- two bytes of record metadata (record type)<br />
- two bytes pointing forward in the record to the NULL bitmap. This is<br />
offset to some actual data in record (fixed length columns).<br />
<br />
Versioning tag - this is a 14-byte structure that contains a timestamp<br />
plus a pointer into the version store in tempdb.<br />
Here the timestamp is the transaction_seq_number. The only time that rows get<br />
versioning info added to a record is when it’s needed to support a<br />
versioning operation.<br />
<br />
As the versioning information is optional, I think that is the reason<br />
they could store this info in index records as well without much<br />
impact.<br />
<br />
<table border="2" bordercolor="#0033FF" cellpadding="3" cellspacing="3" style="background-color: #99ffff; width: 100%;">
<tbody>
<tr>
<th>Database</th>
<th>PostgreSQL</th>
<th>Oracle</th>
<th>SQL Server</th>
</tr>
<tr>
<td>Storage for Old Versions</td>
<td>In the main Segment
(Heap/Index)</td>
<td>In the separate
segment (Rollback
Segment/Undo)</td>
<td>In the separate
database (tempdb
– known as
version store)</td>
</tr>
<tr>
<td>Size of Tuple
Header (bytes)</td>
<td>24</td>
<td>3</td>
<td>Fixed – 4
Variable - 14</td>
</tr>
<tr>
<td>Clean up</td>
<td>Vacuum</td>
<td>System Monitor
Process (SMON)</td>
<td>Ghost Cleanup
task</td>
</tr>
</tbody></table>
<br />
<b>Conclusion of study</b><br />
As the other databases store version/visibility information in the index, index<br />
cleanup is easier for them (it is no longer tied to the heap for visibility<br />
information). The advantage of not storing visibility information in the index<br />
is that for Delete operations we don't need to perform an index delete, and<br />
the index record can probably be somewhat smaller.<br />
<br />
Oracle, and probably MySQL (InnoDB), needs to write a record to the undo<br />
segment even for an Insert statement, whereas in PostgreSQL/SQL Server a new<br />
record version is created only when a row is modified or deleted.<br />
<br />
Only the changed values are written to undo, whereas PostgreSQL/SQL Server<br />
create a complete new tuple for a modified row; the undo approach thus avoids<br />
bloat in the main heap segment.<br />
<br />
Both Oracle and SQL Server have some way to restrict the growth of version<br />
information, whereas PostgreSQL/PPAS doesn't have any.</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com16tag:blogger.com,1999:blog-8673752770146776575.post-52163566232466026602015-01-24T02:36:00.000-08:002015-01-26T05:27:06.550-08:00Read Scalability in PostgreSQL 9.5<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="color: #232323; font-family: Arial; font-size: 13px;">
In PostgreSQL 9.5, we will see a boost in scalability for read workloads<br />
when the data can fit in RAM. I ran a pgbench read-only workload to<br />
compare the performance between 9.4 and HEAD (62f5e447)<br />
on an IBM POWER-8 machine with 24 cores, 192 hardware threads, and 492GB RAM;<br />
here is the performance data</div>
<div>
<div class="separator" style="clear: both; color: #232323; font-family: Arial; font-size: 13px; text-align: center;">
</div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGylr_qkYbhUBFo4uemHgtjKxKrN9bTNDrERdhaKgTeHgZCf_U0hgtuNf4pA6GzYmvDqvsKOdH15jhw_zopnfDkVMlDJ69hZJtW710aq7HDe9NxlhnyyPjYlHeOeOH5jz-ZE9FZQPWHVeZ/s1600/perf_data_sc_300.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGylr_qkYbhUBFo4uemHgtjKxKrN9bTNDrERdhaKgTeHgZCf_U0hgtuNf4pA6GzYmvDqvsKOdH15jhw_zopnfDkVMlDJ69hZJtW710aq7HDe9NxlhnyyPjYlHeOeOH5jz-ZE9FZQPWHVeZ/s1600/perf_data_sc_300.png" height="200" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCrRUv7qVPUT2dEocVUDRauPPGzoPcEZ5KgiySPFySD4kqlg_A0beaTXXD76jFrmo_FCueA3vzhpTYGb8mR38rfxXuNYhNcbARP5_7loi-N8VgdN54c3uDo-c_sh3sQ1iRkBdw6Ib8j1jG/s1600/perf_data_sc_1000.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCrRUv7qVPUT2dEocVUDRauPPGzoPcEZ5KgiySPFySD4kqlg_A0beaTXXD76jFrmo_FCueA3vzhpTYGb8mR38rfxXuNYhNcbARP5_7loi-N8VgdN54c3uDo-c_sh3sQ1iRkBdw6Ib8j1jG/s1600/perf_data_sc_1000.png" height="202" width="400" /></a><br />
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
</div>
<div style="text-align: justify;">
The data has mainly been taken for two kinds of workloads: when all the data
fits in shared buffers (scale_factor = 300) and when all the data can't fit in
shared buffers, but can fit in RAM (scale_factor = 1000).</div>
<div style="min-height: 15px;">
<div style="text-align: justify;">
<br /></div>
</div>
<div style="text-align: justify;">
First, let's talk about the 300 scale factor case. In 9.4 throughput peaks at
32 clients; now it peaks at 64 clients, with a performance improvement of up to
~98%, and it is better at every client count from 32 upwards. The main work
that led to this improvement is
commit <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ab5194e6f617a9a9e7aadb3dd1cee948a42d0755">ab5194e6</a> (Improve LWLock scalability). The previous implementation<br />
had a bottleneck around the spinlocks acquired for LWLock acquisition and<br />
release; the 9.5 implementation changes LWLocks to use atomic operations<br />
to manipulate their state. Thanks to Andres Freund, the author of this patch<br />
(and, in my opinion, credit also goes to the reviewers, Robert Haas and<br />
myself, who reviewed multiple versions of it), due to whom many PostgreSQL<br />
users will be happy.</div>
<div style="text-align: justify;">
<br /></div>
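To illustrate the shape of that change, here is a conceptual Python sketch of the atomic-operations approach. Python has no hardware compare-and-swap, so one is simulated with a lock, and the state layout (a shared-holder count plus an exclusive bit) is invented for illustration, not PostgreSQL's actual LWLock layout:

```python
import threading

class AtomicInt:
    """Simulated atomic integer; the inner lock stands in for hardware CAS."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

EXCLUSIVE_BIT = 0x10000  # illustrative layout: low bits count shared holders

def acquire_shared(state):
    """Retry a CAS loop until we register ourselves as a shared holder."""
    while True:
        old = state._value
        if old & EXCLUSIVE_BIT:
            continue                      # held exclusively; retry
        if state.compare_and_swap(old, old + 1):
            return                        # no spinlock needed around the state

state = AtomicInt(0)
acquire_shared(state)
acquire_shared(state)
assert state._value == 2
```

The point of the real change is the same as in this sketch: readers update the lock state with a single atomic instruction instead of serializing on a spinlock around it.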
<div style="text-align: justify;">
Now let's discuss the 1000 scale factor case. Here we see a good performance
improvement (~25%) even at 32 clients, rising to ~96% at higher client counts.
In this case too, where 9.4 peaked at 32 clients, 9.5 peaks at 64 clients, and
performance is better at all higher client counts. The main work that led to
this improvement is commit <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5d7962c6797c0baae9ffb3b5b9ac0aec7b598bc3" rel="nofollow">5d7962c6</a> (Change locking
regimen around buffer replacement) and commit <a href="http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892">3acc10c9</a> (Increase
the number of buffer mapping partitions to 128). There were mainly two
bottlenecks here: (a) the BufFreeList LWLock was acquired to find a free buffer
for a page (to find a free buffer, it needs to run the clock sweep), which
becomes a bottleneck when many clients try to perform the same action
simultaneously; and (b) to change the association of a buffer in the buffer
mapping hash table, an LWLock is acquired on the hash partition to which the
buffer belongs, and as there were just 16 such partitions, there was huge
contention when multiple clients started operating on the same partition. To
reduce bottleneck (a), we now use a spinlock which is held just long enough to
pop the freelist or advance the clock sweep hand, and is then released; if we
need to advance the clock sweep further, we reacquire the spinlock once per
buffer. To reduce bottleneck (b), we increased the number of buffer partitions
to 128. The crux of this improvement is that we had to resolve both
bottlenecks (a and b) together to see a major improvement in scalability. The
initial patch for this improvement was prepared by me; Robert Haas then
extracted the important part of the patch and committed it. Many thanks to
both Robert Haas and Andres Freund, who not only reviewed the patch but also
gave a lot of useful suggestions during this work.</div>
<div style="min-height: 15px;">
<div style="text-align: justify;">
<br /></div>
</div>
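A back-of-the-envelope model shows why going from 16 to 128 buffer-mapping partitions helps: it is simple birthday-problem arithmetic under a uniform-hashing assumption, not a measurement of the actual patch:

```python
# Probability that at least two of `clients` concurrent lookups hash to
# the same partition lock, assuming each picks a partition uniformly at
# random. With 16 partitions a handful of clients almost always collide;
# 128 partitions leaves far more headroom.

def collision_probability(clients, partitions):
    p_no_collision = 1.0
    for i in range(clients):
        p_no_collision *= (partitions - i) / partitions
    return 1 - p_no_collision

assert collision_probability(8, 16) > collision_probability(8, 128)
assert collision_probability(8, 16) > 0.8   # collisions nearly certain at 16
```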
<br />
<div style="text-align: justify;">
During the work on improvements in buffer management, I noticed that the next
big bottleneck that could buy us a reasonably good improvement in read
workloads is the dynamic hash tables used to manage shared buffers, so
improving the concurrency of those hash tables could further improve read
operations. There has been some discussion about using a concurrent hash table
for shared buffers (a patch by Robert Haas), but it has not materialized
yet.</div>
<div style="text-align: justify;">
<br /></div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
</div>
</div>
<div style="color: #232323; font-family: Arial; font-size: 13px;">
</div>
</div>
Amit Kapilahttp://www.blogger.com/profile/01948926447381079550noreply@blogger.com17