PostgreSQL and Databases in general: 2015

Sunday 29 November 2015

Parallel Sequential Scans in play

Parallelism is now reality in PostgreSQL. With 9.6, I hope we will see many
different form of queries that can use parallelism to execute. For now, I will
limit this discussion to what we can already do, which is Parallel Sequential
Scans.

Parallel Sequential Scans are used to scan a relation parallely with the help of
background workers which in turns improve the performance of such scans. I
will discuss about the scenarios where user can expect a performance boost
due to this feature later in this blog, but first let us understand the basic feature
and how it works. Three new GUC parameters have been added to tune the
usage of this feature.

max_parallel_degree - This is used to set the maximum number of workers that
can be used for an individual parallel operation. It is very well possible that the
requested number of workers are not available at execution time. Parallel workers
are taken from the pool of processes established by max_worker_processes which
means that value of max_parallel_degree should be lesser than max_worker_processes.
It might not be useful to set the value of this parameter more than the number of CPU
count on your system.

parallel_tuple_cost - This is used by planner to estimate the cost of transferring a
tuple from parallel worker process to master backend. The default is 0.1. The more
the number of tuples that needs to be passed from worker backend processes to
master backend process, the more this cost will be and more overall cost of
parallel sequential scan plan.

parallel_setup_cost - This is used by planner to estimate the cost of launching parallel
worker processes and setting up dynamic shared memory to communicate.
The default is 1000.

Now let us see the simple example to demonstrate how parallel sequential scan works:

 create table tbl_parallel_test(c1 int, c2 char(1000));   
 insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa');   
 Analyze tbl_parallel_test;   
 Explain analyze select * from tbl_parallel_test where c1 < 10000 and  
 c2 like '%bb%';   
               QUERY PLAN              
  -------------------------------------------------------------------------------------------------------------   
  Seq Scan on tbl_parallel_test  
          (cost=0.00..157858.09 rows=1 width=1008)  
          (actual time=378.414..378.414 rows=0 loops=1)   
   Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))   
   Rows Removed by Filter: 1000000   
  Planning time: 0.075 ms   
  Execution time: 378.431 ms   
  (5 rows)

Set the max parallel degree to enable the use of parallelism in queries.

 set max_parallel_degree = 6;  
 Explain analyze select * from tbl_parallel_test where c1 < 10000  
 and c2 like '%bb%';  
                                QUERY PLAN                    
 -------------------------------------------------------------------------------------------------------------  
  Gather (cost=1000.00..29701.57 rows=1 width=1008)   
        (actual time=182.708..182.708 rows=0 loops=1)  
   Number of Workers: 5  
   -> Parallel Seq Scan on tbl_parallel_test  
         (cost=0.00..28701.47 rows=1 width=1008)  
         (actual time=179.496..1081.120 rows=0 loops=1)  
      Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))  
      Rows Removed by Filter: 1000000  
  Planning time: 0.078 ms  
  Execution time: 200.610 ms  
 (7 rows)

Here, we can see how changing max_parallel_degree allows the usage of parallel workers
to perform parallel sequential scans. We can notice in above example that even though we
have set max_parallel_degree as 6, still it uses 5 workers and the reason for same is that
currently the parallel workers are choosen based on size of relation.

Next, let us discuss about usage of functions in parallel query. A new clause PARALLEL
is added to the CREATE FUNCTION statement. There are three valid values that can be
used by user with this clause.

1. PARALLEL Unsafe - This indicates that the function can't be executed in parallel mode
and the presence of such a function in a SQL statement forces a serial execution plan.
2. PARALLEL Restricted - This indicates that the function can be executed in parallel mode,
but the execution is restricted to parallel group leader. As of now, if the qualification for any
particular relation has anything that is parallel restricted, that relation won't be chosen for
parallelism.
3. Parallel Safe - This indicates that the function is safe to run in parallel mode without
restriction.

The default value for function is PARALLEL Unsafe.

Now let us see the impact of using Parallel Safe and Unsafe function in the queries. I will
continue using the query used in previous example to explain the concept.

Create a Parallel Safe function

 create or replace function calc_factorial(a integer, fact_val integer)  
 returns integer   
  as $$   
  begin   
    perform (fact_val)!;   
    return a;   
  end;   
  $$ language plpgsql PARALLEL Safe;

Use it in query

 Explain analyze select * from tbl_parallel_test where  
               c1 < calc_factorial(10000, 10)   
               and c2 like '%bb%';   
        QUERY PLAN   
  --------------------------------------------------------------------------------   
  Gather (cost=1000.00..75154.99 rows=1 width=1008)   
     (actual time=120566.456..120566.456 rows=0 loops=1)   
   Number of Workers: 5   
   -> Parallel Seq Scan on tbl_parallel_test   
      (cost=0.00..74154.89 rows=1 width=1008)   
      (actual time=119635.421..359721.498 rows=0 loops=1)   
    Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))   
    Rows Removed by Filter: 1000000   
  Planning time: 54.904 ms   
  Execution time: 120622.631 ms   
  (7 rows)

Here we can see that Parallel Plan is chosen and the parallel safe function
is pushed to workers for evaluation of quals.

Now lets change that function as Parallel Unsafe and see how the above
query behaves.

  Alter Function calc_factorial(integer, integer) PARALLEL Unsafe;   
  Explain analyze select * from tbl_parallel_test where  
              c1 < calc_factorial(10000, 10)   
              and c2 like '%bb%';   
         QUERY PLAN   
  --------------------------------------------------------------------------------   
  Seq Scan on tbl_parallel_test   
     (cost=0.00..407851.91 rows=1 width=1008)   
     (actual time=33166.138..33166.138 rows=0 loops=1)   
   Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))   
   Rows Removed by Filter: 1000000   
  Planning time: 0.162 ms   
  Execution time: 33166.208 ms   
  (5 rows)

So using parallel unsafe functions in queries would lead to serial plans.

Next, let us see the Performance characteristics of Parallelism:

Non-default settings used to collect performance data:

 shared_buffers=32GB; min_wal_size=5GB; max_wal_size=10GB  
 checkpoint_timeout =30min; max_connections=300;  
 max_worker_processes=100;

Test setup

 create table tbl_perf(c1 int, c2 char(1000));  
 insert into tbl_perf values(generate_series(1,30000000),'aaaaa');  
 Explain analyze select c1 from tbl_perf where  
              c1 > calc_factorial($1,10) and  
              c2 like '%aa%';

The function calc_factorial is same as used in previous example and the values passed
to it are such that the desired percentage of rows can be selected. Example

 --"to select 1% of rows, below query can be used"  
 Explain analyze select c1 from tbl_perf where  
              c1 > calc_factorial(29700000,10) and  
              c2 like '%aa%';"  
 --"to select 10% of rows, below query can be used"  
 Explain analyze select c1 from tbl_perf where  
              c1 > calc_factorial(27000000,10) and  
              c2 like '%aa%';"  
 --"to select 25% of rows, below query can be used"  
 Explain analyze select c1 from tbl_perf where  
              c1 > calc_factorial(22500000,10) and  
              c2 like '%aa%';"

Performance Data -

1. With increase in degree of parallelism (more parallel workers), the time to complete
the execution reduces.
2. Along with workers, master backend also participates in execution due to which you
can see more time reduction in some cases.
3. After certain point, increasing max parallel degree won't help.

The cases we have seen in this blog are mostly the cases where parallel query helps by
using the workers, however there exists some cases like when qualification is very cheap
where it hurts or won't help even by employing more number of workers. There is
more investigation needed to make sure that planner won't choose such plans for parallelism.

Saturday 8 August 2015

Improved Writes in PostgreSQL For 9.6 (Part - 1)

Lately, PostgreSQL has gained attention because of numerous performance

improvements that are being done in various areas (like for 9.5 the major

areas as covered in my PGCon presentation are Read operations, Sorting,

plpgsql, new index type for data access, compression of full_page_writes),

however still there is more to be done to make it better than other commercial

RDBMS's and one of the important areas for improvements is Write

operations as shown in one of my previous posts (Write Scalability in

PostgreSQL). During my investigation of Write operations, I found that

there are locking bottlenecks during Write operations which is one of the

cause for limiting its performance and the one which contends most is

ProcArrayLock which is used during commit of transaction and for taking

Snapshots.

Removing the contention around ProcArrayLock gives a very good boost

in performance especially at higher client count and this work has been done

for PostgreSQL 9.6. To start with let us first discuss the improvement

in-terms of TPS (transactions per second) after this work. I have ran a pgbench

read-write (sort of tpcb) workload to compare the performance difference

with and without this commit in PostgreSQL on Intel m/c having 8 sockets,

64 cores (128 hardware threads), 500GB RAM and here is performance data

(running same tests on IBM POWER-8 m/c also shows similar gain)

Non-default settings used in all the tests are:
max_connections = 300
shared_buffers = 8GB
wal_buffers = 256MB
min_wal_size=10GB
max_wal_size=15GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9

The data is taken when all the data fits in shared_buffers as this work mainly helps
such cases. The performance increase is visible at somewhat higher client count,
at 64 clients we will see 30% improvement and at 256 clients, the performance
improvement is 133%. At lower client-count (8 or 16 clients), there is not much
difference (due to fluctuation, I see 1-2% difference, but I think for such cases this
work doesn't help).

Now coming to the work done to improve the performance, presently for the
correctness requirement of taking snapshot's in PostgreSQL, it enforces the strict
serialization of commits and rollbacks with snapshot-taking: it doesn't allow any
transaction to exit the set of running transactions while a snapshot is being taken.
To achieve the same, while taking snapshot it acquires ProcArrayLock in SHARED
mode and each exiting transaction acquires it in EXCLUSIVE mode. So in this
protocol, there are two different types of contention, one is between a backend
which is trying to acquire a snapshot with backend trying to commit a transaction
and second is among backends that are trying to commit a transaction at same
time. The idea used in this work is to allow only one backend (we can call it as
a group leader) at-a-time to take a ProcArrayLock and complete the work for
all other transactions which are trying to commit the transactions at the same
time. This helps in minimising the ProcArrayLock acquisition in EXCLUSIVE
mode and which intern greatly reduces the contention around it.

Apart from the benefit this patch brings, it also opens up the opportunity
to do more optimisations to reduce contention of various other locks like
CLogControlLock and WALWriteLock etc. in PostgreSQL which I see as a huge
benefit for Write operations. I hope to see more improvements for Write
operations and cover them in future Blogs.

Last but not least, I would like to thank all who were involved in this work. Firstly
I would like to thank my employer EnterpriseDB and Robert Haas who not only
encouraged me to work in this area, but also helped in various stages of this
Patch development. When I was in-middle of this work and wanted feedback
and suggestions, a lot of people (during PGCon) shared their thoughts with me
and among them who really helped me to move this work to a level where it
can be presented to PostgreSQL community are Robert Haas, Andres Freund
and Simon Riggs. In the end, I would also like to thank Pavan Deolasee who
has reviewed this patch.

Wednesday 13 May 2015

Extend Tar Format in pg_basebackup

As of now, one can't reliably use tar format to take backup on Windows

because it can't restore tablespaces data which is stored in form of symbolic

links in <data_directory>/pg_tblspc/. The reason for the same is that native

windows utilites are not able to create symbolic links while extracting files

from tar. It might be possible to create symbolic links if cygwin is installed

on your system, however we need this feature to work for native windows as

well.

In PostgreSQL 9.5, a new feature (commit id - 72d422a5) to extend existing

tar format made it possible to reliably take the backup (in tar mode).

From user perspective, there is nothing much that is changed to take the backup

except that tar format mode (--format=tar) in pg_basebackup (of the PostgreSQL

9.5 version) will only work with server version 9.5 or later. This feature is mainly

required for windows, but for the sake consistency it has been changed for all

platforms and also it should enable long (length greater than 99) target symbolic

link for tar format (I think the changes for same are still not done, but we can do

the same now as this feature is committed).

The basic idea behind the feature is that it forms the tablespace map of

all the tablespace symbolic links that are present inside

<data_directory>/pg_tblspc/ and store the same in data_directory for

Exclusive backups (aka backups taken via pg_start_backup() and

pg_stop_backup() functions) and store in backup archive for Non-Exclusive

backups (aka backups taken by pg_basebackup).

The format of tablespace_map file is:

16384 E:\WorkSpace\PostgreSQL\master\tbs

16388 E:\WorkSpace\PostgreSQL\master\tbs 2 3

The tablespace symbolic links are restored during archive recovery and the

tablespace_map file will be renamed to tablespace_map.old at the end of

recovery similar to backup_label file.

Friday 17 April 2015

Write Scalability in PostgreSQL

I have ran some benchmark tests to see the Write performance/scalability in

PostgreSQL 9.5 and thought it would be good to share the same with others,
so writing this blog post.

I have ran a pgbench tests (TPC-B (sort of) load) to compare the performance

difference between different modes and scale factor in HEAD (e5f455f5) on

IBM POWER-8 having 24 cores, 192 hardware threads, 492GB RAM

and here are the performance results

Some of the default settings used in all the tests are:

min_wal_size=15GB

max_wal_size=20GB

checkpoint_timeout =35min

maintenance_work_mem = 1GB

checkpoint_completion_target = 0.9

autovacuum=off

I have kept auto vacuum as off to reduce the fluctuation due to same and is

dropping and re-creating the database after each run. I have kept high values

of min_wal_size and max_wal_size to reduce the effect of checkpoints, probably

somewhat lower values could have served the purpose of this workload, but I

haven't tried it.

The data is mainly taken for 2 kind of modes (synchronous_commit = on | off) and

at 2 different scale factors to cover the cases when all the data fits in shared buffers

(scale_factor = 300) and when all the data can't fit in shared buffers, but can fit in

RAM (scale_factor = 3000).

First lets talk about synchronous_commit = off case, here when all the data fits in

shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count

with TPS being approximately 75 percent higher at 64 client-count as compare to 8

client count which doesn't look bad. When all the data doesn't fit in shared buffers,

but fit in RAM (scale_factor = 3000), we can see scalability upto 32 client-count with

TPS being 64 percent higher than at 8 client-count and then it falls there on.

One major difference in case of Writes when data doesn't fit in shared_buffers is

that backends performing transactions needs to write the dirty buffers themselves

when they are not able to find a clean buffer to read the page, this can hamper

the TPS.

Now let's talk about synchronous_commit = on case, here when all the data fits in

shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count

with TPS being approximately 189 percent higher at 64 client-count as compare to

8 client count which sounds good. When all the data doesn't fit in shared buffers,

but fit in RAM (scale_factor = 3000), we can see a pretty flat graph with some

scalability upto 16 client-count with TPS being approximately 22 percent higher than

at 8 client-count and then it stays as it is.

Here one point to note is that when the data fits in shared_buffers (scale_factor = 300),

TPS at higher client-count (64) in synchronous_commit = on mode becomes equivalent to

TPS in synchronous_commit = off which suggests that there is no major contention

due to WAL writing in such loads.

In synchronous_commit = on case, when the data doesn't fit in shared_buffers

(scale_factor = 3000), the TPS is quite low and one reason is that backends might

be performing writes themselves, but not sure if the performance is so low just

due to that reason as I have tried with different values of Bgwriter related parameters

(bgwriter_delay, bgwriter_lru_maxpages, bgwriter_lru_multiplier), but there is no

much difference.

As per my knowledge, the locks that can lead to contention for this workload

are:

a. ProcArrayLock (used for taking snapshot and at transaction commit)

b. WALWriteLock (used for performing WALWrites)

c. CLOGControlLock (used to read and write transaction status)

d. WALInsertLocks (used for writing data to WAL buffer)

I think among these ProcArrayLock and WALWriteLock are the candidates

which can be the reason for contention, but I haven't done any deep analysis

to find out the same.

Now it could be that the bottleneck is due to multiple locks as was the case

for read operations which I have explained in my previous Read Scalability

blog or it could be due to one of these locks. I think all this needs further

analysis and work. Thats all what I want to say for now.

Tuesday 17 March 2015

Different Approaches for MVCC used in well known Databases

Database Management Systems uses MVCC to avoid the problem of
Writers blocking Readers and vice-versa, by making use of multiple
versions of data.

There are essentially two approaches to multi-version concurrency.

Approaches for MVCC
The first approach is to store multiple versions of records in the
database, and garbage collect records when they are no longer
required. This is the approach adopted by PostgreSQL and
Firebird/Interbase. SQL Server also uses somewhat similar approach
with the difference that old versions are stored in tempdb
(database different from main database).

The second approach is to keep only the latest version of data in
the database, but reconstruct older versions of data dynamically
as required by using undo. This is approach adopted by Oracle
and MySQL/InnoDB

MVCC in PostgreSQL
In PostgreSQL, when a row is updated, a new version (called a tuple)
of the row is created and inserted into the table. The previous version
is provided a pointer to the new version. The previous version is
marked “expired", but remains in the database until it is garbage collected.

In order to support multi-versioning, each tuple has additional data
recorded with it:
xmin - The ID of the transaction that inserted/updated the
row and created this tuple.
xmax - The transaction that deleted the row, or created a
new version of this tuple. Initially this field is null.

Transaction status is maintained in CLOG which resides in $Data/pg_clog.
This table contains two bits of status information for each transaction;
the possible states are in-progress, committed, or aborted.

PostgreSQL does not undo changes to database rows when a transaction
aborts - it simply marks the transaction as aborted in CLOG . A PostgreSQL
table therefore may contain data from aborted transactions.

A Vacuum cleaner process is provided to garbage collect expired/aborted
versions of a row. The Vacuum Cleaner also deletes index entries
associated with tuples that are garbage collected.

A tuple is visible if its xmin is valid and xmax is not.
“Valid" means “either committed or the current transaction".
To avoid consulting the CLOG table repeatedly, PostgreSQL maintains
status flags in the tuple that indicate whether the tuple is “known committed"
or “known aborted".

MVCC in Oracle
Oracle maintain old versions in rollback segments (also known as
'undo log'). A transaction ID is not a sequential number; instead, it is
made of a set of numbers that points to the transaction entry (slot) in a
Rollback segment header.

Rollback segments have the property that new transactions can reuse
storage and transaction slots used by older transactions that are
committed or aborted.
This automatic reuse facility enables Oracle to manage large numbers
of transactions using a finite set of rollback segments.

The header block of the rollback segment is used as a transaction table.
Here the status of a transaction is maintained (called System Change Number,
or SCN, in Oracle). Rather than storing a transaction ID with each row
in the page, Oracle saves space by maintaining an array of unique transactions
IDs separately within the page, and stores only the offset of this array with
the row.

Along with each transaction ID, Oracle stores a pointer to the last undo record
created by the transaction for the page. Not only are table rows stored in this
way, Oracle employs the same techniques when storing index rows. This is
one of the major difference between PostgreSQL and Oracle.

When an Oracle transaction starts, it makes a note of the current SCN. When
reading a table or an index page, Oracle uses the SCN number to determine if
the page contains the effects of transactions that should not be visible to the
current transaction. Oracle checks the commit status of a transaction by
looking up the associated Rollback segment header, but, to save time, the first
time a transaction is looked up, its status is recorded in the page itself to avoid
future lookups.

If the page is found to contain the effects of invisible transactions, then Oracle
recreates an older version of the page by undoing the effects of each such
transaction. It scans the undo records associated with each transaction and
applies them to the page until the effects of those transactions are removed.
The new page created this way is then used to access the tuples within it.

Record Header in Oracle
A row header never grows, always a fixed size. For non-cluster tables,
the row header is 3 bytes. One byte is used to store flags, one byte to
indicate if the row is locked (for example because it's updated but not
committed), and one byte for the column count.

MVCC in SQL Server
Snapshot isolation and read committed using row versioning are enabled
at the database level. Only databases that require this option must enable
it and incur the overhead associated with it.

Versioning effectively starts with a copy-on-write mechanism that is
invoked when a row is modified or deleted. Row versioning–based
transactions can effectively "view" the consistent version of the data
from these previous row versions.

Row versions are stored within the version store that is housed within the
tempdb database. More specifically, when a record in a table or index is
modified, the new record is stamped with the "sequence_number" of the
transaction that is performing the modification.
The old version of the record is copied to the version store, and the new record
contains a pointer to the old record in the version store.
If multiple long-running transactions exist and multiple "versions" are required,
records in the version store might contain pointers to even earlier versions of
the row.

Version store cleanup in SQL Server
SQL Server manages the version store size automatically, and maintains a
cleanup thread to make sure it does not keep versioned rows around longer
than needed. For queries running under Snapshot Isolation, the version
store retains the row versions until the transaction that modified the data
completes and the transactions containing any statements that reference the
modified data complete. For SELECT statements running under
Read Committed Snapshot Isolation, a particular row version is no longer
required, and is removed, once the SELECT statement has executed.

If tempdb actually runs out of free space, SQL Server calls the cleanup
function and will increase the size of the files, assuming we configured the
files for auto-grow. If the disk gets so full that the files cannot grow,
SQL Server will stop generating versions. If that happens, any snapshot
query that needs to read a version that was not generated due to space
constraints will fail.

Record Header in SQL Server
4 bytes long
- two bytes of record metadata (record type)
- two bytes pointing forward in the record to the NULL bitmap. This is
offset to some actual data in record (fixed length columns).

Versioning tag - this is a 14-byte structure that contains a timestamp
plus a pointer into the version store in tempdb.
Here timestamp is trasaction_seq_number, the only time that rows get
versioning info added to record is when it’s needed to support a
versioning operation.

As the versioning information is optional, I think that is the reason
they could store this info in index records as well without much
impact.

Database	PostgreSQL	Oracle	SQL Server
Storage for Old Versions	In the main Segment (Heap/Index)	In the separate segment (Rollback Segment/Undo)	In the separate database (tempdb – known as version store)
Size of Tuple Header (bytes)	24	3	Fixed – 4 Variable - 14
Clean up	Vacuum	System Monitor Process (SMON)	Ghost Cleanup task

Conclusion of study
As other databases store version/visibility information in index, that makes
index cleanup easier (as it is no longer tied to heap for visibility information).
The advantage for not storing the visibility information in index is that for
Delete operations, we don't need to perform an index delete and probably the
size of index record could be somewhat smaller.

Oracle and probably MySQL (Innodb) needs to write the record in undo
segment for Insert statement whereas in PostgreSQL/SQL Server, the new
record version is created only when a row is modified or deleted.

Only changed values are written to undo whereas PostgreSQL/SQL Server
creates a complete new tuple for modified row. This avoids bloat in the main
heap segment.

Both Oracle and SQL Server has some way to restrict the growth of version
information whereas PostgreSQL/PPAS doesn't have any way.

Saturday 24 January 2015

Read Scalability in PostgreSQL 9.5

In PostgreSQL 9.5, we will see a boost in scalability for read workload
when the data can fit in RAM. I have ran a pgbench read-only load to
compare the performance difference between 9.4 and HEAD (62f5e447)
on IBM POWER-8 having 24 cores, 192 hardware threads, 492GB RAM
and here is the performance data

The data is mainly taken for 2 kind of workloads, when all the data fits

in shared buffers (scale_factor = 300) and when all the data can't fit in

shared buffers, but can fit in RAM (scale_factor = 1000).

First lets talk about 300 scale factor case, in 9.4 it peaks at 32 clients,

now it peaks at 64 clients and we can see the performance improvement

upto (~98%) and it is better in all cases at higher client count starting from

32 clients. Now the main work which lead to this improvement is
commit - ab5194e6 (Improve LWLock scalability). The previous implementation
has a bottleneck around spin locks that were acquired for LWLock
Acquisition and Release and the implantation for 9.5 has changed the
LWLock implementation to use atomic operations to manipulate the state.
Thanks to Andres Freund (and according to me the credit goes to reviewers
(Robert Haas and myself) as well who have reviewed multiple versions
of this patch) author of this patch due to whom many PostgreSQL users will
be happy.

Now lets discuss about 1000 scale factor case, in this case, we could

see the good performance improvement (~25%) even at 32 clients and it

went upto (~96%) at higher client count, in this case also where in 9.4

it was peaking at 32 client count, now it peaks at 64 client count and

the performance is better at all higher client counts. The main work

which lead to this improvement is commit id 5d7962c6 (Change locking

regimen around buffer replacement) and commit id 3acc10c9 (Increase

the number of buffer mapping partitions to 128). In this case there were

mainly 2 bottlenecks (a) a BufFreeList LWLock was getting acquired to

find a free buffer for a page (to find free buffer, it needs to execute

clock sweep) which becomes bottleneck when many clients try to perform the

same action simultaneously (b) to change the association of buffer in

buffer mapping hash table a LWLock is acquired on a hash partition to

which the buffer to be associated belongs and as there were just 16

such partitions, there was huge contention when multiple clients starts

operating on same partition. To reduce the bottleneck due to (a), used

a spinlock which is held just long enough to pop the freelist or advance

the clock sweep hand, and then released. If we need to advance the

clock sweep further, we reacquire the spinlock once per buffer. To reduce

the bottleneck due to (b), increase the buffer partitions to 128. The crux

of this improvement is that we had to resolve both the bottlenecks (a and b)

together to see a major improvement in scalability. The initial patch for

this improvement is prepared by me and then Robert Haas extracted the

important part of patch and committed the same. Many thanks to both

Robert Haas and Andres Freund who not only reviewed the patch, but

given lot of useful suggestions during this work.

During the work on improvements in buffer management, I noticed that

the next bigger bottleneck that could buy us reasonably good improvement

in read workloads is in dynamic hash tables used to manage shared buffers,

so improving the concurrency of dynamic hash tables could help further

improving the read operation. There was some discussion about using

concurrent hash table for shared buffers (patch by Robert Haas), but still

it has not materialized.