Saturday 24 January 2015

Read Scalability in PostgreSQL 9.5


In PostgreSQL 9.5, we will see a boost in scalability for read workloads
when the data can fit in RAM.  I ran a pgbench read-only load to
compare the performance difference between 9.4 and HEAD (62f5e447)
on an IBM POWER-8 machine with 24 cores, 192 hardware threads, and 492GB
of RAM, and here is the performance data:

[Performance graph: pgbench read-only TPS for PostgreSQL 9.4 vs. HEAD at increasing client counts, scale factors 300 and 1000]

The data was mainly taken for two kinds of workloads: one where all the data
fits in shared buffers (scale_factor = 300) and one where all the data can't
fit in shared buffers but can fit in RAM (scale_factor = 1000).
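
For anyone who wants to try a similar run, the shape of the pgbench commands
is roughly as below; the database name, run duration and thread count here
are my assumptions rather than the exact settings used for the numbers above.

    pgbench -i -s 300 pgbench              # initialize test data at scale factor 300 (use -s 1000 for the second workload)
    pgbench -c 64 -j 64 -S -T 300 pgbench  # 64 clients, select-only (read-only) run for 300 seconds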

First, let's talk about the 300 scale factor case.  In 9.4 performance peaks
at 32 clients; now it peaks at 64 clients, and we can see a performance
improvement of up to ~98%, with better numbers in all cases at higher client
counts starting from 32 clients.  The main work which led to this improvement
is commit ab5194e6 (Improve LWLock scalability).  The previous implementation
had a bottleneck around the spinlocks that were acquired for LWLock
acquisition and release; the implementation in 9.5 changes LWLocks to use
atomic operations to manipulate the lock state.  Thanks to Andres Freund,
the author of this patch, due to whom many PostgreSQL users will be happy
(and, in my opinion, the credit also goes to the reviewers, Robert Haas and
myself, who reviewed multiple versions of this patch).
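
To give a feel for the technique, here is a minimal sketch of acquiring a
lock in shared mode with a compare-and-swap loop on a single state word; the
names and state layout are my own and this is not PostgreSQL's actual LWLock
code.

    /* Minimal sketch of an atomic-state lock; not the real LWLock implementation. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LOCK_EXCLUSIVE 0x80000000u      /* assumed flag bit marking an exclusive holder */

    typedef struct { atomic_uint state; } SketchLock;   /* low bits count shared holders */

    /* One attempt to take the lock in shared mode, without any spinlock. */
    static bool shared_acquire_once(SketchLock *lock)
    {
        unsigned old = atomic_load(&lock->state);
        while ((old & LOCK_EXCLUSIVE) == 0)
        {
            /* No exclusive holder: try to bump the shared count atomically. */
            if (atomic_compare_exchange_weak(&lock->state, &old, old + 1))
                return true;                /* CAS succeeded; we hold the lock */
            /* CAS failed: 'old' now holds the fresh state, so just retry. */
        }
        return false;                       /* exclusive holder present; caller has to wait */
    }

    static void shared_release(SketchLock *lock)
    {
        atomic_fetch_sub(&lock->state, 1);
    }

    int main(void)
    {
        SketchLock lock;
        atomic_init(&lock.state, 0);
        if (shared_acquire_once(&lock))
        {
            printf("acquired shared, state = %u\n", atomic_load(&lock.state));
            shared_release(&lock);
        }
        return 0;
    }

The point is that the contended path becomes a single atomic instruction
instead of a spinlock acquire/release pair, which is what removes the
spinlock contention at high client counts.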

Now let's discuss the 1000 scale factor case.  Here we see a good performance
improvement (~25%) even at 32 clients, and it goes up to ~96% at higher client
counts; in this case as well, where 9.4 was peaking at 32 clients, 9.5 peaks
at 64 clients and performance is better at all higher client counts.  The main
work which led to this improvement is commit 5d7962c6 (Change locking regimen
around buffer replacement) and commit 3acc10c9 (Increase the number of buffer
mapping partitions to 128).  In this case there were mainly two bottlenecks:
(a) the BufFreeList LWLock was acquired to find a free buffer for a page
(finding a free buffer requires running the clock sweep), which becomes a
bottleneck when many clients try to perform the same action simultaneously;
and (b) to change the association of a buffer in the buffer mapping hash
table, an LWLock is acquired on the hash partition to which the buffer
belongs, and as there were just 16 such partitions, there was huge contention
when multiple clients started operating on the same partition.  To reduce
bottleneck (a), a spinlock is now used which is held just long enough to pop
the freelist or advance the clock sweep hand, and then released; if the clock
sweep needs to advance further, the spinlock is reacquired once per buffer.
To reduce bottleneck (b), the number of buffer mapping partitions was
increased to 128.  The crux of this improvement is that we had to resolve
both bottlenecks (a and b) together to see a major improvement in
scalability.  The initial patch for this improvement was prepared by me, and
then Robert Haas extracted the important part of the patch and committed it.
Many thanks to both Robert Haas and Andres Freund, who not only reviewed the
patch but also gave a lot of useful suggestions during this work.
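
The two fixes can be sketched roughly as below; a plain test-and-set flag
stands in for PostgreSQL's spinlock and all names are my own, so treat this
as an illustration of the idea rather than the real buffer manager code.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBUFFERS 16384                    /* assumed number of shared buffers */
    #define NUM_BUFFER_PARTITIONS 128         /* raised from 16 to 128 by commit 3acc10c9 */

    static atomic_flag strategy_lock = ATOMIC_FLAG_INIT;   /* stand-in for the spinlock */
    static int clock_hand = 0;

    static void spin_lock(atomic_flag *f)   { while (atomic_flag_test_and_set(f)) { /* spin */ } }
    static void spin_unlock(atomic_flag *f) { atomic_flag_clear(f); }

    /* (a) Hold the spinlock only long enough to advance the clock sweep hand. */
    static int advance_clock_sweep(void)
    {
        spin_lock(&strategy_lock);
        int victim = clock_hand;
        clock_hand = (clock_hand + 1) % NBUFFERS;
        spin_unlock(&strategy_lock);          /* released before the buffer itself is examined */
        return victim;                        /* caller reacquires once per buffer if it must keep sweeping */
    }

    /* (b) Pick which of the 128 partition locks covers a given buffer tag hash. */
    static unsigned buffer_partition(uint32_t hashcode)
    {
        return hashcode % NUM_BUFFER_PARTITIONS;
    }

    int main(void)
    {
        printf("next victim buffer: %d\n", advance_clock_sweep());
        printf("partition for hash 0xDEADBEEF: %u\n", buffer_partition(0xDEADBEEFu));
        return 0;
    }

With eight times as many partitions, backends mapping different pages are far
less likely to collide on the same partition lock, and keeping the spinlock
hold time to a single hand advance stops the free-buffer search from
serializing all the backends behind one lock.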


During the work on improvements in buffer management, I noticed that the
next big bottleneck that could buy us a reasonably good improvement in read
workloads is in the dynamic hash tables used to manage shared buffers, so
improving the concurrency of dynamic hash tables could further improve read
performance.  There has been some discussion about using a concurrent hash
table for shared buffers (a patch by Robert Haas), but it has not yet
materialized.