Friday, 17 April 2015

Write Scalability in PostgreSQL




I have run some benchmark tests to see the write performance/scalability in
PostgreSQL 9.5 and thought it would be good to share the results with others,
so I am writing this blog post.


I have run pgbench tests (TPC-B (sort of) load) to compare the performance
across different modes and scale factors in HEAD (e5f455f5) on an IBM POWER-8
machine having 24 cores, 192 hardware threads and 492GB of RAM, and here are
the performance results:

[Graphs: pgbench TPS vs. client count for synchronous_commit = on | off at scale factors 300 and 3000]

Some of the default settings used in all the tests are:
min_wal_size = 15GB
max_wal_size = 20GB
checkpoint_timeout = 35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
autovacuum = off

I have kept autovacuum off to reduce the fluctuation it causes, and I am
dropping and re-creating the database after each run.  I have kept high values
of min_wal_size and max_wal_size to reduce the effect of checkpoints; probably
somewhat lower values could have served the purpose for this workload, but I
haven't tried them.

The data is mainly taken for two modes (synchronous_commit = on | off) and at
two different scale factors, to cover the case when all the data fits in shared
buffers (scale_factor = 300) and the case when all the data can't fit in shared
buffers but can fit in RAM (scale_factor = 3000).
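
The runs follow the usual pgbench pattern; a rough sketch of the commands is
below. The database name, thread count and run duration here are illustrative
assumptions, not the exact values behind the numbers above:

# fresh database for every run (autovacuum is off, so each run starts clean)
dropdb --if-exists pgtest
createdb pgtest

# initialize at the desired scale factor (300 or 3000)
pgbench -i -s 300 pgtest

# TPC-B (sort of) run at a given client count, e.g. 64 clients
pgbench -c 64 -j 64 -T 300 -M prepared pgtest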

First, let's talk about the synchronous_commit = off case. Here, when all the
data fits in shared_buffers (scale_factor = 300), we can see scalability up to
64 clients, with TPS at 64 clients being approximately 75 percent higher than
at 8 clients, which doesn't look bad. When all the data doesn't fit in shared
buffers but fits in RAM (scale_factor = 3000), we can see scalability up to 32
clients, with TPS being 64 percent higher than at 8 clients, and it falls off
from there.

One major difference for writes when the data doesn't fit in shared_buffers is
that backends performing transactions need to write dirty buffers themselves
when they can't find a clean buffer into which to read a page, and this can
hamper TPS.
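
One way to see how much of this backend writing is happening (a suggestion for
checking it, not something I captured for these runs) is to compare the buffers
written by backends with those written by the checkpointer and background
writer in pg_stat_bgwriter around a run:

# reset the shared counters, run the benchmark, then see who wrote the dirty buffers
psql -c "SELECT pg_stat_reset_shared('bgwriter');"
pgbench -c 64 -j 64 -T 300 -M prepared pgtest
psql -c "SELECT buffers_checkpoint, buffers_clean, buffers_backend FROM pg_stat_bgwriter;"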

Now let's talk about the synchronous_commit = on case. Here, when all the data
fits in shared_buffers (scale_factor = 300), we can see scalability up to 64
clients, with TPS at 64 clients being approximately 189 percent higher than at
8 clients, which sounds good. When all the data doesn't fit in shared buffers
but fits in RAM (scale_factor = 3000), we see a pretty flat graph, with some
scalability up to 16 clients where TPS is approximately 22 percent higher than
at 8 clients, and then it stays about the same.

One point to note here is that when the data fits in shared_buffers (scale_factor = 300),
TPS at the higher client count (64) in synchronous_commit = on mode becomes equivalent to
TPS in synchronous_commit = off mode, which suggests that there is no major contention
due to WAL writing in such loads.

In the synchronous_commit = on case, when the data doesn't fit in shared_buffers
(scale_factor = 3000), the TPS is quite low. One reason is that backends might
be performing the writes themselves, but I am not sure the performance is so low
just for that reason, as I have tried different values of the bgwriter-related
parameters (bgwriter_delay, bgwriter_lru_maxpages, bgwriter_lru_multiplier) and
there is not much difference.
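
For what it's worth, the kind of variation I mean is along these lines; the
values below are only examples of making the background writer more aggressive,
not a record of the exact settings I tried:

bgwriter_delay = 10ms           # default 200ms; wake the bgwriter more often
bgwriter_lru_maxpages = 1000    # default 100; allow more buffers per round
bgwriter_lru_multiplier = 10.0  # default 2.0; clean further ahead of demand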

As per my knowledge, the locks that can lead to contention for this workload
are:
a. ProcArrayLock (used for taking snapshots and at transaction commit)
b. WALWriteLock (used for performing WAL writes)
c. CLOGControlLock (used to read and write transaction status)
d. WALInsertLocks (used for writing data to WAL buffers)

I think among these, ProcArrayLock and WALWriteLock are the candidates most
likely to be the cause of the contention, but I haven't done any deep analysis
to confirm that.
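
If someone wants to dig into this, one approach (a suggestion about methodology,
not analysis I have done for these runs) is to profile the server with perf
while the benchmark is running and look for LWLock acquire/wait paths near the
top of the report:

# sample all CPUs with call graphs for 30 seconds during the pgbench run
perf record -a -g -- sleep 30

# look for LWLockAcquire and related wait paths in the profile
perf report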

Now it could be that the bottleneck is due to multiple locks, as was the case
for read operations which I have explained in my previous Read Scalability
blog, or it could be due to just one of these locks.  I think all this needs
further analysis and work. That's all I want to say for now.

7 comments:

  1. There is something wrong with the link to the images, can't see the images.

    Replies
    1. It was visible to me when I checked in Safari (Mac browser), but it seems there is some problem when opening it with other browsers. I have inserted the images; could you verify once again?

    2. Now it's OK

  2. What is the working definition of "scale factor"?

    Replies
    1. At the default "scale factor" of 1, the tables initially contain this many rows:

      table                # of rows
      ---------------------------------
      pgbench_branches             1
      pgbench_tellers             10
      pgbench_accounts        100000
      pgbench_history              0

      You can (and, for most purposes, probably should) increase the number of rows by using the -s (scale factor) option.

      For more information on the workload and test tool, you can refer to the PostgreSQL docs:
      http://www.postgresql.org/docs/devel/static/pgbench.html
      If you still need more clarification, then do let me know.

  3. Well, most likely the problem is inside the clocksweep while searching for a victim buffer to replace from shared_buffers. I have described it here [0] and in the bottom there is a stap for seeing that.

    In that post there are also links to a couple of patches in 9.5 (and you are the reviewer of one of them :)) which should make it a bit better. Your post about read scalability showed that it really makes things better for a read-only load (and my testing also confirms it), but it seems that it is not so for a heavy writing load :( I haven't finished my testing of 9.5 with a writing load yet, but early results do not show much difference from 9.4.

    [0] https://simply.name/postgresql-and-systemtap.html

    Replies
    1. Yeah, one part of the problem could be the clock sweep, and the other part is that during replacement of a page in shared_buffers we need to flush the page if it is dirty, which I think is true in most write-load cases. We can improve part of it by changing bgwriter settings, but I think that still won't solve the problem. I also think there will be no visible impact on write scalability in 9.5. By the way, the tests you have mentioned in your blog post don't seem to be the pgbench tests which I have run.
