PostgreSQL and Databases in general: Write Scalability in PostgreSQL

I have ran some benchmark tests to see the Write performance/scalability in

PostgreSQL 9.5 and thought it would be good to share the same with others,
so writing this blog post.

I have ran a pgbench tests (TPC-B (sort of) load) to compare the performance

difference between different modes and scale factor in HEAD (e5f455f5) on

IBM POWER-8 having 24 cores, 192 hardware threads, 492GB RAM

and here are the performance results

Some of the default settings used in all the tests are:

min_wal_size=15GB

max_wal_size=20GB

checkpoint_timeout =35min

maintenance_work_mem = 1GB

checkpoint_completion_target = 0.9

autovacuum=off

I have kept auto vacuum as off to reduce the fluctuation due to same and is

dropping and re-creating the database after each run. I have kept high values

of min_wal_size and max_wal_size to reduce the effect of checkpoints, probably

somewhat lower values could have served the purpose of this workload, but I

haven't tried it.

The data is mainly taken for 2 kind of modes (synchronous_commit = on | off) and

at 2 different scale factors to cover the cases when all the data fits in shared buffers

(scale_factor = 300) and when all the data can't fit in shared buffers, but can fit in

RAM (scale_factor = 3000).

First lets talk about synchronous_commit = off case, here when all the data fits in

shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count

with TPS being approximately 75 percent higher at 64 client-count as compare to 8

client count which doesn't look bad. When all the data doesn't fit in shared buffers,

but fit in RAM (scale_factor = 3000), we can see scalability upto 32 client-count with

TPS being 64 percent higher than at 8 client-count and then it falls there on.

One major difference in case of Writes when data doesn't fit in shared_buffers is

that backends performing transactions needs to write the dirty buffers themselves

when they are not able to find a clean buffer to read the page, this can hamper

the TPS.

Now let's talk about synchronous_commit = on case, here when all the data fits in

shared_buffers (scale_factor = 300), we can see the scalability upto 64 client count

with TPS being approximately 189 percent higher at 64 client-count as compare to

8 client count which sounds good. When all the data doesn't fit in shared buffers,

but fit in RAM (scale_factor = 3000), we can see a pretty flat graph with some

scalability upto 16 client-count with TPS being approximately 22 percent higher than

at 8 client-count and then it stays as it is.

Here one point to note is that when the data fits in shared_buffers (scale_factor = 300),

TPS at higher client-count (64) in synchronous_commit = on mode becomes equivalent to

TPS in synchronous_commit = off which suggests that there is no major contention

due to WAL writing in such loads.

In synchronous_commit = on case, when the data doesn't fit in shared_buffers

(scale_factor = 3000), the TPS is quite low and one reason is that backends might

be performing writes themselves, but not sure if the performance is so low just

due to that reason as I have tried with different values of Bgwriter related parameters

(bgwriter_delay, bgwriter_lru_maxpages, bgwriter_lru_multiplier), but there is no

much difference.

As per my knowledge, the locks that can lead to contention for this workload

are:

a. ProcArrayLock (used for taking snapshot and at transaction commit)

b. WALWriteLock (used for performing WALWrites)

c. CLOGControlLock (used to read and write transaction status)

d. WALInsertLocks (used for writing data to WAL buffer)

I think among these ProcArrayLock and WALWriteLock are the candidates

which can be the reason for contention, but I haven't done any deep analysis

to find out the same.

Now it could be that the bottleneck is due to multiple locks as was the case

for read operations which I have explained in my previous Read Scalability

blog or it could be due to one of these locks. I think all this needs further

analysis and work. Thats all what I want to say for now.

7 comments:

Anonymous18 April 2015 at 02:06
There is something wrong with the link to the images, can't see the images.
Jamey Hanson20 April 2015 at 05:37
What is the working definition of "scale factor"?
volk20 April 2015 at 08:25
Well, most likely the problem is inside the clocksweep while searching for a victim buffer to replace from shared_buffers. I have described it here [0] and in the bottom there is a stap for seeing that.

In that post there also are links to a couple of patches in 9.5 (and you are the reviewer of one of them :)) which should make it a bit better. Your post about read scalability showed that it really makes it better for read-only load (and my testing also confirms it), but it seems that it is not so for heavy writing load :( I haven't finished my testing of 9.5 with writing load yet but early results do not show much difference with 9.4.

[0] https://simply.name/postgresql-and-systemtap.html

Friday, 17 April 2015

Write Scalability in PostgreSQL

7 comments: