Tuesday 17 March 2015

Different Approaches for MVCC used in well known Databases

Database Management Systems uses MVCC to avoid the problem of
Writers blocking Readers and vice-versa, by making use of multiple
versions of data.

There are essentially two approaches to multi-version concurrency.

Approaches for MVCC
The first approach is to store multiple versions of records in the
database, and garbage collect records when they are no longer
required. This is the approach adopted by PostgreSQL and
Firebird/Interbase. SQL Server also uses somewhat similar approach
with the difference that old versions are stored in tempdb
(database different from main database).

The second approach is to keep only the latest version of data in
the database, but reconstruct older versions of data dynamically
as required by using undo. This is approach adopted by Oracle
and MySQL/InnoDB

MVCC in PostgreSQL
In PostgreSQL, when a row is updated, a new version (called a tuple)
of the row is created and inserted into the table. The previous version
is provided a pointer to the new version. The previous version is
marked “expired", but remains in the database until it is garbage collected.

In order to support multi-versioning, each tuple has additional data
recorded with it:
xmin - The ID of the transaction that inserted/updated the
row and created this tuple.
xmax - The transaction that deleted the row, or created a
new version of this tuple. Initially this field is null.

Transaction status is maintained in CLOG which resides in $Data/pg_clog.
This table contains two bits of status information for each transaction;
the possible states are in-progress, committed, or aborted.

PostgreSQL does not undo changes to database rows when a transaction
aborts - it simply marks the transaction as aborted in CLOG . A PostgreSQL
table therefore may contain data from aborted transactions.

A Vacuum cleaner process is provided to garbage collect expired/aborted
versions of a row. The Vacuum Cleaner also deletes index entries
associated with tuples that are garbage collected.

A tuple is visible if its xmin is valid and xmax is not.
“Valid" means “either committed or the current transaction".
To avoid consulting the CLOG table repeatedly, PostgreSQL maintains
status flags in the tuple that indicate whether the tuple is “known committed"
or “known aborted".

MVCC in Oracle
Oracle maintain old versions in rollback segments (also known as
'undo log').  A transaction ID is not a sequential number; instead, it is
made of a set of numbers that points to the transaction entry (slot) in a
Rollback segment header.

Rollback segments have the property that new transactions can reuse
storage and transaction slots used by older transactions that are
committed or aborted.
This automatic reuse facility enables Oracle to manage large numbers
of transactions using a finite set of rollback segments.

The header block of the rollback segment is used as a transaction table.
Here the status of a transaction is maintained (called System Change Number,
or SCN, in Oracle).  Rather than storing a transaction ID with each row
in the page, Oracle saves space by maintaining an array of unique transactions
IDs separately within the page, and stores only the offset of this array with
the row.

Along with each transaction ID, Oracle stores a pointer to the last undo record
created by the transaction for the page.  Not only are table rows stored in this
way, Oracle employs the same techniques when storing index rows. This is
one of the major difference between PostgreSQL and Oracle.

When an Oracle transaction starts, it makes a note of the current SCN. When
reading a table or an index page, Oracle uses the SCN number to determine if
the page contains the effects of transactions that should not be visible to the
current transaction.  Oracle checks the commit status of a transaction by
looking up the associated Rollback segment header, but, to save time, the first
time a transaction is looked up, its status is recorded in the page itself to avoid
future lookups.

If the page is found to contain the effects of invisible transactions, then Oracle
recreates an older version of the page by undoing the effects of each such
transaction. It scans the undo records associated with each transaction and
applies them to the page until the effects of those transactions are removed.
The new page created this way is then used to access the tuples within it.

Record Header in Oracle
A row header never grows, always a fixed size. For non-cluster tables,
the row header is 3 bytes.  One byte is used to store flags, one byte to
indicate if the row is locked (for example because it's updated but not
committed), and one byte for the column count.

MVCC in SQL Server
Snapshot isolation and read committed using row versioning are enabled
at the database level.  Only databases that require this option must enable
it and incur the overhead associated with it.

Versioning effectively starts with a copy-on-write mechanism that is
invoked when a row is modified or deleted. Row versioning–based
transactions can effectively "view" the consistent version of the data
from these previous row versions.

Row versions are stored within the version store that is housed within the
tempdb database.  More specifically, when a record in a table or index is
modified, the new record is stamped with the "sequence_number" of the
transaction that is performing the modification.
The old version of the record is copied to the version store, and the new record
contains a pointer to the old record in the version store.
If multiple long-running transactions exist and multiple "versions" are required,
records in the version store might contain pointers to even earlier versions of
the row.

Version store cleanup in SQL Server
SQL Server manages the version store size automatically, and maintains a
cleanup thread to make sure it does not keep versioned rows around longer
than needed.  For queries running under Snapshot Isolation, the version
store retains the row versions until the transaction that modified the data
completes and the transactions containing any statements that reference the
modified data complete.  For SELECT statements running under
Read Committed Snapshot Isolation, a particular row version is no longer
required, and is removed, once the SELECT statement has executed.

If tempdb actually runs out of free space, SQL Server calls the cleanup
function and will increase the size of the files, assuming we configured the
files for auto-grow.  If the disk gets so full that the files cannot grow,
SQL Server will stop generating versions. If that happens, any snapshot
query that needs to read a version that was not generated due to space
constraints will fail.

Record Header in SQL Server
 4 bytes long
 - two bytes of record metadata (record type)
 - two bytes pointing forward in the record to the NULL bitmap. This is
   offset to some actual data in record (fixed length columns).

Versioning tag - this is a 14-byte structure that contains a timestamp
plus a pointer into the version store in tempdb.
Here timestamp is trasaction_seq_number, the only time that rows get
versioning info added to record is when it’s needed to support a
versioning operation.

As the versioning information is optional, I think that is the reason
they could store this info in index records as well without much

Database PostgreSQL Oracle SQL Server
Storage for Old Versions In the main Segment (Heap/Index) In the separate segment (Rollback Segment/Undo) In the separate database (tempdb – known as version store)
Size of Tuple Header (bytes) 24 3 Fixed – 4 Variable - 14
Clean up Vacuum System Monitor Process (SMON) Ghost Cleanup task

Conclusion of study
As other databases store version/visibility information in index, that makes
index cleanup easier (as it is no longer tied to heap for visibility information).
The advantage for not storing the visibility information in index is that for
Delete operations, we don't need to perform an index delete and probably the
size of index record could be somewhat smaller.

Oracle and probably MySQL (Innodb) needs to write the record in undo
segment for Insert statement whereas in PostgreSQL/SQL Server, the new
record version is created only when a row is modified or deleted.

Only changed values are written to undo whereas PostgreSQL/SQL Server
creates a complete new tuple for modified row.  This avoids bloat in the main
heap segment.

Both Oracle and SQL Server has some way to restrict the growth of version
information whereas PostgreSQL/PPAS doesn't have any way.


  1. AFAIK, Oracle does not write inserted row o undo, only updated or deleted, as in PostgreSQL.

    1. *to undo

      I use PostgreSQL for a couple of years but I still don't understand advantages of vacuum approach instead of undo :(

    2. For Insert operation, it do write some information in undo so that it can delete the row during Rollback, it don't write complete row for that purpose though. The main intention to mention that point is that systems that use Rollback Segements' have some (not too much) overhead for Insert operation as compare to systems that don't maintain Rollback segments'.

    3. Some advantages of the PostgreSQL way are that the Oracle way makes rollback very expensive,and requires that you manage two sets of storage. The biggest advantage was the lack of bloat in your main tablespace when your dead rows exceeded your freespace map. Now that the map is dynamically sized, that part is better, but there can still be issues with blocked vacuum processes allowing bloat, and you still have transaction id wraparound as a potential issue. Having worked with both, they each have advantages and disadvantages, but I find the PostgreSQL way to be generally easier to manage and more efficient.

    4. Grant, the one major difference is that in PostgreSQL style cleanup, even if Vacuum cleans up the dead rows, still the space will not be released unless it is at the end of table. OTOH, in Oracle, all the space can be reclaimed.

  2. Volk, there are some advantages and some disadvantages for each of the method. One advantage of using Vacuum based method is that for Delete operation, there is no overhead of maintaining undo and also Rollback is very cheap in PostgreSQL. On the other hand, the advantage of maintaining undo is that it helps to control bloat (due to multiple versions) in a much better way and it seems that's what majority people care.

    1. Thanks for your explanation, Amit.

      And one more question. Am I right that after update of a tuple PostgreSQL stores "symlink" from old tuple to new one only? And not in the opposite direction. So right now there is no ability to implement in PostgreSQL interface like "SELECT ... AS OF TIMESTAMP current_timestamp - '1 minute'::interval"?

  3. Yes the link will be stored from old to new tuple.
    There is no direct ability in PostgreSQL for "AS OF TIMESTAMP" (aka Flashback in Oracle), however you can once look at few timetravel functions at link : http://www.postgresql.org/docs/devel/static/contrib-spi.html and see if that can serve your purpose.

  4. 24 -- too wide.
    tableoid - why do we store per each tuple? wouldn't it be better/possible to store tableoid once per page instead?

    have ideas?

    1. One of the reasons is that it can be used in constraint evaluation and even if we make changes such that we can store it at page level, I don't think there will be much benefit. PostgreSQL always consider the length based on alignment, so saving 4 bytes from tuple header won't change the size on 64bit m/c's.

      Apart from that there could be some ways like storing versioned tuple in some other place (undo or version store) Or by having variable length tuple header such that we store some of the information (versioning related information) only when it is required. However this may require quite invasive changes in backend code and we also need to take care of providing way to upgrade the earlier version databases if we change the tuple format.

      In short, it is valuable if we can reduce the size of fixed tuple header, however the project to do that will also be quite complex.

    2. My insane thoughts were about:
      1) tableoid lift up into page and
      2) varlen messagepack for tuple header moreover with internal delta encoding for xmin/xmax, cmin/cmax.

      Something like this.

    3. What do mean by delta encoding for xmin/xmax, .., do you want to store it in some differential format, so that they can be reconstructed when required?
      To me, Idea-wise it sounds like worth further investigation, however you have to be careful that such a change might require huge changes in backend code including a mechanism for upgrade. Apart from that, I think there are some downsides of variable length header as well like when you try to attach such a header to tuple, it might not fit in current page, then you have to shift that tuple to new page which can make such an update quite expensive.

    4. I'll try investigate. Delta encoding: xmin, and not full xmax but diff to xmin or vice versa; and not cmax but diff to cmin. Maybe it could give something.

    5. For cmin and cmax, there is already an optimisation done in form of combocid's in code to reduce space. If you can come up with some good solution, then I suggest it is better to discuss with PostgreSQL community.

  5. Is it possible to consider a hybrid approach? If the application can participate in the decision to use an undo log, you can exploit many factors (likelyhood of rollback, tolerance of rollback etc) that may be specific to the application that might tune performance.

    Or, you could use rollback log for small transactions and start writing into the heap after a certain number of pages gets exceeded.

  6. Good point, though this overhead should be very small, but if it happens for every other tuple, then it might matter. However, I don't expect any big difference due to this.