Is wsrep single-threaded?

I’m comparing the performance of XtraDB Cluster with XtraDB using sysbench. What I’ve seen makes me think that wsrep is single-threaded.

For example, I run a sysbench oltp test on the database host. The results are identical for XtraDB (with no replication) and a single cluster node with ‘wsrep_on = OFF’. In both cases, the total execution time for a mixed read/write workload falls from 400 sec to 100 sec as the number of (sysbench) threads is increased from 1 to 8.

However, when ‘wsrep_on = ON’, the total execution time for a single node remains almost constant as threads are added. There is only a small decrease, from 400 to 360 sec, when a second (sysbench) thread is added, and no further change with additional threads.

This is an artificial test and I’m not proposing to run a single-node cluster! However, it seems to show that work which XtraDB can process in parallel, a single-node XtraDB Cluster (with wsrep_on = ON) processes serially. Can anyone confirm that, or is there some other explanation?
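For reference, the workload was driven with sysbench’s legacy OLTP test, along these lines (table size, request count, and credentials here are illustrative placeholders, not the exact values used):

```shell
# Prepare a test table, then run a mixed read/write OLTP workload
# with an increasing thread count. Flags follow sysbench 0.4-style syntax.
sysbench --test=oltp --mysql-user=sbtest --mysql-db=sbtest \
         --oltp-table-size=1000000 prepare

for t in 1 2 4 8; do
  sysbench --test=oltp --mysql-user=sbtest --mysql-db=sbtest \
           --oltp-table-size=1000000 --max-requests=100000 \
           --num-threads=$t run
done
```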

What you are seeing is the introduction of “CERTIFICATION” to the commit process by enabling wsrep. The lifecycle of a transaction in Galera looks like this:

  1. Source node: transaction works as normal, client does COMMIT (or autocommit)
  2. On Commit:
    a) transaction is replicated to all nodes in the cluster (if any) – GTID cluster-wide transaction ordering is determined here.
    b) transaction is certified; if it fails, the client gets a “deadlock”. This can only happen if conflicting writes were replicated before this transaction.
    c) if certification passes, local node does normal innodb commit
  3. On other nodes (if any),
    a) node synchronously receives transaction (from 2a)
    b) transaction is certified; if it fails, the transaction is dropped (no reply to source node).
    c) transaction is applied (this can be done in parallel when it is safe and if you set wsrep_slave_threads > 1)
    d) transaction is committed locally
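The certification step (2b/3b) can be sketched as a check of the incoming write-set’s keys against recently committed write-sets. This is a toy illustration of the idea, not the actual wsrep API; all names here are made up:

```python
# Toy sketch of Galera-style certification: a transaction fails if any key
# it wrote was modified by a transaction that committed (cluster-wide) after
# this transaction's snapshot point.

class CertificationIndex:
    def __init__(self):
        # key -> seqno of the last certified write-set that modified it
        self.index = {}

    def certify(self, writeset_keys, last_seen_seqno, new_seqno):
        # Conflict: some key was changed by a later-committed transaction
        # that this transaction never saw -> client gets a "deadlock" error.
        for key in writeset_keys:
            committed = self.index.get(key)
            if committed is not None and committed > last_seen_seqno:
                return False
        # Certification passed: record our writes in the index.
        for key in writeset_keys:
            self.index[key] = new_seqno
        return True

idx = CertificationIndex()
print(idx.certify({"row:1"}, last_seen_seqno=0, new_seqno=1))  # True: no prior writes
print(idx.certify({"row:1"}, last_seen_seqno=0, new_seqno=2))  # False: conflicts with seqno 1
print(idx.certify({"row:2"}, last_seen_seqno=0, new_seqno=3))  # True: disjoint keys
```

Note that the index updates themselves must happen in a single global order, which is exactly why this step is serialized.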

There are important points here where transactions are serialized in the stream. They are:

  • 2a
  • 2b (and by extension 2c)
  • on the other nodes (if any) 3b and 3d

This serialization is because you can’t just intersperse random transactions together in any old order and actually get nodes that have the same data.

All this being said, certification must be done serially and that does tend to slow some things down a bit, even on a single node cluster. innodb_flush_log_at_trx_commit=1 tends to make this effect worse (in my testing). Galera 3 (and PXC 5.6 which is in beta in our experimental repos) is supposed to be more efficient at certification, so I’d be interested to see how your tests run there (been meaning to do this myself).

Thanks for the clear explanation. Performance improved by an order of magnitude when I set innodb_flush_log_at_trx_commit=2. I realise that there is a risk of losing 1 sec of transactions if a server fails, but it seems to be a safe choice in a clustered environment (unless all the nodes fail!).
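For anyone else tuning this, the settings discussed above go in my.cnf; the wsrep_slave_threads value below is just an example, and whether the durability trade-off is acceptable depends on your environment:

```ini
[mysqld]
# Flush the InnoDB log to disk once per second instead of at every commit.
# Risks losing up to ~1 sec of transactions if this node crashes, but the
# other cluster nodes still hold the replicated data.
innodb_flush_log_at_trx_commit = 2

# Allow parallel apply of non-conflicting replicated transactions (step 3c).
wsrep_slave_threads = 4
```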

Precisely. I’ve been wondering if 5.6/Galera 3 improves this at all, but I haven’t gotten to testing it yet.