I've been dealing with MySQL Cluster in one way or another since around 2005 (back in the MySQL 4.1 days), but it is still full of "funny" surprises. This post is a collection of different locking-related issues I ran into during the previous weeks that I had not been aware of until now (or had simply forgotten over time).
== Unique hash indexes lock exclusively ==
This is the one that regular users are most likely to run into: in general, row locks in MySQL Cluster distinguish between reads and writes, so writers block other writers but not readers, and readers in other transactions always see the last committed row value (Cluster currently only supports the READ COMMITTED isolation level). As soon as you have a secondary unique index in addition to the primary key, things are different though. Internally a unique index that is not the primary key is implemented as a unique hash index in Cluster (and optionally also as an additional ordered T-Tree), and for unique hash entries there only seems to be exclusive locking, so writers will block readers across transactions.
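A quick illustration of the effect (untested sketch, table and column names made up):

```sql
-- An NDB table with a secondary unique key
CREATE TABLE account (
  id      INT PRIMARY KEY,
  uuid    CHAR(36) NOT NULL,
  balance INT,
  UNIQUE KEY uk_uuid (uuid)
) ENGINE=NDBCLUSTER;

-- Session 1: write-lock a row, transaction left open
BEGIN;
UPDATE account SET balance = balance - 10 WHERE id = 1;

-- Session 2: a plain read via the primary key returns the last
-- committed value right away (READ COMMITTED as usual) ...
SELECT balance FROM account WHERE id = 1;

-- ... but a lookup through the unique hash index has to take a lock
-- and blocks until session 1 commits or rolls back, eventually
-- failing with a lock wait timeout (ERROR 1205)
SELECT balance FROM account WHERE uuid = 'some-uuid-value';
```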
== Long running transactions can block starting nodes ==
Less likely to happen, but very annoying if it happens and you don't know about it:
A starting node needs to lock all rows for a very short period of time near the end of start phase 5. At this point it needs to wait for all current transactions to release the locks they are holding (by either COMMIT or ROLLBACK). This is usually not a problem, as Cluster transactions are not supposed to last very long anyway, but if there happens to be a long-running transaction it can potentially block a node start forever. A stopping node will wait for pending transactions for five seconds and then terminate them the hard way; a starting node, on the other hand, will gracefully wait forever.
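The easiest way to see this for yourself is a forgotten open transaction (again an untested sketch with a made-up table name):

```sql
-- Session on any SQL node: a transaction left open by accident
BEGIN;
UPDATE t1 SET val = val + 1 WHERE id = 1;
-- ... no COMMIT or ROLLBACK ...

-- Management client: restart a data node
-- ndb_mgm> 3 RESTART
-- The node proceeds up to start phase 5 and then waits silently
-- for the lock held above until the transaction is committed,
-- rolled back, or its connection is killed.
```

On versions that ship the ndbinfo database, the cluster_transactions view can at least help to spot such long-running transactions manually before restarting a node.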
This is bad in two ways:
1) it is obviously bad availability-wise, as you may end up with a non-redundant configuration for extended periods of time, and without any way to automatically identify and terminate the offending transaction(s)
2) currently the blocked starting node does not even report what it is waiting for, it just silently sits and waits. So unless you know what is going on you are faced with a node that is simply stuck without doing anything (CPU, disk and network load next to zero)
The related bug report is http://bugs.mysql.com/65037 ; the proposed solutions are meaningful log messages in the short term and killing blocking transactions after a grace period in the long run.
== Weird error message on lock timeout ==
This last item is related to the non-standard INSERT ... ON DUPLICATE KEY UPDATE construct. What can happen here is that the INSERT part fails with a lock wait timeout because the key is write-locked by another transaction. The expected error message in this case would simply be "Lock wait timeout exceeded", but what you actually get is "Transaction already aborted". What seems to happen is that the ON DUPLICATE KEY UPDATE part is triggered by any error in the INSERT phase, not only by a duplicate key violation. So the INSERT part is tried, fails with "Lock wait timeout exceeded" (which implicitly rolls back the current transaction), then the UPDATE part is tried, fails with "Transaction already aborted" and overwrites the previous error ...
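For the record, this is roughly how to reproduce it (table name made up, untested sketch):

```sql
CREATE TABLE counters (
  name CHAR(32) PRIMARY KEY,
  hits INT NOT NULL
) ENGINE=NDBCLUSTER;

-- Session 1: write-lock the key and keep the transaction open
BEGIN;
INSERT INTO counters VALUES ('index', 1);

-- Session 2: runs into the lock held by session 1
INSERT INTO counters VALUES ('index', 1)
  ON DUPLICATE KEY UPDATE hits = hits + 1;
-- expected: ERROR 1205 "Lock wait timeout exceeded"
-- actual:   an NDB "Transaction already aborted" error instead
```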
== Comments ==
Note that the lock taken on a unique index during a lookup is a shared read lock, rather than an exclusive lock, so other readers are unaffected. Of course, a shared read lock cannot be obtained if the row is already locked for write or with an exclusive read lock, and while it is held other transactions cannot write to the row or obtain an exclusive lock. Tables with Blobs have a similar behaviour (as you probably recall), and we alleviated the impact there slightly by taking the shared read lock for a shorter time. The same approach may be useful here.
The behaviour you observe during node recovery is interesting. I'm not sure that 'all rows are locked' at some point during startup, but obviously there's some issue there with a long running transaction. Your proposed solutions sound like a good approach.
The 'Transaction already aborted' error seems like a bug in the Server. Thanks for raising these reports, glad to hear that it's still 'fun'.