Two-phase commit in distributed transactions
Reading: MIT 6.033 - Principles of Computer System Design: Chapter 9: 9.1.5, 9.1.6, 9.5.2, 9.5.3, 9.6.3
Before-or-after Atomicity
Before-or-after atomicity: concurrent actions that appear to occur completely before or after one another.
It’s trivial to perform concurrent computation for tasks that don’t interact with each other. However, when threads try to read/write to the same memory, things get complicated.
Traditionally, we use locks to protect shared data. But we need to make sure that every action that touches a shared variable follows the locking protocol. This is hard, as the programmer doesn’t necessarily know all the other actions that might touch the shared variable.
Example: Thread 1 reads some data while Thread 2 has already read the same location and is about to write back to it; one of the updates is lost and the result is incorrect.
We need a discipline in interacting with shared data to ensure correct coordination.
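As a minimal sketch of the kind of race this discipline prevents (my own example, not from the reading), the Go program below has two goroutines increment a shared counter with no coordination, so increments are lost:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	counter := 0
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				// Read-modify-write is not atomic: both goroutines can
				// read the same old value, and one increment is lost.
				counter++
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // frequently less than 2000
}
```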
Correctness and Serialization
Building on the idea of before-or-after atomicity, we define a correct concurrent execution: every result produced by the concurrent execution of a set of operations must also be achievable by some serial application of those same operations (in some order).
In the example of Fig 9.2, any concurrent execution of the two transactions is correct if the result is 85, the value achieved by a serial application of the two transactions.
Serializability: there exists some serial order of the concurrent transactions that would lead to the same ending state, if followed.
Some requirements that are stronger than serializability:
- External time consistency: if T1 ended before T2 began, the serialization order must be T1, T2
- Sequential consistency: given a sequence of operations, the concurrent execution result must be the same as the serial application of the given sequence
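To make serializability concrete, here is a toy example of my own (not Fig 9.2): T1 adds 10 to x, T2 doubles x, and x starts at 5. A concurrent execution is correct only if it ends in a state that some serial order could produce:

```go
package main

import "fmt"

func main() {
	x := 5
	fmt.Println("T1 then T2:", (x+10)*2) // 30
	fmt.Println("T2 then T1:", x*2+10)   // 20
	// A concurrent execution of T1 and T2 is serializable iff it ends with
	// x == 30 or x == 20. The interleaving "T1 reads 5, T2 reads 5,
	// T1 writes 15, T2 writes 10" ends with x == 10, which matches no
	// serial order, so it is incorrect.
}
```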
Simple Locking
One approach to implementing transactions is simple locking: a transaction locks every piece of data it might touch before executing the computation, and releases the locks only at the end. This guarantees correctness, but leaves a lot of room for performance improvement, because it locks up excess data that may never actually be read. Consider an if-else statement: the program must lock the data in both the if and the else case, even though only one of them will execute.
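A minimal sketch of simple locking (my own code; the lockTable and record names are hypothetical) makes the if-else cost visible, since the lock set must cover both branches:

```go
package main

import "sync"

// Hypothetical lock table keyed by record name.
var lockTable = map[string]*sync.Mutex{"a": {}, "b": {}}

// update locks the data of BOTH branches up front, even though only one
// branch will execute; that over-locking is the cost of simple locking.
func update(cond bool) {
	for _, rec := range []string{"a", "b"} {
		lockTable[rec].Lock()
		defer lockTable[rec].Unlock() // released only when update returns
	}
	if cond {
		// ... read/write record "a" ...
	} else {
		// ... read/write record "b" ...
	}
}

func main() { update(true) }
```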
Two Phase Locking
In Two Phase Locking (2PL):
- a transaction must acquire a record’s lock before using it
- a transaction must hold its locks until after commit or abort
It performs better than simple locking because a transaction acquires locks gradually, as it touches each record, so records it never touches remain available to other transactions.
2PL implements serializability: if two concurrent transactions touch overlapping data, each must acquire the locks on that shared data, and the order in which they acquire those locks imposes an implicit serial order on the transactions.
The reason to hold the locks until after commit or abort is to prevent the following scenario: thread A writes some data and releases the lock; thread B acquires the lock and reads the data; thread A then decides to abort the transaction and rolls back its changes. Thread B has now read data that was never committed (a dirty read).
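A bare-bones 2PL sketch (my own, using per-record mutexes in place of a real lock manager) that follows both rules:

```go
package main

import "sync"

// Txn tracks the locks a transaction holds (the growing phase) so that
// they can all be released together at commit/abort (the shrinking phase).
type Txn struct {
	held []*sync.Mutex
}

// Get acquires a record's lock before the transaction uses the record.
func (t *Txn) Get(m *sync.Mutex) {
	m.Lock()
	t.held = append(t.held, m)
}

// Commit releases every lock at once; Abort would do the same after
// rolling back the transaction's writes.
func (t *Txn) Commit() {
	for _, m := range t.held {
		m.Unlock()
	}
	t.held = nil
}

func main() {
	var x, y sync.Mutex
	t := &Txn{}
	t.Get(&x) // lock x just before first use
	t.Get(&y) // lock y just before first use
	// ... read/write x and y ...
	t.Commit() // locks are held until after commit, per 2PL
}
```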
Note that 2PL produces deadlocks easily:

```
T1        T2
get(x)    get(y)
get(y)    get(x)
```
The system must detect the deadlock (e.g., by finding a cycle in the lock waits-for graph, or by timing out lock waits) and abort one of the transactions.
One way to solve this deadlock problem that came up during our discussion is to maintain a total order over the locks, so that lock x is always acquired before lock y.
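A sketch of that lock-ordering idea (my own code; the record names are the x and y from the table above): each transaction sorts the names of the locks it needs before acquiring any, so conflicting transactions acquire overlapping locks in the same global order and no waiting cycle can form.

```go
package main

import (
	"sort"
	"sync"
)

// Global lock table; concurrent reads of the map are safe after init.
var locks = map[string]*sync.Mutex{"x": {}, "y": {}}

// acquireInOrder sorts the lock names first, so every transaction
// acquires overlapping locks in the same (lexicographic) total order.
func acquireInOrder(names []string) {
	sort.Strings(names)
	for _, n := range names {
		locks[n].Lock()
	}
}

func main() {
	var wg sync.WaitGroup
	// T1 wants (x, y); T2 wants (y, x). Without ordering they can deadlock.
	for _, want := range [][]string{{"x", "y"}, {"y", "x"}} {
		wg.Add(1)
		go func(names []string) {
			defer wg.Done()
			acquireInOrder(names)
			// ... use the records ...
			for _, n := range names {
				locks[n].Unlock()
			}
		}(want)
	}
	wg.Wait()
}
```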
Distributed Transactions - Two Phase Commit
In this scenario, we want to build a system that exposes some atomic endpoints, where each endpoint performs an atomic operation on a distributed cluster of machines. Atomicity here implies that if one machine in the cluster fails to complete its part, the whole atomic operation (as exposed by the endpoint) must fail.
The 2PC system consists of Transaction Coordinators and some Workers. The Coordinator sends an RPC to the Workers, asking them to perform some task (in the `PREPARE` message).
Upon receiving the RPC, each Worker either
- acquires the local locks, performs the task and enters pre-commit state, or
- aborts.
Either way, the Worker replies with the result (pre-committed or aborted) in a `PREPARED` message to the Coordinator, enters the PREPARED state, and waits for a reply. Note that the Workers don't release their locks yet at this step, so any other RPC that wants to operate on the same chunk of data as the current transaction has to wait until the Coordinator decides to commit or abort the transaction and tells the Workers to `COMMIT` (or abort), at which point the Workers release their locks. It's also worth noting that once a Worker enters the PREPARED state, it cannot unilaterally abort, since the Coordinator will assume that the Worker will (eventually) commit.
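Putting the message flow together, here is a bare-bones coordinator sketch (my own; the Worker interface and the Prepare/Commit/Abort method names are assumptions, not the course's actual API):

```go
// Package twopc: a toy two-phase-commit coordinator.
package twopc

// Worker is the assumed RPC surface each participant exposes.
type Worker interface {
	Prepare(txnID int) bool // acquire locks, persist state; true iff PREPARED
	Commit(txnID int)       // apply the writes and release the locks
	Abort(txnID int)        // roll back and release the locks
}

// RunTxn drives both phases: commit only if every Worker prepared.
func RunTxn(txnID int, workers []Worker) bool {
	// Phase 1: PREPARE. A single refusal (or, in a real system, a
	// timeout) aborts the whole transaction.
	for _, w := range workers {
		if !w.Prepare(txnID) {
			for _, v := range workers {
				v.Abort(txnID)
			}
			return false
		}
	}
	// Phase 2: every Worker has promised to commit, so COMMIT is safe
	// to send, and must eventually reach all of them.
	for _, w := range workers {
		w.Commit(txnID)
	}
	return true
}
```

Note the asymmetry: before phase 2 starts, any single Worker can veto the transaction; once it starts, the outcome is fixed.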
Recovering from Failures
Distributed systems are all about handling failures, so here we see how this system responds to failures.
This system trades availability for stronger consistency. If any node, whether Coordinator or Worker, is down, the system blocks and becomes unavailable.
The Worker can fail at these instances:
1. Before receiving the task from the Coordinator
2. After receiving the task, before replying `PREPARED` to the Coordinator
3. After replying `PREPARED`, before receiving `COMMIT`
4. After receiving `COMMIT`
The Coordinator is a persistent sender. In both (1) and (2), the Coordinator receives no response, so it will retry sending the task, or reallocate the task to another Worker.
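A sketch of that persistent-sender loop (my own; it would live in the same package as the coordinator sketch above and reuses its hypothetical Worker interface, and for brevity it conflates "no reply" with "refused"):

```go
import "time"

// PersistentPrepare resends PREPARE until the Worker answers or we give up.
func PersistentPrepare(w Worker, txnID int, attempts int) bool {
	for i := 0; i < attempts; i++ {
		if w.Prepare(txnID) {
			return true // got PREPARED
		}
		time.Sleep(100 * time.Millisecond) // wait, then resend
	}
	return false // give up: abort, or reassign the task to another Worker
}
```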
In (2), the changes applied can be stored in volatile memory, so that they get cleaned up during failure recovery. If the Coordinator asks again, the Worker can respond with an abort, and atomicity still holds, since the Coordinator can't commit the transaction without this Worker's `PREPARED` message.
(3) is trickier, because the Worker has promised the Coordinator that it will commit the operation, and the Coordinator will assume that it does. Therefore, before the Worker replies `PREPARED`, it must persist all the information required to commit the transaction, including the locks that it needs to hold. This can be stored on disk in a log, so that upon recovering from the failure the Worker can read the log, re-establish its PREPARED state, and finish the transaction.
In (4), the Worker has already partially committed the data, and the data is durable on disk, so there's not much concern here: it can simply finish the commit using the log mentioned previously.
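A sketch of the Worker-side logging this implies (my own; the log file name and record format are made up): everything needed to commit is forced to disk before the `PREPARED` reply goes out.

```go
package worker

import (
	"fmt"
	"os"
)

// Prepare durably logs everything needed to finish the transaction BEFORE
// replying PREPARED; after that reply, unilateral abort is not allowed.
func Prepare(txnID int, writes map[string]string) bool {
	f, err := os.OpenFile("txn.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return false // can't make the promise durable, so refuse to prepare
	}
	defer f.Close()
	for key, val := range writes {
		fmt.Fprintf(f, "PREPARED %d %s=%s\n", txnID, key, val)
	}
	if f.Sync() != nil { // force the record onto the disk itself
		return false
	}
	return true // now it is safe to reply PREPARED
}
```

On recovery, the Worker would scan the log, re-acquire the locks for transactions logged as PREPARED with no recorded outcome, and ask the Coordinator whether to commit or abort; entries whose COMMIT already arrived (case 4) can simply be re-applied.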
These notes are summarized from the readings, lecture notes, and lecture videos of MIT 6.824 Distributed Systems. I took this module under the NUS Design-Your-Own-Module initiative with a group of NUS students, where we would meet weekly to discuss the readings.