Two-phase commit in distributed transactions

Two-phase commit in distributed transactions Feb 15, 2021

Reading: MIT 6.033 - Principles of Computer System Design: Chapter 9 : 9.1.5, 9.1.6, 9.5.2, 9.5.3, 9.6.3

Before-or-after Atomicity

Before-or-after atomicity: concurrent actions that appear to occur completely before or after one another.

It’s trivial to perform concurrent computation for tasks that don’t interact with each other. However, when threads try to read/write to the same memory, things get complicated.

Traditionally, we use locks to protect shared data. But we need to make sure that every action that touches a shared variable follows the locking protocol. This is hard, as the programmer doesn’t necessarily know all the other actions that might touch the shared variable.

Example: reading some data in Thread 1; while Thread 2 has read and is going to write back to the same location -> incorrect result.

We need a discipline in interacting with shared data to ensure correct coordination.

Correctness and Serialization

Building on the idea of before-or-after atomicity, we define a correct concurrent execution: every result (achieved by concurrent execution/operations) is achievable via some serial application of the same operations (in some order).

In the example of Fig 9.2, any concurrent execution of the two transactions is correct if the result is 85 - achieved by the serial application of the two operations.

Serializability: there exists some serial order of the concurrent transactions that would lead to the same ending state, if followed.

Some requirement that are stronger than serializability:

External time consistency: if T1 ended before T2 begins, the serialization order must be T1, T2
Sequential consistency: given a sequence of operations, the concurrent execution result must be the same as the serial application of the given sequence

Simple Locking

One approach to implementing transactions is simple locking: locking every piece of data that might be read before executing the computation. This guarantees correctness, but leaves a lot of room for improvement in performance, because it’s locking up excess data that might not even be read eventually. Consider an if-else statement, a program will need to lock the data in both the if and else case; despite that only 1 of them will be used.

Two Phase Locking

In Two Phase Locking (2PL):

a transaction must acquire a record’s lock before using it
a transaction must hold its locks until after commit or abort

It is better than Simple Locking, as the execution gradually acquires locks, so that it allows some possible data parallelism.

It implements serializability. If two concurrent operations are before-and-after operations, they will need to acquire locks for the shared/overlapping data. The locks impose an implicit order to the operations.

The reason to hold the lock until after commit or abort is to prevent the following scenario: thread A writes some data and releases the lock, thread B acquires the lock and reads the data, thread A decides to abort the transaction and unwind the changes; now thread B has erroneous data in an undefined state.

Note that 2PL produce deadlocks easily:

  T1      T2
  get(x)  get(y)
  get(y)  get(x)

The system must detect (cycles? lock timeout?) and abort a transaction. One way to solve this deadlock problem that came up during our discussion is to maintain a total order for the locks, so that lock x is always acquired before lock y.

Distributed Transactions - Two Phase Commit

In this scenario, we want to build a system that exposes some atomic endpoints, such that each endpoint performs atomic operations on a distributed cluster of machines. By atomicity, it implies that if 1 machine in the cluster fails to complete the operation, the whole atomic operation (as exposed by the endpoint) should fail.

The 2PC system consists of Transaction Coordinators and some Workers. The Coordinator sends RPC to the Workers, and asks them to perform some task (in the PREPARE message).

Upon receiving the RPC, each server either

acquires the local locks, performs the task and enters pre-commit state, or
aborts.

Either way, the Worker replies with the result (pre-commited or aborted in a PREPARED message) to the Coordinator, enters the PREPARED state, and wait for a reply. Note that the Workers don’t release their locks yet at this step, so any other RPC that wants to operate on the same chunk of data as the current transaction will have to wait, until the Coordinator decides to Commit or Abort the transaction and tells the workers to COMMIT, and the Workers then the release their locks. It’s also worth nothing that, once a Worker enters the PREPARED state, it cannot unilaterally abort; since the Coordinator will assume that the Worker will (eventually) commit.

Recovering from Failures

Distributed systems are all about handling failures, here we see how this system responses to failures.

This system trades availability for stronger consistency. If any of the node, either Coordinator or Worker is down, the system blocks and becomes unavailable.

The Worker can fail at these instances:

Before receiving the task from the Coordinator
After receiving the task, before replying PREPARED to the Coordinator
After replying PREPARED, before receiving COMMIT
After receiving COMMIT

The Coordinator is a persistent sender. In both (1) and (2), the Coordinator doesn’t receive a response, and it will retry sending the task, or reallocating the task to another Worker.

In (2), the changes applied can be stored into a volatile memory, so that it gets clean-up during a failure recovery. If the Coordinator asks again, it can respond with Abort, and the atomicity consistency still holds, since the Coordinator can’t commit the message without this Worker’s PREPARED message.

(3) is trickier. Because it promised to the Coordinator to commit the operation, and the Coordinator will assume that it will commit. Therefore, before the Worker replies PREPARED, it must persist all the information required to commit the transaction, including the locks that it needs to hold. This can be stored on disk in a log, so that it can rewind the operation upon recovering from the failure by reading the logs.

In (4), the Worker has already partially-commited the data and the data is stored on disk, so there’s not much concern here. It can just commit the data using the log mentioned previously.

These notes are summarized from the readings, lecture notes, and lecture videos in MIT 6.824 Distributed Systems . I took this module under the NUS Design-Your-Own-Module initiative with a group of NUS students, where we would meet weekly and discuss about the readings.