When I was studying distributed algorithms, I was wondering why there are suprisingly little simple algorithms or algorithms with simple explanations in distributed systems. The existing ones were often quite complicated or very old, like the Echo algorithm which has been discovered by Ernest J.H. Chang in 1982.
One reason is that distributed systems are hard. There are impossibility theorems like the FLP impossibility theorem from Fischer, Lynch and Paterson which says that no deterministic fault-tolerant consensus protocol can guarantee progress in an asynchronous network.
The problems that distributed algorithms try to solve are also hard. Normal algorithms are about sorting and searching in graphs and strings. They are described in detail in the algorithm books from Sedgewick or in the one from Cormen, Leiseron and Rivest. Distributed algorithms are more complicated, you have to deal with different times and race conditions, messages which never arrive, processors that may fail, etc. Therefore distributed algorithms are often about preventing these traps, like mutual exclusions to prevent race conditions, or about treating a distributed system as a unified one, which leads to atomic commits, election and consensus, etc.
One such algorithm is the Paxos algorithm from Leslie Lamport. It is an algorithm for solving the problem of consensus in a network of unreliable processors. Leslie Lamport himself has written a paper Paxos made simple because people complained the algorithm is too difficult to understand. And it turns it out it might not be that difficult to understand the basics at all. It is quite similar to a normal common sense approach to assemble a team of participants, let us say to play a game of “Kicker”.
Many startups have a “Kicker” table, where employees can play occasionally a game of table football for recreation. A common sense approach to assemble a team of participants is simple. Let us says the employees communicate over a chat program like Slack or Skype, and share a common channel. A typical communication might look like this:
Tom wants to propose a game
5:02 Tom: 1
5:03 David: 2
5:35 John: 3
5:39 Andrew: 4
5:40 Tom: Go
5:40 David: Go
5:41 John: Go
5:41 Andrew: Go
=> Round successful, game can be started
When the last of the 4 necessary team members has agreed to accept, then the team has reached consensus who wants to play and the game can be started immediately. Let us look at the sequence of steps a bit more detailed:
- The first participant who wants to start a game, here the one named “Tom”, starts to count, and ask the others to prepare a game. We call him proposer
- A second participant promises to participate, increases the counter and sends a corresponding message. He is called acceptor
- If three acceptors have send their promise to participate, and the counter has reached the necessary limit of 4, which we call a quorum, then the next phase can begin
- In the next phase the proposer can ask for acceptance from the acceptors. Each acceptor must acknowledge the acceptance by sending a message like “Go”.
If not enough acceptors can be found to participate in the game, the round failed, and the game can not be started. Then a new proposer must make a new proposal, and a new round is started.
John wants to propose a game
6:12 John: 1
6:13 Anne: 2
6:22 Mike: 3
7:15 Aaron: 4
7:16 John: Go
7:17 Anne: Too late, I am already going home
=> Round failed
And while it is far more simple than the real Paxos algorithm, the basic phases are quite similar. We have proposers and acceptors, phase 1 (Prepare & Promise) and phase 2 (Accept & Accepted), rounds can succeed or fail, and in the end the team must find a consensus who wants to play. You see, sometimes simple things as chat programs can help to explain complicated distributed algorithms. Slack can help to understand Paxos.