Data Replication

What is data replication?

It is a way of storing data in multiple places and it is a fundamental building block of distributed systems. When data is stored in more than one place, it eliminates having a single point of failure. This increases availability and increases scalability and performance.

A screengrab of the Spongebob Squarepants episode 'Squilliam returns'. It shows many smaller SpongeBobs who are hard at work in Spongebob's brain.

Why is it challenging? This process requires keeping replicas consistent with each other even when failures happen.

How is this implemented?

One example of an algorithm that handles the heavy lifting is called Raft replication. It is a replication protocol that provides the strongest consistency guarantee.

State machine replication

The mechanism that Raft is based on is called state machine replication. Basically the idea is that a single process (leader) broadcasts operation that change its state to the followers (replicas).

In a perfect world, if the followers execute the same sequences of operations that the leader has, then each follower will end up in the same state. The problem is that these processes can fail at any time. The reason the algorithm is complex because fault tolerance needs to be taken into account.

Whenever a process receives an operation it will transition from one state to another. This process is modeled as a state machine. This is a pretty powerful tool to achieve fault tolerance. These processes keep track of the operations that alter state in a log. This log is what allows the state to be kept in sync across processes.

When a leader wants to apply an operation to its state:

First, it appends a new entry for the operation its log but doesn’t apply it to the state yet
Then it sends a request to each follower with the new entry
When the follower gets the request, it appends the entry to its own log (also don’t execute yet) then send a response the leader acknowledging the request
When the leader finds out that the majority of followers have received the message, then it finally executes the operation locally

Handling failures

A screengrab of the Spongebob Squarepants episode 'Squilliam returns'. It shows many smaller SpongeBobs that are surrounded by fire and chaos in Spongebob's brain.

What happens if the leader fails? A follower is elected as the new leader through an election process. A follower with an out of date log cannot win the election.

What happens if the update entry request can’t be delivered? The leader will retry sending it until the majority of the followers have successfully appended it to their logs.

Consistency models

We now know how Raft replicates the leader’s state to its followers. All operations that modify state have to go through the leader. What about reads? Reads can be done by the leader or follower. This increases throughput since the reads are not limited to a single process. The problem though is now that two different clients can potentially have a different view of the system’s state.

There is a tradeoff between how consistent the data is and the system’s performance and availabilty. Consistency models help us to define the possible views the observers have of the system state.

Strong consistency: Any read operation returns the latest data. This is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on current write.
Weak consistency: The latest value might night be reflected in subsequent read operations
Eventual consistency: Given enough time, all updates are propagated, and all replicas are consistent.

I’ve only scratched the surface here but you can see that coordination adds complexity because the algorithms need to support failures. If the network was reliable, this would be so much easier.

You probably will never have to reinvent the wheel and implement distributed algorithms like state machine replication but I think it’s important to understand the basics. Abstractions are leaky and it helps to know what’s going on under the hood when issues arise.