Chapter 5: Consistent Hashing

To better understand consistent hashing we need to first look at the problem it solves.

The Problem/Scenario

More specifically, let’s look at what would we do if we didn’t do consistent hashing:

In distributed systems, where we scale horizontally, meaning we have multiple servers running (most of the time each having it’s own database), we need to find a way to distribute the traffic across all of these servers.

Let’s say our system has 4 servers. A quick solution to distribute the traffic between those 4 nodes is to use modulo. But how?

lets name each server (tho we can use it’s IP address as well) - s0, s1, s2, s3
we take the hash of the server name (for each server)
we % N (where N is the amount of servers, in our case - 4), so:

hash(s0) % 4 = 1 (for example)

hash(s1) % 4 = 0

hash(s2) % 4 = 3

hash(s3) % 4 = 2

great! We know have 4 servers, each having it’s own unique index and we can have data stored and retrieved from each one, consistently.

Well.. what will happen when a server goes down or we need to scale up as our 4 servers can’t handle the traffic anymore? (something very common for systems)

We deduct the servers, so we end up with 3 or we add servers and we end up with 5,6,7.. What happens with the N ☝️ in that case? It changes.

That means our module calculations will change as well. BIG red flag 🚩

This means that all data needs to be redistributed OR we will end up with a lot of cache misses (or data missing in the server we are looking at), because we will now look into the wrong server. Why? Because hash(s0) % 4 won’t be the same as hash(s0) % 7 😳

Conclusion of this approach: