In this post, I’m going to focus more on how to go about testing the cluster. While building the Raft implementation, I realised that half the battle with distributed systems is in building a useful test harness. This has become such a problem in the distributed systems space that companies like FoundationDB and TigerBeetle have written something called a deterministic simulation testing (DST) engine. It’s a fancy term for a more advanced form of fuzz testing (at least from what I understand).
The founder of FoundationDB gave a talk about why they built a DST simulator, and he later went on to found Antithesis, whose entire business is built around providing a generalizable DST engine that other companies can use to test their products!
Anyway, back to writing a much simpler test cluster in Raft!
While I was trying to write my Raft implementation, there was a very helpful Twitter reply about mocking away the network and the clock so that you’re essentially testing a deterministic state machine (Raft itself, however, isn’t deterministic).
In the previous post we saw an RPCManager to facilitate communication and a tick method to logically advance the time in the cluster. These methods are defined for each node. The tick function, which I’ve reproduced below, calls an advance_time_by method, which in turn calls clock.advance on a node. Each node has a clock which is injected into it.
struct RaftNode<T: RaftTypeTrait, S: StateMachine<T>, F: RaftFileOps<T>> {
id: ServerId,
state: Mutex<RaftNodeState<T>>,
state_machine: S,
config: ServerConfig,
peers: Vec<ServerId>,
to_node_sender: mpsc::Sender<MessageWrapper<T>>,
from_rpc_receiver: mpsc::Receiver<MessageWrapper<T>>,
rpc_manager: CommunicationLayer<T>,
persistence_manager: F,
clock: RaftNodeClock,
}
impl RaftNodeClock {
fn advance(&mut self, duration: Duration) {
match self {
RaftNodeClock::RealClock(_) => (), // this method should do nothing for a real clock
RaftNodeClock::MockClock(clock) => clock.advance(duration),
}
}
}
fn advance_time_by(&mut self, duration: Duration) {
self.clock.advance(duration);
}
fn tick(&mut self) {
let mut state_guard = self.state.lock().unwrap();
match state_guard.status {
RaftNodeStatus::Leader => {
// the leader should send heartbeats to all followers
self.send_heartbeat(&state_guard);
}
RaftNodeStatus::Candidate | RaftNodeStatus::Follower => {
// the candidate or follower should check if the election timeout has elapsed
self.check_election_timeout(&mut state_guard);
}
}
drop(state_guard);
// listen to incoming messages from the RPC manager
self.listen_for_messages();
self.advance_time_by(Duration::from_millis(1));
}
I defined two variants of a clock - a mock and a real implementation to use for my cluster (another area I found where Rust’s enum pattern matching really shines).
use std::time::{Duration, Instant};
pub trait Clock {
fn now(&self) -> Instant;
}
pub struct RealClock;
impl Clock for RealClock {
fn now(&self) -> Instant {
Instant::now()
}
}
pub struct MockClock {
pub current_time: Instant,
}
impl MockClock {
pub fn advance(&mut self, duration: Duration) {
self.current_time += duration;
}
}
impl Clock for MockClock {
fn now(&self) -> Instant {
self.current_time
}
}
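The RaftNodeClock that gets injected into each node is just an enum wrapping these two implementations. I haven’t reproduced its full definition here, but it looks roughly like this:

enum RaftNodeClock {
    RealClock(RealClock),
    MockClock(MockClock),
}

impl Clock for RaftNodeClock {
    // delegate now() to whichever clock variant the node was constructed with
    fn now(&self) -> Instant {
        match self {
            RaftNodeClock::RealClock(clock) => clock.now(),
            RaftNodeClock::MockClock(clock) => clock.now(),
        }
    }
}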
So any time I was creating a cluster for testing, I would create nodes with a MockClock, and for actual implementations (I never created those) I would use the RealClock. You’ll notice that the RealClock cannot have its time advanced.
Similar to the clocks, for the network layer I created a MockRPCManager which contains a message queue recording what messages were sent by the node in every logical tick of the cluster. When a node wants to send a message, it simply gets pushed into the sent_messages queue. There are also two additional helper methods called get_messages_in_queue and replay_messages_in_queue which allow me to either drain the messages or inspect them without mutating them. The get_messages method drains the messages from a node’s queue (the test cluster then delivers them across the cluster), while replay_messages keeps the messages as they are but allows for inspection.
Note: I picked up these ideas from an implementation on GitHub which you can find here
#[derive(Debug, Clone)]
struct MessageWrapper<T: RaftTypeTrait> {
from_node_id: ServerId,
to_node_id: ServerId,
message: RPCMessage<T>,
}
struct MockRPCManager<T: RaftTypeTrait> {
server_id: ServerId,
to_node_sender: mpsc::Sender<MessageWrapper<T>>,
sent_messages: RefCell<Vec<MessageWrapper<T>>>,
}
impl<T: RaftTypeTrait> MockRPCManager<T> {
fn new(server_id: ServerId, to_node_sender: mpsc::Sender<MessageWrapper<T>>) -> Self {
MockRPCManager {
server_id,
to_node_sender,
sent_messages: RefCell::new(vec![]),
}
}
}
impl<T: RaftTypeTrait> MockRPCManager<T> {
fn start(&self) {}
fn stop(&self) {}
fn send_message(&self, _to_address: String, message: MessageWrapper<T>) {
self.sent_messages.borrow_mut().push(message);
}
fn get_messages_in_queue(&mut self) -> Vec<MessageWrapper<T>> {
let mut mock_messages_vector: Vec<MessageWrapper<T>> = Vec::new();
for message in self.sent_messages.borrow_mut().drain(..) {
mock_messages_vector.push(message.clone());
}
mock_messages_vector
}
fn replay_messages_in_queue(&self) -> Ref<Vec<MessageWrapper<T>>> {
self.sent_messages.borrow()
}
}
The Communication trait defines what methods are available to a node via its network layer, and the CommunicationLayer enum wraps the mock and real implementations.
trait Communication<T: RaftTypeTrait> {
fn start(&self);
fn stop(&self);
fn send_message(&self, to_address: String, message: MessageWrapper<T>);
}
enum CommunicationLayer<T: RaftTypeTrait> {
MockRPCManager(MockRPCManager<T>),
RPCManager(RPCManager<T>),
}
impl<T: RaftTypeTrait> Communication<T> for CommunicationLayer<T> {
fn start(&self) {
match self {
CommunicationLayer::MockRPCManager(manager) => manager.start(),
CommunicationLayer::RPCManager(manager) => manager.start(),
}
}
fn stop(&self) {
match self {
CommunicationLayer::MockRPCManager(manager) => manager.stop(),
CommunicationLayer::RPCManager(manager) => manager.stop(),
}
}
fn send_message(&self, to_address: String, message: MessageWrapper<T>) {
match self {
CommunicationLayer::MockRPCManager(manager) => {
manager.send_message(to_address, message)
}
CommunicationLayer::RPCManager(manager) => manager.send_message(to_address, message),
}
}
}
impl<T: RaftTypeTrait> CommunicationLayer<T> {
fn get_messages(&mut self) -> Vec<MessageWrapper<T>> {
match self {
CommunicationLayer::MockRPCManager(manager) => {
return manager.get_messages_in_queue();
}
CommunicationLayer::RPCManager(_) => {
panic!("This method is not supported for the RPCManager");
}
}
}
fn replay_messages(&self) -> Ref<Vec<MessageWrapper<T>>> {
match self {
CommunicationLayer::MockRPCManager(manager) => {
return manager.replay_messages_in_queue();
}
CommunicationLayer::RPCManager(_) => {
panic!("This method is not supported for the RPCManager");
}
}
}
}
Okay, now that we’ve got the mocking out of the way, let’s jump into defining the test cluster itself.
Here are some basic definitions for my TestCluster.
#[derive(Clone)]
pub struct ClusterConfig {
pub election_timeout: Duration,
pub heartbeat_interval: Duration,
pub ports: Vec<u64>,
}
pub struct TestCluster {
pub nodes: Vec<RaftNode<i32, KeyValueStore<i32>, DirectFileOpsWriter>>,
pub nodes_map:
BTreeMap<ServerId, RaftNode<i32, KeyValueStore<i32>, DirectFileOpsWriter>>,
pub message_queue: Vec<MessageWrapper<i32>>,
pub connectivity: HashMap<ServerId, HashSet<ServerId>>,
pub config: ClusterConfig,
}
I have a bunch of helper methods on this TestCluster, but it would be too much to go into all of them, so I’ll just go over the tick method and the time-based methods here. If you want to see all the methods, check out the repo here.
The tick method on the cluster calls the get_messages method we saw earlier for each node. It collects all these messages in a central queue and then dispatches them in the same logical tick. But it dispatches the messages only after each node has gone through its own tick function. There’s no special reason for the ordering of events here, it’s just how I decided to do it.
impl TestCluster {
pub fn tick(&mut self) {
// Collect all messages from nodes and store them in the central queue
self.nodes.iter_mut().for_each(|node| {
let mut messages_from_node = node.rpc_manager.get_messages();
self.message_queue.append(&mut messages_from_node);
});
// allow each node to tick
self.nodes.iter_mut().for_each(|node| {
node.tick();
});
// deliver all messages from the central queue
self.message_queue
.drain(..)
.for_each(|message| {
let node = self
.nodes
.iter()
.find(|node| node.id == message.to_node_id)
.unwrap();
// check if these pair of nodes are partitioned
if self
.connectivity
.get_mut(&message.from_node_id)
.unwrap()
.contains(&message.to_node_id)
{
match node.to_node_sender.send(message) {
Ok(_) => (),
Err(e) => {
panic!(
"There was an error while sending the message to the node: {}",
e
)
}
};
}
});
}
The methods below allow me to advance logical time in my cluster for a bunch of different configurations. Either all nodes can advance by the same amount, a single node can advance by some amount, or all nodes can advance by some variable amount.
pub fn tick_by(&mut self, tick_interval: u64) {
for _ in 0..tick_interval {
self.tick();
}
}
pub fn advance_time_by_for_node(&mut self, node_id: ServerId, duration: Duration) {
let node = self
.nodes
.iter_mut()
.find(|node| node.id == node_id)
.expect(&format!("Could not find node with id: {}", node_id));
node.advance_time_by(duration);
}
pub fn advance_time_by_variably(&mut self, duration: Duration) {
// for each node in the cluster, advance its mock clock by the duration + some random variation
for node in &mut self.nodes {
let jitter = rand::thread_rng().gen_range(0..50);
let new_duration = duration + Duration::from_millis(jitter);
node.advance_time_by(new_duration);
}
}
pub fn advance_time_by(&mut self, duration: Duration) {
// for each node in the cluster, advance its mock clock by the duration
for node in &mut self.nodes {
node.advance_time_by(duration);
}
}
There are two more methods I found super helpful while doing the testing, which are below - the partition and heal_partition methods. These allow me to simulate a network partition by stopping the flow of messages from one subset of nodes to another, and later fixing that.
/// Separates the cluster into two smaller clusters where only the nodes within
/// each cluster can communicate between themselves
pub fn partition(&mut self, group1: &[ServerId], group2: &[ServerId]) {
for &node_id in group1 {
self.connectivity
.get_mut(&node_id)
.unwrap()
.retain(|&id| group1.contains(&id));
}
for &node_id in group2 {
self.connectivity
.get_mut(&node_id)
.unwrap()
.retain(|&id| group2.contains(&id));
}
}
/// Removes any partition in the cluster and restores full connectivity among all nodes
pub fn heal_partition(&mut self) {
let all_node_ids = self
.nodes
.iter()
.map(|node| node.id)
.collect::<HashSet<ServerId>>();
for node_id in all_node_ids.iter() {
self.connectivity.insert(*node_id, all_node_ids.clone());
}
}
And here’s what the code for creating, starting & stopping a cluster looks like.
pub fn new(number_of_nodes: u64, config: ClusterConfig) -> Self {
let mut nodes: Vec<RaftNode<i32, KeyValueStore<i32>, DirectFileOpsWriter>> =
Vec::new();
let nodes_map: BTreeMap<
ServerId,
RaftNode<i32, KeyValueStore<i32>, DirectFileOpsWriter>,
> = BTreeMap::new();
let node_ids: Vec<ServerId> = (0..number_of_nodes).collect();
let addresses: Vec<String> = config
.ports
.iter()
.map(|port| format!("127.0.0.1:{}", port))
.collect();
let mut id_to_address_mapping: HashMap<ServerId, String> = HashMap::new();
for (node_id, address) in node_ids.iter().zip(addresses.iter()) {
id_to_address_mapping.insert(*node_id, address.clone());
}
let mut counter = 0;
for (node_id, address) in node_ids.iter().zip(addresses.iter()) {
let server_config = ServerConfig {
election_timeout: config.election_timeout,
heartbeat_interval: config.heartbeat_interval,
address: address.clone(),
port: config.ports[counter],
cluster_nodes: node_ids.clone(),
id_to_address_mapping: id_to_address_mapping.clone(),
};
let state_machine = KeyValueStore::<i32>::new();
let persistence_manager = DirectFileOpsWriter::new("data", *node_id).unwrap();
let (to_node_sender, from_rpc_receiver) =
mpsc::channel::<MessageWrapper<i32>>();
let rpc_manager = CommunicationLayer::MockRPCManager(MockRPCManager::new(
*node_id,
to_node_sender.clone(),
));
let mock_clock = RaftNodeClock::MockClock(MockClock {
current_time: Instant::now(),
});
let node = RaftNode::new(
*node_id,
state_machine,
server_config,
node_ids.clone(),
persistence_manager,
rpc_manager,
to_node_sender,
from_rpc_receiver,
mock_clock,
);
nodes.push(node);
counter += 1;
}
let message_queue = Vec::new();
let mut connectivity_hm: HashMap<ServerId, HashSet<ServerId>> = HashMap::new();
for node_id in &node_ids {
connectivity_hm.insert(*node_id, node_ids.clone().into_iter().collect());
}
TestCluster {
nodes,
nodes_map,
message_queue,
connectivity: connectivity_hm,
config,
}
}
pub fn start(&mut self) {
for node in &mut self.nodes {
node.start();
}
}
pub fn stop(&self) {
for node in &self.nodes {
node.stop();
}
}
Now that we have a reasonable test harness set up, I’m going to dive into some specific test scenarios to explain how I use the test harness to move the cluster into a specific configuration and then test out certain scenarios.
Note: After spending time doing this, I understand why people prefer property-based testing or fuzz testing as a methodology. Creating scenario-based tests is a very time-consuming process and doesn’t give you enough coverage to justify the time spent on it.
My tests aren’t exhaustive (and might even be buggy) and I genuinely don’t think scenario-based testing is the way to get exhaustive coverage anyway. I have more tests in the repo if you want to look at those.
We’ll start with some simple tests for a single node cluster.
const ELECTION_TIMEOUT: Duration = Duration::from_millis(150);
const HEARTBEAT_INTERVAL: Duration = Duration::from_millis(50);
const MAX_TICKS: u64 = 100;
/// This test checks whether a node in a single cluster will become leader as soon as the election timeout is reached
#[test]
fn leader_election() {
let _ = env_logger::builder().is_test(true).try_init();
// create a cluster with a single node first
let cluster_config = ClusterConfig {
election_timeout: ELECTION_TIMEOUT,
heartbeat_interval: HEARTBEAT_INTERVAL,
ports: vec![8000],
};
let mut cluster = TestCluster::new(1, cluster_config);
cluster.start();
cluster.advance_time_by(ELECTION_TIMEOUT + Duration::from_millis(100 + 5)); // picking 255 here because 150 + a max jitter of 100 guarantees that election has timed out
cluster.wait_for_stable_leader(MAX_TICKS);
cluster.stop();
assert_eq!(cluster.has_leader(), true);
}
You can see that setting up the cluster earlier makes testing significantly easier here. I just call my helper methods on the cluster to move the cluster into a specific state and then assert whatever conditions I want against the cluster.
As an aside, you could convert this to a property-based test using the proptest crate. You could vary the number of nodes in the cluster to check that a leader gets elected regardless of cluster size. Something like the code below:
extern crate proptest;
use proptest::prelude::*;
use std::time::Duration;
proptest! {
#![proptest_config(ProptestConfig::with_cases(10))]
fn leader_election_property_based_test(node_count in 1usize..5) {
let _ = env_logger::builder().is_test(true).try_init();
let cluster_config = ClusterConfig {
election_timeout: ELECTION_TIMEOUT,
heartbeat_interval: HEARTBEAT_INTERVAL,
ports: (8000..8000 + node_count as u64).collect(),
};
let mut cluster = TestCluster::new(node_count as u64, cluster_config);
cluster.start();
cluster.advance_time_by(ELECTION_TIMEOUT + Duration::from_millis(100 + 5));
cluster.wait_for_stable_leader(MAX_TICKS * node_count as u64); // Adjusted wait time
cluster.stop();
assert_eq!(cluster.has_leader(), true);
}
}
Here’s a slightly more complicated test with 3 nodes that simulates a network partition in the cluster and asserts that a leader still gets elected in the majority partitioned cluster.
/// This test models the scenario where the leader in a cluster is network partitioned from the
/// rest of the cluster and a new leader is elected
#[test]
fn network_partition_new_leader() {
let _ = env_logger::builder().is_test(true).try_init();
let cluster_config = ClusterConfig {
election_timeout: ELECTION_TIMEOUT,
heartbeat_interval: HEARTBEAT_INTERVAL,
ports: vec![8000, 8001, 8002],
};
let mut cluster = TestCluster::new(3, cluster_config);
cluster.start();
cluster.advance_time_by_for_node(0, ELECTION_TIMEOUT + Duration::from_millis(50));
cluster.wait_for_stable_leader(MAX_TICKS);
// partition the leader from the rest of the group
let group1 = &[cluster.get_leader().unwrap().id];
let group2 = &cluster
.get_all_followers()
.iter()
.map(|node| node.id)
.collect::<Vec<ServerId>>();
cluster.partition(group1, group2);
cluster.advance_time_by_for_node(1, ELECTION_TIMEOUT + Duration::from_millis(100));
cluster.wait_for_stable_leader_partition(MAX_TICKS, group2);
cluster.stop();
assert_eq!(cluster.has_leader_in_partition(group2), true);
}
Again, you can see how useful the helper methods turn out to be because it’s much easier to separate one node from the rest by simulating a network partition and then checking certain properties on the individual partitions.
Now, here’s a much more complicated scenario-based test which models a network partition occurring between the leader and the rest of the cluster, the network partition healing and the old leader rejoining the rest of the cluster and having to catch up.
/// The test models the following scenario
/// 1. A leader is elected
/// 2. Client requests are received and replicated
/// 3. A partition occurs and a new leader is elected
/// 4. Clients requests are processed by the new leader
/// 5. The partition heals and the old leader rejoins the cluster
/// 6. The old leader must recognize it's now a follower and get caught up with the new leader
#[test]
fn network_partition_log_healing() {
let _ = env_logger::builder().is_test(true).try_init();
let cluster_config = ClusterConfig {
election_timeout: ELECTION_TIMEOUT,
heartbeat_interval: HEARTBEAT_INTERVAL,
ports: vec![8000, 8001, 8002],
};
let mut cluster = TestCluster::new(3, cluster_config);
// cluster starts and a leader is elected
cluster.start();
cluster.advance_time_by_for_node(0, ELECTION_TIMEOUT + Duration::from_millis(50));
cluster.wait_for_stable_leader(MAX_TICKS);
assert_eq!(cluster.has_leader(), true);
// client requests are received and replicated across cluster
let current_leader_term = cluster
.get_leader()
.unwrap()
.state
.lock()
.unwrap()
.current_term;
let mut last_log_index = cluster
.get_leader()
.unwrap()
.state
.lock()
.unwrap()
.log
.last()
.map_or(0, |entry| entry.index);
last_log_index += 1;
let log_entry_1 = LogEntry {
term: current_leader_term,
index: last_log_index,
command: LogEntryCommand::Set,
key: "a".to_string(),
value: 1,
};
last_log_index += 1;
let log_entry_2 = LogEntry {
term: current_leader_term,
index: last_log_index,
command: LogEntryCommand::Set,
key: "b".to_string(),
value: 2,
};
cluster.apply_entries_across_cluster(vec![&log_entry_1, &log_entry_2], MAX_TICKS);
cluster.verify_logs_across_cluster_for(vec![&log_entry_1, &log_entry_2], MAX_TICKS);
// partition and new leader election here
let group1 = &[cluster.get_leader().unwrap().id];
let group2 = &cluster
.get_all_followers()
.iter()
.map(|node| node.id)
.collect::<Vec<ServerId>>();
cluster.partition(group1, group2);
cluster.advance_time_by_for_node(1, ELECTION_TIMEOUT + Duration::from_millis(100));
cluster.wait_for_stable_leader_partition(MAX_TICKS, group2);
assert_eq!(cluster.has_leader_in_partition(group2), true);
// send requests to group2 leader
let log_entry_3 = {
let current_leader_term = cluster
.get_leader_in_cluster(group2)
.unwrap()
.state
.lock()
.unwrap()
.current_term;
let mut last_log_index = cluster
.get_leader()
.unwrap()
.state
.lock()
.unwrap()
.log
.last()
.map_or(0, |entry| entry.index);
last_log_index += 1;
let log_entry_3 = LogEntry {
term: current_leader_term,
index: last_log_index,
command: LogEntryCommand::Set,
key: "c".to_string(),
value: 3,
};
cluster.apply_entries_across_cluster_partition(
vec![&log_entry_3],
group2,
MAX_TICKS,
);
log_entry_3
};
// partition heals and old leader rejoins the cluster
// cluster needs to be verified to ensure all logs are up to date
let leader_id = cluster.get_leader_in_cluster(group2).unwrap().id;
cluster.heal_partition();
cluster.tick_by(MAX_TICKS);
cluster.wait_for_stable_leader(MAX_TICKS);
assert_eq!(cluster.get_leader().unwrap().id, leader_id);
cluster.verify_logs_across_cluster_for(
vec![&log_entry_1, &log_entry_2, &log_entry_3],
MAX_TICKS,
);
}
I think this is one of those situations where property-based testing would really shine not because it would make setting up the test simpler but because it would allow me to use the same test case and test the cluster under different scenarios.
For instance, I could use proptest to generate clusters of different sizes, generate different variations of partitions, and add a different number of log entries based on some generated value. This would give me more coverage.
So, in a sense, you could almost think of a single property test as combining multiple scenario-based tests.
I mentioned earlier that I’ve been digging into DST a little bit, and with the help of some folks online, I’ve come to understand that DST is essentially about using a random seed value to generate property-based tests while maintaining the ability to re-create a specific scenario using nothing but the seed value.
If we take the last test case we were discussing - generating clusters of different sizes, different variations of partitions and adding a different number of log entries - imagine if we did all of that with a randomly generated number which we can call a seed value.
This seed value would then be used to generate all the other values in a deterministic way. For instance, the number of nodes in the cluster could be the seed value itself, the number of log entries written could be seed_value * 2, and the partition configuration could be defined by (seed_value % num_of_nodes) - 1. This way, if the test case fails and the seed value is logged, I can re-create the test scenario using nothing but the seed value itself.
Since the seed value changes on every test run, I can test the system under a variety of different scenarios using property-based testing. And in the case of a failure in a specific scenario, I am able to immediately re-create the scenario to debug the error.
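Here’s a tiny sketch of that idea (purely illustrative - none of this is in the repo, and I’m deriving the parameters from a seeded RNG rather than doing arithmetic on the seed directly, but the principle is the same): everything about the scenario is a deterministic function of the seed, so logging the seed is enough to replay the exact run.

use rand::{rngs::StdRng, Rng, SeedableRng};

// Purely illustrative: derive every test parameter from a single seed so that a
// failing run can be reproduced by re-running with the logged seed value.
fn run_seeded_scenario(seed: u64) {
    let mut rng = StdRng::seed_from_u64(seed);

    // all of these are deterministic functions of the seed
    let num_of_nodes = rng.gen_range(3..=7);
    let num_of_log_entries = rng.gen_range(1..=10);
    let partition_at = rng.gen_range(1..num_of_nodes);

    println!(
        "seed={} nodes={} entries={} partition_at={}",
        seed, num_of_nodes, num_of_log_entries, partition_at
    );

    // ... build a TestCluster with num_of_nodes, append num_of_log_entries
    // entries, partition the nodes at partition_at and assert whatever
    // properties should hold ...
}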
I was under the impression that DST was a magic pill, in the sense that if you wrote code that passed the simulator, you’d have bug-free code. This is obviously incorrect. You’ve only reduced the probability of bugs in production by a significant factor (hopefully). In the video I linked above, Will Wilson explains how they considered writing a second DST simulator at FoundationDB because the engineers were getting really good at writing code that passed the simulator but still had bugs in it.
So there you have it. Writing a test harness for a distributed system is significantly more challenging than writing the system itself I think. I can see now why FoundationDB spent years building their simulator before writing their engine, although I don’t know if that’s a realistic approach for most people.
Still, it’s fun to dig into this stuff.
Before starting, if you just want to look at the full code, go here
Raft is a famous distributed consensus algorithm which is now implemented in various database engines. Probably the most famous is YugabyteDB, a distributed SQL database.
First, we need to discuss what a distributed SQL database is and why we need distributed consensus for it. There are two classic concepts that come to mind when thinking of a distributed database - replication and sharding. Typically, when you think of distributed consensus, it’s for a replicated database. That is, each copy of the database is identical.
Now, imagine that we have a distributed, replicated SQL database with 3 nodes. There are two ways to accept a write - any node can accept a write, the way Amazon’s Dynamo does in its famous 2007 paper, or a single node is elected leader, accepts the write and replicates it to the followers. It’s the second case that a protocol like Raft was designed for.
The Raft paper was published in 2014, and you can read it here. It’s probably one of the more understandable papers out there and can be grokked pretty easily. Accompanying the paper are two resources - the great Raft website and the even better Raft visualisation. The visualisation is seriously incredible!
The most important part of the paper is a single page where the RPCs are described in detail. This page is dense with information and I spent weeks poring over it, trying to understand what was going on.
Another super valuable resource is the TLA+ spec for Raft. This makes a world of difference if you’re trying to implement Raft.
If you’ve heard of Paxos, Raft is a simpler form of that. Let’s dive into the code!
Note: Before diving into the code, I want to emphasize that I do not have a formally correct specification because I haven’t run it against a formal test suite. I only have my own tests that I’ve run it against. Also, this was my first time writing anything significant in Rust, so excuse any poor code habits.
First, we define the basic structure for our node
struct RaftNodeState<T: Serialize + DeserializeOwned + Clone> {
// persistent state on all servers
current_term: Term,
voted_for: Option<ServerId>,
log: Vec<LogEntry<T>>,
// volatile state on all servers
commit_index: u64,
last_applied: u64,
status: RaftNodeStatus,
// volatile state for leaders
next_index: Vec<u64>,
match_index: Vec<u64>,
// election variables
votes_received: HashMap<ServerId, bool>,
election_timeout: Duration,
last_heartbeat: Instant,
}
struct RaftNode<T: RaftTypeTrait, S: StateMachine<T>, F: RaftFileOps<T>> {
id: ServerId,
state: Mutex<RaftNodeState<T>>,
state_machine: S,
config: ServerConfig,
peers: Vec<ServerId>,
to_node_sender: mpsc::Sender<MessageWrapper<T>>,
from_rpc_receiver: mpsc::Receiver<MessageWrapper<T>>,
rpc_manager: CommunicationLayer<T>,
persistence_manager: F,
clock: RaftNodeClock,
}
Each Raft node has its own state that it needs to maintain. I’ve segregated this state based on whether it’s volatile, non-volatile (must be persisted) or required only for a specific node state. The state is behind a mutex to allow access to it from multiple threads. Typically, you have the node listening for messages on a separate thread and invoking whatever RPC handler is required, and you have the main thread of the node doing whatever operations it needs to do. In my test implementations I run everything off a single thread rather than putting the node behind an Arc<Mutex> interface.
Each node also has a state machine attached to it. This is a user-provided state machine - it could be a database or a key-value store etc. There needs to be one method on the state machine called apply() (or some variant) that the cluster calls to update the state machine.
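A sketch of what such a trait might look like (the actual trait in the repo may differ in its exact shape):

// Sketch only: the cluster hands committed log entries to the user-provided
// state machine through apply (or some variant of it).
trait StateMachine<T: Clone> {
    fn apply(&mut self, entry: &LogEntry<T>);
}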
Now, nodes communicate with each other via JSON-over-TCP in my implementation. So, we need to set up the communication channel for them. I did this by defining an RPC manager which is then provided to each node.
struct RPCManager<T: RaftTypeTrait> {
server_id: ServerId,
server_address: String,
port: u64,
to_node_sender: mpsc::Sender<RPCMessage<T>>,
is_running: Arc<AtomicBool>,
}
The RPC manager has a few different methods defined on it - a constructor, a method to start, a method to stop, and a function called send_message that allows a node to tell the manager it wants to send a message to another node.
All messages between nodes are serialized to and deserialized from JSON using the serde crate. (Note: this obviously isn’t a practical implementation, because typically you’d use your own protocol to avoid the overhead.)
Here’s an implementation of what the send_message function looks like:
/**
* This method is called from the raft node to allow it to communicate with other nodes
* via the RPC manager.
*/
fn send_message(&self, to_address: String, message: MessageWrapper<T>) {
info!(
"Sending a message from server {} to server {} and the message is {:?}",
message.from_node_id, message.to_node_id, message.message
);
let mut stream = match TcpStream::connect(to_address) {
Ok(stream) => stream,
Err(e) => {
panic!(
"There was an error while connecting to the server {} {}",
message.to_node_id, e
);
}
};
let serialized_request = match serde_json::to_string(&message.message) {
Ok(serialized_message) => serialized_message,
Err(e) => {
panic!("There was an error while serializing the message {}", e);
}
};
if let Err(e) = stream.write_all(serialized_request.as_bytes()) {
panic!("There was an error while sending the message {}", e);
}
}
Communication between the node and the RPC manager is handled via an MPSC channel in Rust. The receiver stays with the node and the transmitter is with the RPC manager.
Next, let’s dive into the functionality of a Raft cluster. Rather than focusing on each individual RPC function, I’ll go through the RPCs from the perspective of leader election and appending entries to the cluster.
A quick aside before jumping into the RPCs themselves: we first need to discuss what a log entry looks like. Each log entry has the following structure. The most important fields here are term (explained below) and index (a 1-based index).
The LogEntryCommand represents state machine related terminology. So here, it can be either Set or Delete.
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
enum LogEntryCommand {
Set = 0,
Delete = 1,
}
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
struct LogEntry<T: Clone> {
term: Term,
index: u64,
command: LogEntryCommand,
key: String,
value: T,
}
Leader election is a core part of the Raft protocol. It allows a leader to be elected for each term, or epoch, of a Raft cluster. A term is a time period during which one leader reigns. When a new node wants to be elected leader, it needs to increment the term and ask for votes. So, a term is something akin to a logical clock.
There are 3 possible states every node can be in - a follower, a candidate or a leader. A candidate is just an intermediate state between a follower and a leader. Typically, nodes don’t stay in this state for very long.
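The status enum that the code snippets match on is just these three variants (roughly; the exact derives in the repo may differ):

#[derive(Debug, Clone, Copy, PartialEq)]
enum RaftNodeStatus {
    Leader,
    Candidate,
    Follower,
}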
There is also the concept of 2 timeouts in Raft - the election timeout and the heartbeat timeout. When these timeouts occur, there are certain state transitions that must occur. The image below is a good representation of that.
As an aside, if you’re curious why timeouts are required, look at the FLP section of this page. The FLP impossibility result comes from a famous paper from 1985. A related (and even older) problem is the Two Generals problem, which is fun to learn about.
Next, we need something to represent what a node is supposed to do as time passes. We’ll call this a tick function. The tick function is something that the Raft cluster runs every logical second. I say logical second because in a test cluster you typically want to control the passage of time to be able to create certain test scenarios.
fn tick(&mut self) {
let mut state_guard = self.state.lock().unwrap();
match state_guard.status {
RaftNodeStatus::Leader => {
// the leader should send heartbeats to all followers
self.send_heartbeat(&state_guard);
}
RaftNodeStatus::Candidate | RaftNodeStatus::Follower => {
// the candidate or follower should check if the election timeout has elapsed
self.check_election_timeout(&mut state_guard);
}
}
drop(state_guard);
// listen to incoming messages from the RPC manager
self.listen_for_messages();
self.advance_time_by(Duration::from_millis(1));
}
Here, if a node is a candidate or a follower, it checks whether the current election timeout has elapsed. If it has, it starts an election (or restarts one, if it’s a candidate).
If the node is a leader, however, it sends out something called a heartbeat. This is to ensure that the follower nodes don’t believe the leader has died and try to start a new election.
Let’s look at the RPCs associated with leader election.
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
struct VoteRequest {
request_id: u16,
term: Term,
candidate_id: ServerId,
last_log_index: u64,
last_log_term: Term,
}
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
struct VoteResponse {
request_id: u16,
term: Term,
vote_granted: bool,
candidate_id: ServerId
}
For now, focus on the fields term, last_log_index and last_log_term. These are the fields that a node will use to determine whether its vote can be granted or not. The node that is starting an election will send out messages to the other nodes containing its term and the index and term of its last log entry.
Imagine 2 nodes A and B where A is requesting a vote from B. If B has a higher term, this means that B is further ahead in the cluster’s time period and therefore it cannot accept A as a leader.
Furthermore, if B’s last log entry has a higher term than the one sent by A, or the same term but a higher index, the vote is rejected - B’s log is more up to date than A’s. Here’s the relevant part of the code
fn handle_vote_request(&mut self, request: &VoteRequest) -> VoteResponse {
// Code removed for brevity
// this node has already voted for someone else in this term and it is not the requesting node
if state_guard.voted_for != None && state_guard.voted_for != Some(request.candidate_id) {
debug!("EXIT: handle_vote_request on node {}. Not granting the vote because the node has already voted for someone else", self.id);
return VoteResponse {
request_id: request.request_id,
term: state_guard.current_term,
vote_granted: false,
candidate_id: self.id,
};
}
let last_log_term = state_guard.log.last().map_or(0, |entry| entry.term);
let last_log_index = state_guard.log.last().map_or(0, |entry| entry.index);
let log_check = request.last_log_term > last_log_term
|| (request.last_log_term == last_log_term && request.last_log_index >= last_log_index);
if !log_check {
return VoteResponse {
request_id: request.request_id,
term: state_guard.current_term,
vote_granted: false,
candidate_id: self.id,
};
}
// Code removed for brevity
}
Assuming that node B has not granted its vote for anyone in the current election cycle and the log check passes, it can return a successful vote response to A.
Now, let’s look at how vote responses are handled on the requesting node.
/// The quorum size is (N/2) + 1, where N = number of servers in the cluster and N is odd
fn quorum_size(&self) -> usize {
(self.config.cluster_nodes.len() / 2) + 1
}
fn can_become_leader(&self, state_guard: &MutexGuard<'_, RaftNodeState<T>>) -> bool {
let sum_of_votes_received = state_guard
.votes_received
.iter()
.filter(|(_, vote_granted)| **vote_granted)
.count();
let quorum = self.quorum_size();
sum_of_votes_received >= quorum
}
fn handle_vote_response(&mut self, vote_response: VoteResponse) {
let mut state_guard = self.state.lock().unwrap();
if state_guard.status != RaftNodeStatus::Candidate {
return;
}
if !vote_response.vote_granted && vote_response.term > state_guard.current_term {
// the term is different, update term, downgrade to follower and persist to local storage
state_guard.current_term = vote_response.term;
state_guard.status = RaftNodeStatus::Follower;
match self
.persistence_manager
.write_term_and_voted_for(state_guard.current_term, Option::None)
{
Ok(()) => (),
Err(e) => {
error!("There was a problem writing the term and voted_for variables to stable storage for node {}: {}", self.id, e);
}
}
return;
}
// the vote wasn't granted, the reason is unknown
if !vote_response.vote_granted {
return;
}
// the vote was granted - check if node achieved quorum
state_guard
.votes_received
.insert(vote_response.candidate_id, vote_response.vote_granted);
if self.can_become_leader(&state_guard) {
self.become_leader(&mut state_guard);
}
}
This part of the code shows a node receiving a vote and checking whether the number of votes it has received so far has reached quorum. A quorum is a majority of the nodes, which is why you typically run an odd number of nodes in a Raft cluster - an even-sized cluster still works (the quorum is still (N/2) + 1), but it tolerates no more failures than a cluster with one fewer node, so the extra node doesn’t buy you anything.
Now, once a leader has been elected it needs to ensure it stays the leader, and it does so via a mechanism called heartbeats. Essentially, it sends a network request to every follower once every x milliseconds to ensure that the followers don’t start a new election. Here’s the implementation.
fn send_heartbeat(&self, state_guard: &MutexGuard<'_, RaftNodeState<T>>) {
if state_guard.status != RaftNodeStatus::Leader {
return;
}
for peer in &self.peers {
if *peer == self.id {
// ignore self referencing node
continue;
}
let heartbeat_request = AppendEntriesRequest {
request_id: rand::thread_rng().gen(),
term: state_guard.current_term,
leader_id: self.id,
prev_log_index: state_guard.log.last().map_or(0, |entry| entry.index),
prev_log_term: state_guard.log.last().map_or(0, |entry| entry.term),
entries: Vec::<LogEntry<T>>::new(),
leader_commit_index: state_guard.commit_index,
};
let message = RPCMessage::<T>::AppendEntriesRequest(heartbeat_request);
let to_address = self.config.id_to_address_mapping.get(peer).expect(
format!("Cannot find the id to address mapping for peer id {}", peer).as_str(),
);
let message_wrapper = MessageWrapper {
from_node_id: self.id,
to_node_id: *peer,
message,
};
self.rpc_manager
.send_message(to_address.clone(), message_wrapper);
}
}
The heartbeat request is just an AppendEntries request with an empty entries vector (more on that below).
Now, an interesting question: how do we decide what the value of x above should be? The Raft paper mentions that the time it takes for a leader to broadcast its heartbeats to all followers should be an order of magnitude less than the election timeout. The following excerpt is from the paper:
Leader election is the aspect of Raft where timing is most critical. Raft will be able to elect and maintain a steady leader as long as the system satisfies the following timing requirement: broadcastTime ≪ electionTimeout ≪ MTBF In this inequality broadcastTime is the average time it takes a server to send RPCs in parallel to every server in the cluster and receive their responses; electionTimeout is the election timeout described in Section 5.2; and MTBF is the average time between failures for a single server. The broadcast time should be an order of magnitude less than the election timeout so that leaders can reliably send the heartbeat messages required to keep followers from starting elections
So, this leads me to conclude that the heartbeat interval must be an experimental value based on the individual configuration of the cluster you are running. For my setup, I used a heartbeat interval of 50ms. Because I am running a test cluster on a single machine, this value just needs to be long enough to allow nodes to send RPC messages to each other over TCP sockets on the loopback interface.
Now, we come to the more challenging part of the protocol - log replication. First, we’ll tackle log replication on easy mode, which assumes no failures.
The happy path of log replication is fairly simple to understand. Let’s get the basic structures out of the way first. This is what the request and response look like
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
struct AppendEntriesRequest<T: Clone> {
request_id: u16,
term: Term,
leader_id: ServerId,
prev_log_index: u64,
prev_log_term: Term,
entries: Vec<LogEntry<T>>,
leader_commit_index: u64,
}
#[derive(Serialize, Deserialize, Debug, Clone, PartialEq)]
struct AppendEntriesResponse {
request_id: u16,
server_id: ServerId,
term: Term,
success: bool,
match_index: u64,
}
There are a couple of new fields here which are important - leader_commit_index in the request and match_index in the response. Trying to accurately parse what these variables do took me days, but it’s actually quite simple. To explain it properly, let me first describe the state a leader maintains.
On election, a leader initialises the following variables in its state - next_index and match_index, both of which are u64 vectors maintaining an index value for each follower. The next_index represents the next log entry the leader is going to try to replicate to a follower, and match_index represents the log entry up to which the leader knows that follower has replicated its log. So, next_index must stay ahead of match_index - that’s one of the invariants of the Raft state machine.
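A minimal sketch of what that initialisation might look like (the actual become_leader in the repo likely does more, such as sending out an immediate heartbeat):

// Sketch only: a new leader optimistically assumes every follower is missing
// nothing (next_index = just past its own last entry) and pessimistically
// assumes nothing is replicated yet (match_index = 0).
fn become_leader(&self, state_guard: &mut MutexGuard<'_, RaftNodeState<T>>) {
    state_guard.status = RaftNodeStatus::Leader;
    let last_log_index = state_guard.log.last().map_or(0, |entry| entry.index);
    state_guard.next_index = vec![last_log_index + 1; self.config.cluster_nodes.len()];
    state_guard.match_index = vec![0; self.config.cluster_nodes.len()];
}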
Now, when a leader sends out an AppendEntries RPC to each follower, the follower responds with a match_index indicating up to where it has replicated the log. The leader uses these match_index values to work out the highest log index that a quorum of the cluster has replicated - that becomes its new commit index, and it can apply everything up to that point to its own state machine. This commit index is what goes out as leader_commit_index in subsequent requests.
With every request the leader sends out, it includes this leader_commit_index so that followers know which entries they can commit to their own state machines. This matters because once a log entry has been committed to a state machine it can never be revoked - an important safety property of Raft.
Okay, let’s look at some code. First, we’ll look at the easier case of a leader node getting a successful response back from a follower.
When a response comes in from a follower, if the response was successful, the leader checks the match_index the follower sent and updates its own state with that value. It then checks the highest index that a quorum of followers has replicated; it is allowed to commit up to this value to its own state machine and update its own commit index.
fn handle_append_entries_response(&mut self, response: AppendEntriesResponse) {
let mut state_guard = self.state.lock().unwrap();
if state_guard.status != RaftNodeStatus::Leader {
return;
}
let server_index = response.server_id as usize;
if !response.success {
// reduce the next index for that server and try again
state_guard.next_index[server_index] = state_guard.next_index[server_index]
.saturating_sub(1)
.max(1);
self.retry_append_request(&mut state_guard, server_index, response.server_id);
return;
}
// the response is successful
state_guard.next_index[(response.server_id) as usize] = response.match_index + 1;
state_guard.match_index[(response.server_id) as usize] = response.match_index;
self.advance_commit_index(&mut state_guard);
self.apply_entries(&mut state_guard);
}
fn advance_commit_index(&self, state_guard: &mut MutexGuard<'_, RaftNodeState<T>>) {
assert!(state_guard.status == RaftNodeStatus::Leader);
// find all match indexes that have quorum
let mut match_index_count: HashMap<u64, u64> = HashMap::new();
for &server_match_index in &state_guard.match_index {
*match_index_count.entry(server_match_index).or_insert(0) += 1;
}
let new_commit_index = match_index_count
.iter()
.filter(|(&match_index, &count)| {
count >= self.quorum_size().try_into().unwrap()
&& state_guard
.log
.get((match_index as usize).saturating_sub(1))
.map_or(false, |entry| entry.term == state_guard.current_term)
})
.map(|(&match_index, _)| match_index)
.max();
if let Some(max_index) = new_commit_index {
state_guard.commit_index = max_index;
}
}
If the append entry request fails, the leader checks to see if there is an earlier log entry that the follower might accept and retries the append entry request with that log entry.
Next, let’s look at a follower node handling an append entries request from a leader.
fn handle_append_entries_request(
&mut self,
request: &AppendEntriesRequest<T>,
) -> AppendEntriesResponse {
let mut state_guard = self.state.lock().unwrap();
// the term check
if request.term < state_guard.current_term {
return AppendEntriesResponse {
request_id: request.request_id,
server_id: self.id,
term: state_guard.current_term,
success: false,
match_index: state_guard.log.last().map_or(0, |entry| entry.index),
};
}
self.reset_election_timeout(&mut state_guard);
// log consistency check - check that the term and the index match at the log entry the leader expects
let log_index_to_check = request.prev_log_index.saturating_sub(1) as usize;
let prev_log_term = state_guard
.log
.get(log_index_to_check)
.map_or(0, |entry| entry.term);
let prev_log_index = state_guard
.log
.get(log_index_to_check)
.map_or(0, |entry| entry.index);
let log_ok = request.prev_log_index == 0
|| (request.prev_log_index > 0
&& request.prev_log_index <= state_guard.log.len() as u64
&& request.prev_log_term == prev_log_term);
// the log check is not OK, give false response
if !log_ok {
return AppendEntriesResponse {
request_id: request.request_id,
server_id: self.id,
term: state_guard.current_term,
success: false,
match_index: state_guard.log.last().map_or(0, |entry| entry.index),
};
}
if request.entries.len() == 0 {
// this is a heartbeat message
state_guard.commit_index = request.leader_commit_index;
self.apply_entries(&mut state_guard);
return AppendEntriesResponse {
request_id: request.request_id,
server_id: self.id,
term: state_guard.current_term,
success: true,
match_index: state_guard.log.last().map_or(0, |entry| entry.index),
};
}
// if there are any subsequent logs on the follower, truncate them and append the new logs
if self.len_as_u64(&state_guard.log) > request.prev_log_index {
state_guard.log.truncate(request.prev_log_index as usize);
state_guard.log.extend_from_slice(&request.entries);
if let Err(e) = self
.persistence_manager
.append_logs_at(&request.entries, request.prev_log_index.saturating_sub(1))
{
error!("There was a problem appending logs to stable storage for node {} at position {}: {}", self.id, request.prev_log_index - 1, e);
}
return AppendEntriesResponse {
request_id: request.request_id,
server_id: self.id,
term: state_guard.current_term,
success: true,
match_index: state_guard.log.last().map_or(0, |entry| entry.index),
};
}
// There are no subsequent logs on the follower, so no possibility of conflicts. Which means the logs can just be appended
// and the response can be returned
state_guard.log.append(&mut request.entries.clone());
if let Err(e) = self.persistence_manager.append_logs(&request.entries) {
error!(
"There was a problem appending logs to stable storage for node {}: {}",
self.id, e
);
}
AppendEntriesResponse {
request_id: request.request_id,
server_id: self.id,
term: state_guard.current_term,
success: true,
match_index: state_guard.log.last().map_or(0, |entry| entry.index),
}
}
The follower first makes basic sanity checks - is the request from an older term? No, okay, continue. Does the log consistency check pass? Okay, continue.
Once the basic sanity checks have passed, the follower then checks if it has any subsequent logs that the leader has not included in the request. This is tricky to grasp, so look at the image below for a moment
The leader for term 8 has a max log index of 10 but it has followers which have a log index that is greater than its own maximum. So, what does the follower do in this case? The simple answer is that as long as the entries have not been committed to the state machine on the follower, they can be replaced which is exactly what the code does! It just removes everything past the point that is available in the leader’s request and replaces it with the leader’s entries. [Note: This is actually incorrect, because this could be an outdated RPC from the leader which might not contain the latest entries. But, that does not mean they should be removed. See Jon Gjengset’s guide to Raft here for more details].
Once all these checks have been taken care of, the follower is free to return a successful response. Once the follower returns a successful response, the logic above this section runs to check if the leader can update its own commit_index and apply entries to its state machine.
This is probably the part of the implementation I did the most inefficient job on because I’m just not very used to working with file system semantics. Also, Rust being a new language to me probably didn’t help.
I hid the file system mechanics behind an API because I was sure that I’d be changing it in the future, which is why you’ll probably notice that it’s outside of the main.rs file. Here’s the interface I used for the file operations
pub trait RaftFileOps<T: Clone + FromStr + Display> {
fn read_term_and_voted_for(&self) -> Result<(Term, ServerId), io::Error>;
fn write_term_and_voted_for(
&self,
term: Term,
voted_for: Option<ServerId>,
) -> Result<(), io::Error>;
fn read_logs(&self, log_index: u64) -> Result<Vec<LogEntry<T>>, io::Error>;
fn append_logs(&mut self, entries: &Vec<LogEntry<T>>) -> Result<(), io::Error>;
fn append_logs_at(
&mut self,
entries: &Vec<LogEntry<T>>,
log_index: u64,
) -> Result<(), io::Error>;
}
pub struct DirectFileOpsWriter {
file_path: String,
file: RefCell<Option<File>>,
}
I use a RefCell for the DirectFileOpsWriter because I didn’t want to deal with Rust’s compiler complaining about multiple mutable borrows when I’m sure that a mutable borrow in these methods will not affect anything on the Raft node itself. I’m still not 100% sure that this is the “correct” approach in Rust. Perhaps I need to re-design my structure itself to avoid using this pattern.
The format I chose to write was a CSV style format with separate lines for each entry. So, it looks something like this
1,2 <- This is the term and voted_for entries. The node voted in term 1 for node 2
1,1,0,1,1 <- The tuple shows (term,log_index,command,key,value)
The first line of the file can be repeatedly modified because in every term the node can vote either for itself or for another node. I suppose a different implementation could maintain a history of which node was voted for in each term, but the Raft spec does not make any assertions about the file format. Some educational implementations I’ve seen don’t even write to or read from storage. In fact, the Raft spec doesn’t include anything about storage, but of course you need to be able to write to and read from storage to make the implementation actually work - otherwise how would you recover from a crash?
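As a rough sketch of the write side of this format (a simplified free function, not the actual append_logs method from the repo), appending entries is just writing one CSV line per entry:

use std::fmt::Display;
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::str::FromStr;

// Sketch only: append each log entry as a "term,log_index,command,key,value" line,
// matching the format described above. The command is written as its integer value.
fn append_logs_sketch<T: Clone + FromStr + Display>(
    file_path: &str,
    entries: &Vec<LogEntry<T>>,
) -> Result<(), io::Error> {
    let mut file = OpenOptions::new().append(true).create(true).open(file_path)?;
    for entry in entries {
        writeln!(
            file,
            "{},{},{},{},{}",
            entry.term,
            entry.index,
            entry.command.clone() as u64,
            entry.key,
            entry.value
        )?;
    }
    Ok(())
}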
Here’s some code around how the node restarts from stable storage.
fn restart(&mut self) {
let mut state_guard = self.state.lock().unwrap();
state_guard.status = RaftNodeStatus::Follower;
state_guard.next_index = vec![1; self.config.cluster_nodes.len()];
state_guard.match_index = vec![0; self.config.cluster_nodes.len()];
state_guard.commit_index = 0;
state_guard.votes_received = HashMap::new();
let (term, voted_for) = match self.persistence_manager.read_term_and_voted_for() {
Ok((t, v)) => (t, v),
Err(e) => {
panic!(
"There was an error while reading the term and voted for {}",
e
);
}
};
state_guard.current_term = term;
state_guard.voted_for = Some(voted_for);
let log_entries = match self.persistence_manager.read_logs(1) {
Ok(entries) => entries,
Err(e) => {
panic!("There was an error while reading the logs {}", e);
}
};
state_guard.log = log_entries;
}
Setting up a test rig to find bugs in this implementation was the most challenging part of writing the Raft implementation. I’m going to cover that in a separate post because it’s quite involved, and it was more challenging to understand than actually writing the Raft implementation itself.
There is a lot I want to do in my implementation still, not least changing the way I write to and retrieve from storage. Another few things that most people do in their implementations:
There’s quite a few useful posts and repos out there that helped me significantly while doing this implementation. Here are a few of those.
You have a regular hash table. Everyone knows what that looks like, but just to kick off the visualisation, it’s shown below.
It’s pretty simple: you take a key, hash it and the hashed value leads you to some index. How you get from the hash to an index is implementation dependent - the hash could return an int, you could take a set of bits from the value, etc.
You have the standard problem with this hash table of collisions occurring, so you have different methods of dealing with them:
The hash table variations above typically don’t do well with large volumes of data, which is what is required in databases. You need a dynamic data structure that can grow and shrink to handle changes in data and can support high throughput in a concurrent environment.
Note: Hash tables can be used to build table indexes in databases, but B-trees are typically preferred because hash tables don’t support range queries.
It’s these two things that extendible hash tables do well - when they need to grow and shrink, they do so locally (I’ll explain below), and they can support fine-grained locking for concurrency.
There are 3 things to keep track of in an extendible hash table - a header, a directory and a bucket. A header allows you to index into a directory and a directory allows you to index into a bucket. The image below shows an extendible hash table where each directory indexes into a unique bucket.
A header maintains a max depth, a directory maintains a global depth and a bucket maintains a local depth. The image below shows a separate header slot in dotted lines but that’s just to illustrate the difference.
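Roughly, the three levels look like the sketch below (sketched in Rust and heavily simplified - a real, disk-based implementation would store page ids rather than nested vectors):

use std::cell::RefCell;
use std::rc::Rc;

// Simplified in-memory sketch of the three levels and their depths.
struct Bucket<K, V> {
    local_depth: u32,
    entries: Vec<(K, V)>,
}

struct Directory<K, V> {
    global_depth: u32,
    // 2^(global depth) slots; several slots may point to the same bucket,
    // which is why the buckets sit behind Rc.
    buckets: Vec<Rc<RefCell<Bucket<K, V>>>>,
}

struct Header<K, V> {
    max_depth: u32,
    // 2^(max depth) slots, each indexing into a directory.
    directories: Vec<Directory<K, V>>,
}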
The complexity with extendible hash tables comes with bookkeeping. A dynamic data structure that can grow and shrink on demand typically has a lot of invariants to track and programmer discipline is required to ensure they are maintained.
There were three variables in the image above: MD (maximum depth), GD (global depth) and LD (local depth). All the invariants that need to be maintained center around these three variables:
1. The header has 2^(max depth) slots, each of which indexes into a directory.
2. A directory has 2^(global depth) slots, each of which indexes into a bucket.
3. The number of directory slots pointing to the same bucket is 2^(global depth - local depth).
.Again, it’s easier to understand this when you see an image depicting this.
The image above has two directory slots pointing to one bucket because the global depth is 2 and the local depth of the bucket is 1.
When an extendible hash table grows, it’s because of a bucket split. The condition under which a bucket splits is implementation dependent - it could be when the bucket is half-full or completely full. And when a bucket splits, its local depth increases by 1. So, if there was a bucket with a local depth of x, after the split there will be 2 buckets with a local depth of x+1 each.
We’ll split the hash table from the previous image to illustrate
The invariants are still maintained, specifically the third one. The number of directory slots indexing to a specific bucket is 2^(global depth - local depth).
We split a bucket and we have 4 directory slots pointing to an individual bucket each. Now, what happens if a bucket overflows and needs to split again? We do what a regular hash table does and grow the directory by a power of 2. But, this is where data structures like this get tricky - we have to ensure our invariants are maintained across state transitions.
To grow the directory, we need to increment the global depth to ensure that invariant number 2 is maintained. That’s easy enough: if the global depth is currently 2 (2^2 = 4 slots), it becomes 3 (2^3 = 8 slots).
Next, we need to ensure the bucket invariant (invariant number 3) is maintained, which is slightly trickier. For this, I’m going to take a shortcut and explain why the shortcut works later. The shortcut is that I’m going to mirror the pre-existing half of the directory after growing it.
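In code, the doubling-by-mirroring step might look something like this (a sketch against the toy structs from earlier, not a real implementation):

impl<K, V> Directory<K, V> {
    // Sketch only: double the directory by incrementing the global depth and
    // mirroring the existing half. Slot i and slot i + old_size end up pointing
    // to the same bucket, which keeps invariant 3 intact without moving any keys.
    fn grow(&mut self) {
        let old_size = self.buckets.len();
        self.global_depth += 1;
        for i in 0..old_size {
            let mirrored = Rc::clone(&self.buckets[i]);
            self.buckets.push(mirrored);
        }
    }
}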
Okay, now we have growth out of the way, it’s far easier to understand shrinkage because it’s the mirror opposite. A bucket can merge with another bucket in which case the local depth of the bucket decreases and a directory can shrink (halve in size) as long as all the invariants are maintained.
This section is about how to actually traverse the extendible hash table and we’ll also figure out what the use case for the global depth and the local depth is beyond just keeping track of the number of available slots.
You have some key which you hash with some uniformly distributed hash function and the resulting hash can be represented in 32 bits. You take the max depth number of most significant bits to find the correct directory from the header and the global depth number of least significant bits to find the correct bucket from the directory.
Here’s an illustration.
The example above assumes that the hash of the key 2 is the value 2 itself and in binary this would be 00….10 with 00 being the two MSB and 10 being the two LSB. We use the two MSB and the two LSB because that’s the max depth of the header and the global depth of the directory.
Now, if the directory grows and the global depth increases, the great thing is that none of the values have to be re-shuffled. Again, it’s easier to see why.
The number of bits we consider for the header is still 2 because the header depth hasn’t changed. The number of bits we consider for the directory is now 3 because it has increased by 1. But, given a sufficiently large number of bits (like 32), the bits at the end will still lead to index 2 (010) for the number 2.
Earlier I mentioned that when a directory is doubled, an easy trick is to mirror the first half of the directory.
The reason that works is shown in the image above. When the global depth of a directory is 3, you use the 3 LSB of a hash to determine which directory slot to look at.
With the key 2, the 3 LSB are 010, which leads to bucket 10 (the bucket has a local depth of 2). Now, with a key 6, whose hash is 000....110, this would lead you to directory slot 110, which would also lead you to bucket 10.
The directory, with a depth of 3, is using 3 bits of resolution. All the buckets, with local depths of 2, are using 2 bits of resolution. So, any hashes having the same 2 LSB but differing in the 3rd LSB will start off in different directory slots but end up in the same bucket.
This is why the trick of mirroring the directory works. In the image, every new directory slot differs from its mirror image only in the 3rd LSB, so it will always point to the same bucket as its mirror image.
The directory slots 2 (010) and 6 (110) will point to the same bucket with a local depth of 2 because their last 2 bits are identical.
Now, this is the real meat of data structures like extendible hash tables. How do you enable multiple threads (both readers and writers) to traverse a data structure like this concurrently?
There’s a technique called latch crabbing which the animation below illustrates. The basic idea is that your threads move across the data structure like a crab - they hook onto one component (the header or the directory) and then try to hook onto the second component (directory or bucket) without letting go of the first component. Only after they have successfully hooked onto the second component, do they let go of the first.
The animation below shows two threads (one reader, one writer) walking down the table using latch crabbing. The reader fails because the writer acquires an exclusive lock first.
Typically, latch crabbing works by allowing each thread to acquire a shared latch on a component until the thread absolutely requires an exclusive latch on some component. This is necessitated by a write into a bucket for hash tables. However, the tricky situation is what to do when a thread wants to acquire an exclusive latch on some component but another thread already has a shared lock on that component.
You saw the situation in the animation above where the read thread wanted to acquire a read latch on a bucket but the write thread acquired the write latch first.
There are two ways to address this - either you give preference to readers or to writers.
Assume that you wanted to give writer threads first preference because a write is a more expensive operation and it’s easier for readers to start over and traverse the data structure. In this case, you would write code like the following for the reader thread and writer thread respectively.
auto read_thread(int key, int value) -> void {
std::shared_lock<std::shared_mutex> header_lock(this->header_mutex_, std::defer_lock);
if (!header_lock.try_lock()) {
return read_thread(key, value);
}
// some logic
// ......
std::shared_lock<std::shared_mutex> directory_lock(this->directory_mutex_, std::defer_lock);
if (!directory_lock.try_lock()) {
header_lock.unlock();
return read_thread(key, value);
}
// some logic
// ......
std::unique_lock<std::shared_mutex> bucket_lock(this->bucket_mutex_, std::defer_lock);
if (!bucket_lock.try_lock()) {
directory_lock.unlock();
header_lock.unlock();
return read_thread(key, value);
    }
    // all three latches are held in shared mode - safe to read from the bucket
}
auto write_thread(int key, int value) -> void {
std::shared_lock<std::shared_mutex> header_lock(this->header_mutex_);
// some logic
// ......
std::shared_lock<std::shared_mutex> directory_lock(this->directory_mutex_);
header_lock.unlock();
// some logic
// ......
std::unique_lock<std::shared_mutex> bucket_lock(this->bucket_mutex_);
// some logic
if (no_directory_change_required) {
directory_lock.unlock();
// write to bucket
return;
}
}
At each point where the next lock cannot be acquired in shared mode, the reader thread gives up existing locks and restarts the read operation (to reduce lock contention). When a write thread cannot acquire a lock, the scheduler puts the thread to sleep. Once the mutex is available, the thread wakes up and acquires the mutex (there are a lot of details I’m glossing over here).
Now, the code snippet above doesn’t illustrate what happens if a directory needs to change. Assume a bucket needs to split and the directory needs to grow to accommodate that.
The reader thread stays the same but let’s add some logic to the writer thread.
auto write_thread(int key, int value) -> void {
std::shared_lock<std::shared_mutex> header_lock(this->header_mutex_);
// some logic
// ......
std::shared_lock<std::shared_mutex> directory_lock(this->directory_mutex_);
header_lock.unlock();
// some logic
// ......
std::unique_lock<std::shared_mutex> bucket_lock(this->bucket_mutex_);
// some logic
if (no_directory_change_required) {
directory_lock.unlock();
// write to bucket
return;
}
    // directory change required
    directory_lock.unlock();
    // re-acquire the directory latch in exclusive mode (while still holding the bucket latch)
    std::unique_lock<std::shared_mutex> directory_unique_lock(this->directory_mutex_);
    // modify the directory and write to bucket
    return;
}
This is a pretty simple logic extension but there’s a nasty bug hiding in here. Imagine the following series of events.
+--------------------------------------------------+
| Deadlock Illustration |
+-------------------------+------------------------+
| Thread 1 (Read) | Thread 2 (Write) |
+-------------------------+------------------------+
| header.shared_lock() | |
| | header.shared_lock() |
| | |
| | directory.shared_lock()|
| | |
| | bucket.unique_lock() |
| | |
| | directory.unlock() |
| directory.shared_lock() | |
| | |
| *waiting for bucket* | *waiting for dir lock* |
+-------------------------+------------------------+
The write thread acquires a shared lock on the directory first and then immediately after acquires a unique lock on the bucket. But, once it realises modifications need to be made to the directory, it needs to give up the shared lock before acquiring the unique lock. However, in the time span between giving up the shared lock and acquiring the unique lock, the read thread acquires the shared lock.
Now, the read thread is waiting to acquire a shared lock on the bucket, which it can’t because the write thread has a unique lock on it and the write thread is waiting to acquire a unique lock on the directory, which it can’t because the read thread has a shared lock on it.
The problem is that the code didn’t follow lock ordering, which says that locks should be relinquished in the reverse order of their acquisition.
What actually needs to be done is the following:
auto write_thread(int key, int value) -> void {
std::shared_lock<std::shared_mutex> header_lock(this->header_mutex_);
// some logic
// ......
std::shared_lock<std::shared_mutex> directory_lock(this->directory_mutex_);
header_lock.unlock();
// some logic
// ......
std::unique_lock<std::shared_mutex> bucket_lock(this->bucket_mutex_);
// some logic
if (no_directory_change_required) {
directory_lock.unlock();
// write to bucket
return;
}
// directory change required
    // release the locks in reverse order of acquisition -------- (1)
    bucket_lock.unlock();
    directory_lock.unlock();
    // re-acquire the locks, directory first and now in exclusive mode ------- (2)
    std::unique_lock<std::shared_mutex> directory_unique_lock(this->directory_mutex_);
    bucket_lock.lock();
    if (no_directory_change_required) { // re-check the condition --------- (3)
        directory_unique_lock.unlock();
        // write to bucket
        return;
    }
// modify and write to bucket
return;
}
Here, there are 3 important things to look at: (1) the existing locks are released in the reverse order of acquisition, (2) the directory and bucket locks are re-acquired in the original order, with the directory lock now taken in exclusive mode, and (3) the condition is checked again after re-acquiring the locks. The third one is important because the data structure could potentially have changed from under you in the time it took you to release and re-acquire the locks.
The reason for all this complexity is that there is no atomic way to upgrade a read lock to a write lock in C++ (in fact, I’m not sure what it would even mean to upgrade a lock atomically while doing latch crabbing or how much it would necessarily help).
Note: Boost does have a way to upgrade a lock, see here. But, that still wouldn’t completely solve the problem because a lock can’t be upgraded in the presence of another shared lock anyway. It would, however, simplify things slightly.
This was an interesting rabbit hole I jumped down. I’m still not quite clear what it means to atomically upgrade a lock. I’ve seen one implementation here which upgrades a mutex by draining all current readers while preventing any new read acquisitions.
There was also a pretty interesting Reddit thread about upgrading a mutex which got some interesting comments.
These are some useful references
Hopefully, you leave with a decent understanding of how memory ordering works and how to use atomics in conjunction with memory ordering to build a lock-free queue in C++.
Note: If you want to actually compile the code and run it, make sure to do so with the TSan flag (-fsanitize=thread) enabled for the Clang compiler. TSan is a reliable way of detecting data races in your code, instead of repeatedly running the code in a loop hoping for a data race to occur.
Imagine you have some concurrent code operating on shared data in memory: two threads or processes, one writing and one reading the same piece of state.
The “safe” way of dealing with this is mutexes. However, mutexes tend to add overhead. Atomics are a more performant but much more complicated way of dealing with concurrent operations.
This section is very simple. Atomics are simply operations or instructions that cannot be split by the compiler or the CPU - they execute as a single, indivisible step.
The simplest possible example of an atomic in C++ is an atomic flag
#include <atomic>
#include <cassert>
std::atomic<bool> flag(false);
int main() {
flag.store(true);
assert(flag.load() == true);
}
We define an atomic boolean, initialise it and then call store on it. This method sets the value on the flag. You can then load the flag from memory and assert its value.
The operations you can perform with atomics are straightforward:
- store(): a write operation.
- load(): a read operation.
- compare_exchange_weak() or compare_exchange_strong(): a read-modify-write (RMW) operation.
The important thing to remember is that each of these cannot be split into separate instructions.
Note: There are more methods available, but this is all we need for now
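The store and load calls show up in the examples below, but compare-and-swap doesn’t appear until the queue at the end, so here’s a tiny standalone example of what it does (the counter and the values are made up purely for illustration):

#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> counter(5);
    int expected = 5;
    // Atomically: if counter still holds `expected`, replace it with 6 and
    // return true; otherwise copy the current value into `expected` and
    // return false. The whole check-and-set is one indivisible step.
    bool swapped = counter.compare_exchange_strong(expected, 6);
    assert(swapped && counter.load() == 6);
}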
There are various atomics available in C++ and you can use them in combination with memory orderings.
This section is a lot more complicated and is the meat of the matter. There are some great references for understanding this I’ve linked at the bottom.
The compiler and CPU are capable of re-ordering your program instructions, often independently of one another. That is, your compiler can re-order instructions and your CPU can re-order instructions again. See here.
However, this is only allowed if the compiler or CPU can establish that there is no dependency between the two sets of instructions.
For instance, this can be re-ordered because there is no relationship between the assignment to x and the assignment to y. That is, the compiler or CPU might assign y first and then x. But, this doesn’t change the semantic meaning of your program.
int x = 10;
int y = 5;
However, the code below cannot be re-ordered because the compiler cannot establish the absence of a relationship between x and y. It’s obvious to see here because y depends on the value of x.
int x = 10;
int y = x + 1;
It doesn’t seem like a big problem until there’s multi-threaded code.
#include <cassert>
#include <thread>
int data = 0;
void producer() {
data = 100; // Write data
}
void consumer() {
assert(data == 100);
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
The multi-threaded example above will be flagged by TSan when run because there is a clear data race: thread 1 is trying to set the value of data while thread 2 is trying to read it. The easy answer here is a mutex to protect the write & read of data, but there’s a way to do this with an atomic boolean.
We loop on the atomic boolean until we find that it is set to the value we are looking for and then check the value of data.
#include <atomic>
#include <cassert>
#include <thread>
int data = 0;
std::atomic<bool> ready(false);
void producer() {
data = 100;
ready.store(true); // Set flag
}
void consumer() {
while (!ready.load())
;
assert(data == 100);
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
When you compile this with TSan and run it, it doesn’t complain about any race conditions. Note: I’m going to refer back to why TSan doesn’t complain here.
Now, I’m going to break it by explicitly specifying a relaxed memory ordering. Just replace ready.store(true); with ready.store(true, std::memory_order_relaxed); and replace while (!ready.load()) with while (!ready.load(std::memory_order_relaxed)).
TSan will complain that there is a data race. But, why is it complaining?
The issue here is that we have no order among the operations of the two threads anymore. The compiler or CPU is free to re-order instructions in the two threads. If we go back to our abstract visualisation from earlier, this is what it looks like now.
The visualisation above shows us that our two threads have no way of agreeing upon the current state or the order in which that state has changed.
Once thread 2 determines that the flag has been set to true, it tries to read the value of data. But, thread 2 believes that the value of data has not yet changed even though it believes the value of the flag has been set to true.
This is confusing because the classic model of interleaving concurrent operations doesn’t apply here. In the classic model of concurrent operations, there is always some order that can be established. For instance, we can say that this is one possible scenario of operations.
Thread 1 Memory Thread 2
--------- ------- ---------
| | |
| write(data, 100) | |
| -----------------------> | |
| | |
| store(ready, true) | |
| -----------------------> | |
| | load(ready) == true |
| | <---------------------- |
| | |
| | read(data) |
| | <---------------------- |
| | |
But, the above graphic assumes that both threads have agreed upon some global order of events, which isn’t true at all anymore! This is still confusing for me to wrap my head around.
In memory_order_relaxed mode, two threads have no way of agreeing upon the order of operations on shared variables. From thread 1’s point of view, the operations it executed were:
write(data, 100)
store(ready, true)
However, from thread 2’s point of view, the order of operations it saw thread 1 execute was:
store(ready, true)
write(data, 100)
Without agreeing upon the order in which operations occurred on shared variables, it isn’t safe to make changes to those variables across threads.
Okay, let’s fix the code by replacing std::memory_order_relaxed with std::memory_order_seq_cst.
So, ready.store(true, std::memory_order_relaxed); becomes ready.store(true, std::memory_order_seq_cst); and while (!ready.load(std::memory_order_relaxed)) becomes while (!ready.load(std::memory_order_seq_cst)).
If you run this again with TSan, there are no more data races. But why did that fix it?
So, we saw our problem earlier was about two threads being unable to agree upon a single view of events and we wanted to prevent that. So, we introduced a barrier using sequential consistency.
Thread 1 Memory Thread 2
--------- ------- ---------
| | |
| write(data, 100) | |
| -----------------------> | |
| | |
| ================Memory Barrier=================== |
| store(ready, true) | |
| -----------------------> | |
| | load(ready) == true |
| | <---------------------- |
| ================Memory Barrier=================== |
| | |
| | read(data) |
| | <---------------------- |
| | |
The memory barrier here says that nothing before the store operation can be re-ordered to after it, and nothing after the load operation can be re-ordered to before it. That is, thread 2 now has a guarantee that the compiler or the CPU will not place the write to data after the write to the flag in thread 1. Similarly, the read operation in thread 2 cannot be re-ordered above the memory barrier.
The region inside the memory barriers is akin to a critical section that a thread needs a mutex to enter. We now have a way to synchronise the two threads on the order of events across them.
This brings us back to our classic model of interleaving in concurrency because we now have an order of events both threads agree upon.
There are 3 main types of memory order:
1. Relaxed (std::memory_order_relaxed)
2. Release-acquire (std::memory_order_release paired with std::memory_order_acquire)
3. Sequentially consistent (std::memory_order_seq_cst)
We already covered 1 & 3 in the examples above. The second memory model literally lies between the other two in terms of consistency.
#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
int data = 0;
std::atomic<bool> ready(false);
void producer() {
data = 100;
ready.store(true, std::memory_order_release); // Set flag
}
void consumer() {
while (!ready.load(std::memory_order_acquire))
;
assert(data == 100);
}
int main() {
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
return 0;
}
The above is the same example from earlier, except with std::memory_order_release used for ready.store() and std::memory_order_acquire used for ready.load(). The intuition here for ordering is similar to the previous memory barrier example.
Except, this time the memory barrier is formed on the pair of ready.store() and ready.load() operations and will only work when used on the same atomic variable across threads. Assuming you have a variable x being modified across 2 threads, you could do x.store(value, std::memory_order_release) in thread 1 and x.load(std::memory_order_acquire) in thread 2 and you would have a synchronization point across the two threads on this variable.
The difference between the sequentially consistent model and the release-acquire model is that the former enforces a global order of operations across all threads, while the latter enforces an order only among pairs of release and acquire operations.
Now, we can revisit why TSan didn’t complain about a data race initially when no memory order was specified. It’s because C++ assumes std::memory_order_seq_cst by default when no memory order is specified. Since this is the strongest memory ordering, the store/load pair on the flag synchronises the two threads and there is no data race.
Different memory models have different performance penalties on different hardware.
For instance, the x86 architecture’s instruction set implements something called total store ordering (TSO). The gist of it is that the model resembles all threads reading and writing to a single shared memory. You can read more here.
This means that x86 processors can provide sequential consistency at a relatively low computational penalty.
On the other side, the ARM family of processors has a weakly ordered instruction set architecture. This is because each thread or process reads and writes to its own copy of memory. Again, the link above provides context.
This means that ARM processors provide sequential consistency at a much higher computational penalty.
I’m going to use the operations we have discussed so far to build the basic operations of a lock-free concurrent queue. This is by no means even a complete implementation, just my attempt to re-create something basic using atomics.
I’m going to represent the queue using a linked list and wrap each node inside an atomic.
#include <atomic>
#include <memory>

template <typename T>
class lock_free_queue {
   private:
    struct node {
        std::shared_ptr<T> data;
        std::atomic<node*> next;
        node() : next(nullptr) {}  // initialise the node
    };
    std::atomic<node*> head;
    std::atomic<node*> tail;
};
Now, for the enqueue operation, this is what it’s going to look like
void enqueue(T value) {
std::shared_ptr<T> new_data = std::make_shared<T>(value);
node* new_node = new node();
new_node->data = new_data;
// do an infinite loop to change the tail
while (true) {
node* current_tail = this->tail.load(std::memory_order_acquire);
node* tail_next = current_tail->next.load(std::memory_order_acquire);
// everything is correct so far, attempt the swap
if (current_tail->next.compare_exchange_strong(
tail_next, new_node, std::memory_order_release)) {
this->tail = new_node;
break;
}
}
}
The main focus is on the load and compare_exchange_strong operations. The load works with an acquire ordering and the CAS works with a release ordering so that reads and writes to the tail are synchronized.
Similarly for the dequeue operation
std::shared_ptr<T> dequeue() {
std::shared_ptr<T> return_value = nullptr;
// do an infinite loop the change the head
while (true) {
node* current_head = this->head.load(std::memory_order_acquire);
node* next_node = current_head->next.load(std::memory_order_acquire);
// note: an empty queue (next_node == nullptr) isn't handled in this sketch
if (this->head.compare_exchange_strong(current_head, next_node,
std::memory_order_release)) {
return_value.swap(next_node->data);
delete current_head;
break;
}
}
return return_value;
}
Note: This queue doesn’t handle the ABA problem. This blog post is too long to bring in hazard pointers, so I’m leaving that out
So there you have it. Atomics in C++. Very complicated and there is zero chance I’d ever put this into production. Especially because I’m fairly certain my concurrent queue would break ;-)
If you want access to the code mentioned above, look at this file and this file.
Here are some articles and links which I found helpful while writing this blog:
I’ve been trying to learn Rust and C++ side by side as a way to dive deeper into systems programming. Specifically, I’m trying to understand more about their concurrency and parallelism models.
As a way of doing this I’m trying to implement a bunch of algorithms that are “embarrassingly parallel”. That is, algorithms that are inherently parallelizable because they use a divide and conquer methodology.
For a full list of these kinds of algorithms, look here at the CMU library.
For this, I decided to implement the easiest one - merge sort.
Merge sort is inherently easy to parallelize because it involves dividing an array into smaller contiguous subarrays, sorting each of those recursively, and then combining them in a separate merge step.
This is a standard algorithm so I’m not going to spend too much time on it.
Here’s the Rust & C++ code for the implementation
fn merge_sort(arr: &mut [i32]) {
let mid = arr.len() / 2;
if mid == 0 {
return;
}
merge_sort(&mut arr[0..mid]);
merge_sort(&mut arr[mid..]);
let mut res = arr.to_vec();
merge(&arr[0..mid], &arr[mid..], &mut res[..]);
arr.copy_from_slice(&res[..]);
}
fn merge(left: &[i32], right: &[i32], ret: &mut [i32]) {
let mut left_index = 0;
let mut right_index = 0;
let mut ret_index = 0;
while left_index < left.len() && right_index < right.len() {
if left[left_index] <= right[right_index] {
ret[ret_index] = left[left_index];
left_index += 1;
} else {
ret[ret_index] = right[right_index];
right_index += 1;
}
ret_index += 1;
}
if left_index < left.len() {
ret[ret_index..].copy_from_slice(&left[left_index..]);
} else if right_index < right.len() {
ret[ret_index..].copy_from_slice(&right[right_index..]);
}
}
void merge(vector<int>& left, vector<int>& right, vector<int>& ret) {
int left_index = 0;
int right_index = 0;
int ret_index = 0;
while (left_index < left.size() && right_index < right.size()) {
if (left[left_index] <= right[right_index]) {
ret[ret_index] = left[left_index];
left_index++;
} else {
ret[ret_index] = right[right_index];
right_index++;
}
ret_index++;
}
while (left_index < left.size()) {
ret[ret_index] = left[left_index];
left_index++;
ret_index++;
}
while (right_index < right.size()) {
ret[ret_index] = right[right_index];
right_index++;
ret_index++;
}
}
void merge_sort(std::vector<int>& arr) {
int mid = arr.size() / 2;
    if (mid == 0) {
        return;
    }
vector<int> left(arr.begin(), arr.begin() + mid);
vector<int> right(arr.begin() + mid, arr.end());
merge_sort(left);
merge_sort(right);
vector<int> ret(arr.size());
merge(left, right, ret);
arr = std::move(ret);
}
Of all the parallel algorithms to implement, this is typically the easiest because the only thing you can parallelize here is the division of the arrays at the mid point.
For each division, I fire off a new thread for the left-hand side and expect it to give me back the sorted left half. The merge function isn’t parallelized because it operates on both halves at once, so the work can’t be split across independent threads.
The implementation in Rust and then C++
fn merge_sort_parallel(arr: &mut [i32]) {
let mid = arr.len() / 2;
if mid == 0 {
return;
}
if arr.len() < THRESHOLD.try_into().unwrap() {
merge_sort(arr);
return;
}
let (left, right) = arr.split_at_mut(mid);
let mut left_clone = left.to_vec();
let mut right_clone = right.to_vec();
let left_handle = thread::spawn(move || {
merge_sort_parallel(&mut left_clone);
return left_clone;
});
merge_sort_parallel(&mut right_clone);
let left_sorted = left_handle.join().unwrap();
let mut res = arr.to_vec();
merge(&left_sorted, &right_clone, &mut res[..]);
arr.copy_from_slice(&res[..]);
}
void merge_sort_parallel(vector<int>& arr) {
int mid = arr.size() / 2;
if (mid == 0) {
return;
}
if (arr.size() < THRESHOLD) {
merge_sort(arr);
return;
}
vector<int> left(arr.begin(), arr.begin() + mid);
vector<int> right(arr.begin() + mid, arr.end());
std::thread left_thread([&left]() { merge_sort_parallel(left); });
merge_sort_parallel(right);
left_thread.join();
vector<int> ret(arr.size());
merge(left, right, ret);
arr = std::move(ret);
}
The benchmarks for each are below
Rust
Time elapsed for sequential sorting is: 8.233871417s
Time elapsed for concurrent sorting is: 2.890036125s
C++
Time elapsed for sequential sorting: 1.17858 seconds
Time elapsed for concurrent sorting: 0.444358 seconds
Both C++ and Rust were compiled for release. Nothing strange about the fact that there is a significant difference between the two runs, especially given that my tests were running for 100,000,000 elements in the array.
You can find the full repo here
I have been working my way through this CMU course which is an introduction to database systems. I’ve primarily been using it as a learning tool for C++ since that seems to be the language of choice for most implementations of open source database engines.
This week’s project was about implementing a replacer, a buffer pool and a disk interaction mechanism. I thought it was a great way for me to build my intuition on how these things actually work in a database system.
You can find the project here
So, storage for a database is primarily divided into two categories - volatile memory (RAM) and non-volatile storage (disk).
For this project, the assumption is that the data is stored entirely on disk and then fetched from disk when required.
Here’s what the interaction between the three components looks like.
The buffer pool manager is responsible for interaction with the other two components. They never interact with each other directly.
Before we get into the components, let’s quickly explore cache locality
So, this topic comes up quite frequently when designing database systems and I want to dig into it here. What exactly does cache locality mean?
So there’s typically two classes of storage in a computer - non-volatile and volatile. Non-volatile memory means disk and volatile memory means RAM.
Now, volatile memory can actually be more than just RAM. There’s 3 classes of caches - L1, L2 & L3 - that sit between RAM and the CPU cores. Typically L1 & L2 are faster to access than L3.
When the CPU needs to access data from “page 2,” a cache miss occurs at the L1 and possibly L2 levels. The hardware then checks the L3 cache, and if another miss occurs, the request is passed to RAM. The buffer pool manager, realizing the page is not in RAM, issues a disk read request via the operating system. The requested page is then loaded into RAM. As the CPU accesses the data from RAM, it gets loaded into the various cache levels (L3, L2, L1) according to the system’s caching policies. Here’s a decent link if you want to read more about this.
Now that we understand how data propagates through these different levels of memory, let’s get into the concept of cache locality. When page 2 is fetched from disk, it’s quite likely that adjacent pages like page 1 and page 3 are also fetched, even if they weren’t explicitly requested. This is due to the principle of spatial locality, where modern systems assume that if one memory location is accessed, nearby memory locations are likely to be accessed soon after. This is often managed by the disk controller or the operating system through mechanisms like read-ahead or prefetching.
So, in scenarios like a sequential scan, the benefit of spatial locality comes into play. Reading page 2 could also bring page 3 and possibly page 4 into faster levels of memory, reducing the likelihood of cache misses and thereby speeding up subsequent reads.
Now let’s dive into the actual components.
Think of the buffer pool manager as some memory that has been assigned using malloc when the database engine first starts up. It can hold enough memory for a finite number of pages and the number of pages is sometimes configurable by the end user.
The image above shows a contiguous piece of memory with different colours representing different pages.
Contiguous here is really important and it’s the reason the memory is usually allocated as soon as the database engine starts up versus allocating memory for the buffer pool on a per-need basis.
Having a contiguous piece of memory typically helps with cache locality, which we discussed above.
The numbers in the image above represent the index used to access them when represented in some array like Page[] pages. But don’t assume that the index itself represents the page number, that isn’t the case.
Each of these pages is mapped to a frame which is a concept shared by the buffer pool manager and the replacer. Each of these frames represents one index in the array of pages above.
Now, to determine what page is mapped to what frame, the buffer pool manager makes use of something called a page table. This is just a std::unordered_map<int, int> which tells you which page is stored in what frame.
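As a rough sketch of the lookup path (the names here are illustrative, not the course’s actual API): a fetch first consults the page table and only falls back to the disk manager on a miss.

#include <unordered_map>

using page_id_t = int;
using frame_id_t = int;

struct BufferPoolSketch {
    std::unordered_map<page_id_t, frame_id_t> page_table_;

    // Returns the frame currently holding the page, or -1 if the page isn't
    // resident and would have to be read from disk (possibly evicting a frame).
    frame_id_t lookup(page_id_t page_id) const {
        auto it = page_table_.find(page_id);
        return it == page_table_.end() ? -1 : it->second;
    }
};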
Now, we mentioned above that we have finite memory. One of the problems with finite memory is that it is finite. Assume that our buffer pool initially has space for about 10 pages, and our database grows to 25 pages. At some point we are going to want to read data that isn’t in memory and the buffer pool will have to decide which page to evict to make space for the new page.
The logic controlling eviction is called an eviction policy. LRU-K is a variant on the classic LRU replacer algorithm. The only difference with LRU-K is that it takes into account the frequency of access and uses that to decide which frame to kick out.
Notice that I say frame here and not page. This level of indirection is important because the replacer has no concept of the page that is being accessed but the frame that is being accessed. That is, the index in the pages array of the buffer pool manager.
If you want to read more about this, check out the course notes from the lecture on memory management.
It’s a fairly straightforward algorithm and implementing it really helps give you an intuition for what is going on.
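To give a flavour of the idea, here’s a toy sketch of LRU-K (my own simplification, not the course’s interface - it ignores pinning and evictability entirely): keep the last K access timestamps per frame, and evict the frame whose K-th most recent access is oldest, preferring frames that haven’t yet been accessed K times.

#include <cstdint>
#include <deque>
#include <limits>
#include <unordered_map>

class LruKSketch {
   public:
    explicit LruKSketch(size_t k) : k_(k) {}

    // Record an access to a frame, keeping only the last K timestamps.
    void record_access(int frame_id) {
        auto& hist = history_[frame_id];
        hist.push_back(now_++);
        if (hist.size() > k_) hist.pop_front();
    }

    // Pick the frame to evict, or -1 if no frame is tracked. Frames with
    // fewer than K recorded accesses are evicted first; among the rest, the
    // one whose K-th most recent access is oldest wins.
    int victim() const {
        int victim_frame = -1;
        uint64_t oldest = std::numeric_limits<uint64_t>::max();
        bool victim_has_k = true;
        for (const auto& [frame_id, hist] : history_) {
            bool has_k = hist.size() >= k_;
            uint64_t kth = hist.front();  // oldest of the (up to) K recorded accesses
            if ((victim_has_k && !has_k) ||
                (victim_has_k == has_k && kth < oldest)) {
                victim_frame = frame_id;
                oldest = kth;
                victim_has_k = has_k;
            }
        }
        return victim_frame;
    }

   private:
    size_t k_;
    uint64_t now_ = 0;
    std::unordered_map<int, std::deque<uint64_t>> history_;
};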
The last piece is probably the most straightforward. This component literally takes read and write requests and reads and writes those pages from or to disk. Nothing significantly complicated going on here.
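A minimal sketch of what that read/write path could look like, assuming pages are fixed-size blocks laid out back to back in a single database file (the names and the 4 KiB page size are assumptions, not the course’s code):

#include <fstream>

constexpr std::streamoff PAGE_SIZE = 4096;

// Read the page with the given id into `out`, which must point to at
// least PAGE_SIZE bytes.
void read_page(std::fstream& db_file, int page_id, char* out) {
    db_file.seekg(static_cast<std::streamoff>(page_id) * PAGE_SIZE);
    db_file.read(out, PAGE_SIZE);
}

// Write PAGE_SIZE bytes from `in` at the page's offset and flush to disk.
void write_page(std::fstream& db_file, int page_id, const char* in) {
    db_file.seekp(static_cast<std::streamoff>(page_id) * PAGE_SIZE);
    db_file.write(in, PAGE_SIZE);
    db_file.flush();
}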
If you want to follow along with me working my way through the course, you can do so here. And I would highly recommend checking out the course videos, they are a great resource and are freely available on YouTube.
Linearizability and serializability seem to come from two different domains - linearizability comes from the world of distributed computing and serializability comes from the world of ANSI SQL.
However, since databases themselves are distributed systems now, these two terms tend to get conflated (especially in my head). I watched a recent video of Martin Kleppmann giving a talk on transactions here, and throughout the talk I couldn’t help but wonder why serializability is the highest standard of isolation in the SQL standard.
But, that makes sense because linearizability itself has nothing to do with the SQL standard.
Let’s dive in.
There is a great article here which covers the same topic
Let’s imagine a distributed key value store. The key of note for us here is x.
We have multiple clients that connect to this distributed key value store and try to read the value of x from different nodes. All of the reads from different clients will return the same value, which is good because this means our system is consistent.
Now, let’s say one of the clients updates the value of x from 5 to 6 via a PUT method. Now, for writes to occur, it will go to one of the nodes and for the duration of this write’s execution and replication, that node will act as a ‘leader’ for the write.
Let’s say the write happens successfully on the leader node and immediately after, another client tries to read the value of x from one of the other nodes. This will return 5, which is incorrect.
Therefore, this system is not linearizable because there is no point in time demarcating when the value of x changed from 5 to 6 across the system.
Instead of the leader node acknowledging the write immediately after storing it locally, if it had first replicated the write across the other nodes, the system would have been linearizable.
The point in time at which the value of x changed from 5 to 6 in the system would be called the linearization point. If all changes to any object are done in a linearized manner, it is possible to build a linearizable history of that object (in our case x). Since we have a linearizable history for each object, it is also possible to have a global order of all operations based on the wall-clock time of each operation.
This is one of the isolation levels in the ACID acronym defined as part of the ANSI SQL standard. So, typically, serializability is used in the context of transactions in a SQL database.
Imagine a database with two values - A & B. The initial value of both these variables is 0.
Now, there are two concurrent transactions operating on the database - T1 & T2.
These two transactions do slightly different things
- T_1 will increment the value of A & B by 1
- T_2 will double the value of A & B
Since these transactions are concurrent, the order of operations can be very different. If we were to decide to serially execute these transactions, then first only T_1 would run and only after T_1 finishes would T_2 run. This would result in the values of A=2 and B=2.
However, this would seriously limit the throughput of the system.
Another option is to allow the operations to interleave in random order like in the image below. However, the issue with this approach is that the outcome of the transaction would not be correct.
For any set of concurrently occurring transactions, there can be multiple possible serializable histories. As long as these histories result in the same outcome, they are valid. What is a valid outcome? The outcome that would result if you were to run these transactions serially - that is one after another.
In our case, as long as any order of operations results in A=2 and B=2, that would be a valid serializable history.
I have no idea if this is accurate but I wanted to give this a shot. How many possible histories can there be for the transactions listed above?
There are four steps in each transaction - two involved with reading the values of A & B and two involved with writing the values of A & B. This gives us a total of 8 steps.
The total possible permutations of this is 8! = 40,320
However, some steps require an ordering: A & B have to be read before being written by both transactions. There are two possible ways to order the read and write for each value, so this would be 2! permutations, and there are four of these sets.
Then, we have 8!/(2! * 2! * 2! * 2!) = 2,520
Again, I have no idea if this calculation is correct (permutations and combinations has always been my weak point) but I thought this was an interesting exercise.
Each of these 2,520 possibilities is an execution history, but not all of them are serializable histories since they don’t necessarily result in a valid outcome.
I believe the point of confusion for me came from conflating distributed systems and databases. Databases are not necessarily distributed systems but they can be.
Earlier, I thought of serializability as something achieved on a single node and linearizability as something achieved across multiple nodes. However, this is the wrong mental model since transactions are also capable of running across multiple nodes.
Instead, it is better to think of consistency in ACID and CAP as separate. Serializability helps maintain the C in ACID and linearizability helps maintain the C in CAP.
Last time I discussed the 2PC protocol and I want to discuss its lesser known cousin, the 3PC protocol.
The 3PC protocol seems to be more of an intellectual curiosity than anything else because it hasn’t seen very widespread adoption in real world systems. This is due to a variety of reasons including the complexity of the algorithm.
There are 3 primary phases of the protocol: CanCommit, PreCommit and DoCommit.
Each of these phases represents a message that is sent from the coordinator to every participant that is currently part of the same network.
Now, the main difference from the 2PC protocol comes in the form of adding timeouts between each of the stages.
So, to kick things off the coordinator sends out a CanCommit message to each of the nodes. The coordinator then immediately starts a timeout interval. If it doesn’t receive a response from any one of the nodes within that interval, it assumes that node has failed and will abort the transaction. Each node checks whether the resources it needs to lock can actually be locked, and applies those locks.
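To make the timeout idea concrete, here’s a toy sketch in C++ (not how a real implementation would structure its RPC layer): each participant’s vote arrives on a future, and a vote that doesn’t arrive within the window counts as a failure, so the coordinator aborts. A real coordinator would measure one window across all participants rather than per vote, but the shape is the same.

#include <chrono>
#include <future>
#include <vector>

// Returns true only if every participant votes yes within the timeout.
bool can_commit_phase(std::vector<std::future<bool>>& votes,
                      std::chrono::milliseconds timeout) {
    for (auto& vote : votes) {
        if (vote.wait_for(timeout) != std::future_status::ready) {
            return false;  // no response in time - assume the node failed
        }
        if (!vote.get()) {
            return false;  // the node voted no
        }
    }
    return true;
}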
If the node replies to the CanCommit message with a YES, then the node will start a timeout interval. If the node in turn doesn’t receive a PreCommit message from the coordinator within that interval, it assumes that the coordinator has failed and aborts the transaction.
Once the node receives a PreCommit message, it will make some preliminary writes to its local storage and perhaps to its WAL (I’m not 100% sure here because this seems very implementation dependent). Once these writes are done, it will reply with a YES or NO depending on whether or not the writes were completed.
Now, once the DoCommit message is sent from the coordinator to each node, this is a confirmation that the preliminary writes can be applied to durable storage and the locks can be released. Once the locks are released, the node sends an acknowledgement to the coordinator.
Now, comes the interesting bit - the failure modes. This also helps answer why this protocol is any better than 2PC.
In 2PC, if the coordinator fails after the prepare phase, the nodes are stuck in an uncertain state. They would have acquired the locks on the resources but will have no idea what to do next. One of the solutions could perhaps be to wait for a backup leader to come online, determine what the state of the previous transaction is and continue from there.
This is one of the primary advantages of 3PC. If the leader node crashes after sending out a PreCommit message but before sending out a DoCommit message, then the nodes have two options: they can wait for their timeout to expire and decide on their own, or they can run a termination protocol amongst themselves.
The termination protocol is where all the nodes discuss with each other what they replied to the last PreCommit message sent by the leader node.
The fundamental essence of this is essentially a consensus protocol (Raft and Paxos are more complicated versions of this). If there are multiple nodes each with a different value for the same key, how do you decide which value is correct?
Again, the solutions to this would probably be implementation dependent. But, the simplest version would be an ‘abort if any node aborts’ policy. That is, if any node voted no to the PreCommit message, all nodes will abort the transaction. This ensures consistent state with relatively little complexity overhead.
What is a far more interesting question is how do nodes communicate with each other to determine what outcome to take?
There are a couple of different options:
Each of these communication patterns ends up achieving the same goal, which is disseminating information among nodes to determine what direction to take.
Of course, each of these communication patterns introduces its own set of failures. For instance, depending on a central ledger like ZooKeeper introduces a single point of failure. If the ledger node goes down, how do nodes communicate with each other? The default would be to abort if they cannot reach each other within some pre-defined timeout.
Having said that, this termination protocol is the primary reason 3PC is preferred over 2PC. It doesn’t block in the case of a leader node failure.
A silly question that occurred to me (and I tend to have a lot of silly questions) was, why can’t we just add the termination protocol to 2PC?
There is a prepare phase after which, if the leader fails, the entire transaction is blocked. Why not include a termination protocol right there to allow nodes to carry forward?
The answer, of course, is that the leader hasn’t issued a decision yet. The prepare phase is only about collecting votes from nodes to determine if the transaction can be carried out. The leader has yet to decide if the transaction should be carried out.
This is, of course, different from 3PC where there is an explicit stage where the leader collects votes and then issues a decision before asking nodes to commit that decision.
This is the achilles heel of 3PC, where the protocol really fails.
Imagine the following scenario
The coordinator sends out a pre-commit message to the three nodes - A, B & C.
Nodes A and B find that they cannot carry out the associated action and they reject it. Node C can and carries it out.
However, before the coordinator can send out an abort message to all three nodes, a network partition separates node C from the rest.
Now, node C is unaware of any further decisions taken by the coordinator and, after the timeout exceeds its limit, C will independently decide to commit the transaction.
Now, one of the things I am unsure of is in the presence of a termination protocol, why would C choose to commit? Shouldn’t C choose to fail because it cannot reach any other node?
Once the network partition heals, C will come back online but will have a transaction committed that the other nodes will not and the system will be in an inconsistent state.
One of the strangest things about 3PC is the relative lack of adoption in systems, despite having an obvious advantage over 2PC.
It’s obviously more complex to implement and involves the use of timeouts, which might be hard to maintain and debug, but given that it so clearly unblocks transactions, you would assume that it should see more widespread adoption.
Another drawback of 3PC is the additional latency due to the increased number of rounds of messages.
On a personal note, I wonder if the network partitions leading to an inconsistent state is a bigger problem than it seems. It’s one of those silent failures that might be a killer.
Another silly question I had while writing this was whether clock skew would affect the timeout process between nodes. But, it wouldn’t because the timeout is started by each node individually, so it would be local to that node.
Usually, when you read about 3PC (or 2PC for that matter), the protocol discusses locks that are placed on resources that are currently involved in any transaction.
This led me to wonder how it would work in cases where locks aren’t traditionally employed for writes (like MVCC, which maintains multiple versions of an object). I imagine in this scenario, the PreCommit phase would involve writing a new version of an object but not committing it. Any other transaction that tried to write a new version of the same object would find that it was blocked because it couldn’t “lock” that resource. During the Commit phase, that write would be made visible by committing the version of the object created during the PreCommit phase, post which new writes to that object could be performed.
Continuing on with stuff I’ve read in Database Internals by Alex Petrov, I wanted to look at 2PC today.
This is a fairly common approach to handling distributed transactions so this will be a fairly short blog post. I used this as more of an exercise to write some Go code for the very first time more than anything else.
Imagine you have a distributed key value store. I love taking the example of a distributed key value store because it’s simple enough to understand that you don’t get lost in the weeds of trying to figure out the system you’re building 2PC for and just focus on 2PC itself.
Our key value store is just going to be a map of type map[string]string and these maps will be split across multiple nodes.
Each of our nodes can hold some subset of the overall data that we want. However, that data may or may not be replicated across multiple nodes. So, in situations where we want to modify the value of a key value pair, we would need to do it across all instances where that pair is present. The update should be applied to all of them atomically or not at all.
2 phase commit is really simple to understand. It’s called 2 phase because there’s literally 2 phases in it - a prepare phase and a commit phase.
There are two main components involved in 2PC - the coordinator and the participants. There can only be one coordinator and there can be any number of participants.
The prepare phase involves the coordinator sending out a message asking the participants to prepare for the transaction and waiting for a response from each of them. If the coordinator gets a successful response from all the participants, it sends out a message telling participants to commit the transaction.
I like writing code to explore how something actually works because it helps solidify certain concepts.
type Message string
const (
Prepare Message = "PREPARE"
VoteCommit = "VOTE_COMMIT"
VoteAbort = "VOTE_ABORT"
Commit = "COMMIT"
Abort = "ABORT"
)
type Coordinator struct {
participants []*Participant
}
type Participant struct {
index int
data map[string]string
locks map[string]bool
}
func NewParticipant(index int) *Participant {
return &Participant{
index: index,
data: make(map[string]string),
locks: make(map[string]bool),
}
}
So, we have a few different structs - one for the coordinator and one for the participants. We also have a way to create a new participant because each of them needs to hold some data.
Okay, now we just need to fire off a transaction and see if that can be successfully completed or not.
func (c *Coordinator) initiateTransaction(oldValue string, newValue string) {
// Phase 1: Prepare Phase
for _, participant := range c.participants {
msg := participant.prepare(oldValue, newValue)
if msg == VoteAbort {
c.abortTransaction(oldValue, newValue)
return
}
}
// Phase 2: Commit Phase
c.commitTransaction(oldValue, newValue)
}
func main() {
data := []string{"apple", "banana", "cherry", "date", "fig"}
participants := make([]*Participant, len(data))
for i, value := range data {
participants[i] = NewParticipant(i)
participants[i].data[value] = value
}
coordinator := &Coordinator{
participants: participants,
}
var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
coordinator.initiateTransaction("apple", "APPLE")
}()
wg.Wait()
}
We fire off a single transaction that tries to rewrite apple to APPLE within a goroutine.
For each transaction that is fired off the coordinator checks with each participant if it can fulfil this transaction or not.
If it can’t, the coordinator gives up on the whole transaction. If it can, it sends out a message indicating that the transaction can indeed be committed.
2PC is a blocking protocol. This means that if a participant is down, the whole transaction is blocked until it comes back up. This is a pretty big drawback because it means that the whole system is only as reliable as the least reliable participant.
2PC is also a synchronous protocol. This means that the coordinator has to wait for a response from all participants before it can move on to the next phase. This can be a problem if the participants are geographically distributed and the latency between them is high.
2PC also has a single point of failure because if the coordinator ever goes down at any point, the transaction will either fail or even worse the transaction might go into an inconsistent state.
So, that’s 2PC in a nutshell. A really simple protocol which actually has massive adoption because of the ease of implementation.
There is a better version of this called 3 phase commit which I’ll explore in another article.
If you want to find the full version of the code for 2PC, look here
I’ve recently been reading Database Internals by Alex Petrov, and I came across a section titled Reusable Infrastructure For Linearizability (RIFL).
Now, the book itself doesn’t go into much detail about the implementation of RIFL so I got down to reading the paper and an article on another blog about it.
I want to explain this concept in this article because this helps me understand things better.
Imagine you have a client and a server. The server can represent anything, a database or a key-value store etc. It doesn’t really matter.
The client makes a request to write something to the server. The server receives the request, processes it and delivers a successful response. However, before the client can receive the response and send an acknowledgement, it crashes. Now, when the client comes back up again, it is going to re-try the request which will lead to the server writing the value again.
This wouldn’t be a big problem if the server was designed with idempotency in mind but that isn’t always the case. Take the classic example of incrementing a counter, it would be incremented twice for the same request in this case.
This problem seems like a variation of the classic two generals problem, which is a pretty famous problem in distributed computing.
Now, RIFL is designed to make RPC calls linearizable. This means that if the same RPC call is retried multiple times by a client, the operation itself isn’t carried out multiple times.
This helps ensure that the system remains linearizable so that all clients observing it see the system in the same state after a certain point in time.
It offers a way to upgrade ‘at-least-once’ semantics to ‘exactly once’, which is the holy grail in terms of distributed systems communication.
There are three main subsystems in RIFL: the Request Tracker, the Lease Manager and the Result Tracker.
Each of these has a very specific role to play.
The request tracker generates a unique id per request that comes in to the client.
When a new request comes into the client, it generates a unique id for this request and persists it along with some metadata to some form of local storage. It tracks the completion status of this request as part of the metadata.
The lease manager is a little more interesting. It took a little digging to understand exactly what the purpose of this is.
For every request that comes in, a client must contact the Lease Manager and obtain a lease for the resource it wants to access. The identifier for this lease is the unique identifier for the RPC call. The client is expected to renew this lease within some time period. If the lease for a client expires, the server does not service the request.
Here’s a nice article on leases which can explain the concept in more depth.
The result tracker module is meant to provide completion records for each request, sometimes called a Completion Object.
Every time a request is serviced by the server, either successfully or otherwise, it is written to the local persistence store as a Completion Object record. Once the server has sent the response back to the client and received acknowledgement from the client of its receipt, it marks the record for garbage collection, which runs during the compaction process.
The request comes in from an application to a client. Now, the client generates a unique id for this request. In RIFL, identifiers consist of two parts: a 64-bit unique identifier for the client and a 64-bit sequence number allocated by that client.
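Here’s a sketch of that two-part identifier as a plain struct (my own rendering of what the paper describes, not code from RIFL itself):

#include <cstdint>

// A RIFL-style RPC id: which client issued the request, and where the
// request falls in that client's sequence of requests.
struct RpcId {
    uint64_t client_id;        // 64-bit unique identifier for the client
    uint64_t sequence_number;  // 64-bit sequence number allocated by that client
};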
Next, the request must attain a lease from the Lease Manager module. When a server receives a request, it checks if the lease is still valid by contacting the Lease Manager. If the lease is nearing expiration, the Lease Manager checks if the client is still reachable, and if it is, the lease is renewed. If the client is unreachable, the lease is allowed to expire.
Now, assuming the lease has been validated by the server, it is free to carry out the transaction required by the request. It typically does this by following a 2PC protocol. Before running this protocol, it first inserts a completion record for this request indicating the current status of the request. This completion record is identified by the request’s unique identifier.
Assuming, the request is completed successfully and the client acknowledges receipt, the completion record can be marked as completed and garbage collected later.
The classic failure scenario is where the client sends a request which is successfully served by the server, but the client crashes either before it receives the response, or after receiving it but before sending its acknowledgement.
Let’s take a simple example of an e-commerce application. Assume that we are managing the state of a cart. Now, the application issues a request to the client to ADD(eggs, cart) which will add eggs to the cart.
The client issues this request to the server and then crashes before receiving and acknowledging a response.
Now, when the client restarts, it will notice that it has a request locally persisted which has not been completed. It will retry this request with the same request id.
The server receives this request and uses the request id as a lookup to see if it has already been serviced. Assuming it has already been serviced, the server will not perform the operation again but will instead return a successful response to the client.
The client will then update its local store to mark the request as completed and send its acknowledgement to the server. Upon receipt of the acknowledgement, the server will mark the record as acknowledged and mark it for garbage collection.
Reading about RIFL was pretty interesting and I’d highly recommend reading the paper itself.
RIFL attempts to first build a linearizable layer upon which to execute transactions. However, despite the failure scenarios it handles, it depends on the client being reliable enough that it first records and persists the request it is making before issuing it.
And what happens if a client one level up crashes immediately after issuing the request? Then that layer would require linearizability. The paper has a section which addresses this.
There is also a section on performance in the paper that is worth checking out.