Wednesday, August 12, 2020

Cassandra: Distributed key-value store optimized for write-heavy workloads

Here is the article.

I like to quickly go over the content written by Ame, and then write down my highlights here:

  1. Cassandra is a popular distributed key-value store, built initially at Facebook on commodity servers to let users search through their inbox messages.
  2. While TAO, which I covered here, was optimized for reads, Cassandra is optimized for write-heavy workloads while maintaining good read performance.
  3. Google's BigTable vs. Cassandra: what do I remember?
  4. Data model: a multi-dimensional key-value map.

Cassandra is a popular distributed key-value store, built initially at Facebook on commodity servers to let users search through their inbox messages. While TAO, which I covered here, was optimized for reads, Cassandra is optimized for write-heavy workloads while maintaining good read performance. Some of the key design objectives of Cassandra seem to be:

  1. While writes arrive on the order of billions and are handled efficiently with append-only logs, reads can be very geographically diverse. Hence cross-datacenter replication is necessary and important for good read efficiency.
  2. Scaling the platform by adding commodity servers as more users join.
  3. Fault tolerance should be the norm.
  4. Giving database clients a simple interface and more control over schema, availability, performance, and so on.

Google's BigTable serves a somewhat similar overall purpose. The key difference is that BigTable was built on top of GFS, which provides durability of data; in Cassandra, durability is built in via explicit replication mechanisms.

Data Model

Cassandra, like BigTable, provides a fairly simple data model. Keys are small, typically tens of bytes, in whatever structured format the client likes. Each key maps to sets of columns grouped into column families. This way, the Cassandra data model can be viewed as a multi-dimensional key-value map.

In Cassandra, any row mutation is atomic: when updating a key and its corresponding columns/column families, either all of the data is written or none of it is. Columns can be sorted by time, e.g. so the latest messages appear at the top.
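
To make the multi-dimensional map view concrete, here is a minimal Python sketch. The layout and names (apply_row_mutation, the user:42 key) are made up for illustration, and a real row mutation is made atomic by Cassandra's storage engine, not by a dict update:

# The model: table -> row key -> column family -> column -> value.
inbox = {}

def apply_row_mutation(table, key, family, columns):
    # Models the all-or-nothing row mutation: all of these columns
    # become visible for the row together, or not at all.
    row = table.setdefault(key, {})
    row.setdefault(family, {}).update(columns)

apply_row_mutation(inbox, "user:42", "messages", {1597220000: "hi there"})
apply_row_mutation(inbox, "user:42", "messages", {1597220100: "lunch?"})

# Columns kept sorted by timestamp, latest first:
for ts, body in sorted(inbox["user:42"]["messages"].items(), reverse=True):
    print(ts, body)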

The simple (and fairly self-explanatory) API for accessing Cassandra is:

insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
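
A toy in-memory realization of these three calls, just to show the shapes of the arguments; the family:column addressing convention here is my own assumption:

tables = {}

def insert(table, key, row_mutation):
    # row_mutation maps column families to {column: value} dicts.
    row = tables.setdefault(table, {}).setdefault(key, {})
    for family, columns in row_mutation.items():
        row.setdefault(family, {}).update(columns)

def get(table, key, column_name):
    family, _, column = column_name.partition(":")
    return tables.get(table, {}).get(key, {}).get(family, {}).get(column)

def delete(table, key, column_name):
    family, _, column = column_name.partition(":")
    tables.get(table, {}).get(key, {}).get(family, {}).pop(column, None)

insert("Inbox", "user:42", {"messages": {"msg1": "hi there"}})
print(get("Inbox", "user:42", "messages:msg1"))   # -> hi there
delete("Inbox", "user:42", "messages:msg1")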

Request Flow for a Key

In Cassandra, for both reads and writes, an end-user request can land on any node. That node knows which replicas "own" the key's range, and it routes the request to those replicas.

For a write operation, replication is important for the durability of the data. Hence the node where the request initially landed waits for a quorum of responses from the replicas before acknowledging the write.

For reads, this requirement can be relaxed depending on the needs of the application. A client app that is not too worried about strong consistency can be answered as soon as the first replica responds; a stricter one can wait for a quorum to be established.
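
A rough sketch of this coordinator behavior, assuming a hypothetical Replica stub in place of remote nodes (the thread-pool fan-out is illustrative, not Cassandra's actual RPC machinery):

import concurrent.futures

class Replica:
    # Stand-in for a remote replica node.
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        return "ack"

    def read(self, key):
        return self.store.get(key)

def coordinate_write(replicas, key, value, quorum):
    # Fan the write out to all replicas; succeed once `quorum` acks arrive.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.write, key, value) for r in replicas]
        acks = 0
        for f in concurrent.futures.as_completed(futures):
            acks += (f.result() == "ack")
            if acks >= quorum:
                return True
    return False

def coordinate_read(replicas, key, responses_needed=1):
    # responses_needed=1 is the relaxed mode: answer on the first response.
    # Use len(replicas) // 2 + 1 to wait for a quorum instead.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.read, key) for r in replicas]
        seen = []
        for f in concurrent.futures.as_completed(futures):
            seen.append(f.result())
            if len(seen) >= responses_needed:
                return seen[-1]

replicas = [Replica() for _ in range(3)]
coordinate_write(replicas, "user:42", "hello", quorum=2)
print(coordinate_read(replicas, "user:42"))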

Architecture

Any large distributed storage system needs to worry about persistence/durability of data, fault tolerance, partitioning of the data, addition and removal of nodes, and scaling. Let's talk through some of the important aspects. One key fact to keep in mind is that all nodes in Cassandra are aware of each other via a gossip-based protocol.
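
As a rough illustration of the gossip idea (the heartbeat counters and peer exchange here are generic gossip mechanics, not Cassandra's exact protocol):

import random

class Node:
    # Each node tracks a heartbeat counter for every peer it has heard of.
    def __init__(self, name):
        self.name = name
        self.state = {name: 0}
        self.peers = []

    def tick(self):
        self.state[self.name] += 1
        self.gossip_with(random.choice(self.peers))

    def gossip_with(self, other):
        # Both sides keep the highest heartbeat seen for each node.
        merged = {}
        for n in set(self.state) | set(other.state):
            merged[n] = max(self.state.get(n, 0), other.state.get(n, 0))
        self.state = dict(merged)
        other.state = dict(merged)

a, b, c = Node("a"), Node("b"), Node("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
for _ in range(3):
    for node in (a, b, c):
        node.tick()
print(sorted(a.state))   # ['a', 'b', 'c'] - a has learned of everyone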

Partitioning — Mapping keys to Nodes/Servers

Cassandra needs to scale by adding more servers and also needs to adjust to node failures without compromising performance. Hence Cassandra uses consistent hashing to map keys to servers/nodes. The advantage of consistent hashing is that when a node is added to or removed from the system, the existing key-server mappings are largely NOT impacted, unlike in traditional hashing schemes. Only the neighboring nodes are affected, and keys are redistributed among those neighbors.

In this scheme, each node is assigned a key range. Assume the entire key range is mapped onto the circumference of a circle. When a key arrives, it is mapped to some point on the circumference, and the algorithm walks clockwise until it finds the next node mapped onto the circle. That node is the coordinator for this key. If the node goes down for some reason, the next clockwise node on the circumference becomes the coordinator for the related key range. Since different nodes on the circle will carry different load, Cassandra allows lightly loaded nodes to move closer to heavily loaded ones.
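
A bare-bones sketch of such a ring; the hash function and class names are my own, not Cassandra's:

import bisect
import hashlib

def ring_hash(value):
    # Map any string onto a fixed circular keyspace [0, 2**32).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class Ring:
    def __init__(self):
        self.points = []   # sorted node positions on the circle
        self.nodes = {}    # position -> node name

    def add_node(self, name):
        pos = ring_hash(name)
        bisect.insort(self.points, pos)
        self.nodes[pos] = name

    def remove_node(self, name):
        pos = ring_hash(name)
        self.points.remove(pos)
        del self.nodes[pos]

    def coordinator(self, key):
        # Walk clockwise: the first node at or after the key's position,
        # wrapping around to the start of the circle if necessary.
        i = bisect.bisect_left(self.points, ring_hash(key)) % len(self.points)
        return self.nodes[self.points[i]]

ring = Ring()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)
owner = ring.coordinator("user:42")
ring.remove_node(owner)                # only the neighbor's range shifts
print(ring.coordinator("user:42"))     # next clockwise node takes over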
