Friday, September 17, 2021

Cassandra The definitive guide | Eben Hewit | My three hours reading time

Sept. 17, 2021

Introduction

It will take me at least three hours to learn a few things and start to love reading the book. The book called Cassandra The definitive guide will be my favorite things next few days. I just could not believe that I spent average two hours a day last week to study stocks, I wish that I can study more on this book. 

First hour | 10 new concepts | Reading is so powerful for me to learn 

Chapter 1, Introducing Cassandra

This chapter introduces Cassandra and discusses what’s exciting and different about it, who is using it, and what its advantages are.

Chapter 2, Installing Cassandra

This chapter walks you through installing Cassandra on a variety of platforms.

Chapter 3, The Cassandra Data Model

Here we look at Cassandra’s data model to understand what columns, super columns, and rows are. Special care is taken to bridge the gap between the relational database world and Cassandra’s world.

Chapter 4, Sample Application

This chapter presents a complete working application that translates from a relational model in a well-understood domain to Cassandra’s data model.

Chapter 5, The Cassandra Architecture

This chapter helps you understand what happens during read and write operations and how the database accomplishes some of its notable aspects, such as durability and high availability. We go under the hood to understand some of the more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees, and more.

Chapter 6, Configuring Cassandra

This chapter shows you how to specify partitioners, replica placement strategies, and snitches. We set up a cluster and see the implications of different configuration choices.

Chapter 7, Reading and Writing Data

This is the moment we’ve been waiting for. We present an overview of what’s different about Cassandra’s model for querying and updating data, and then get to work using the API.

Chapter 8, Clients

There are a variety of clients that third-party developers have created for many different languages, including Java, C#, Ruby, and Python, in order to abstract Cassandra’s lower-level API. We help you understand this landscape so you can choose one that’s right for you.

Chapter 9, Monitoring

Once your cluster is up and running, you’ll want to monitor its usage, memory patterns, and thread patterns, and understand its general activity. Cassandra has a rich Java Management Extensions (JMX) interface baked in, which we put to use to monitor all of these and more.

Chapter 10, Maintenance

The ongoing maintenance of a Cassandra cluster is made somewhat easier by some tools that ship with the server. We see how to decommission a node, load-balance the cluster, get statistics, and perform other routine operational tasks. 

Chapter 11, Performance Tuning

One of Cassandra’s most notable features is its speed—it’s very fast. But there are a number of things, including memory settings, data storage, hardware choices, caching, and buffer sizes, that you can tune to squeeze out even more performance.

Chapter 12, Integrating Hadoop

In this chapter, written by Jeremy Hanna, we put Cassandra in a larger context and see how to integrate it with the popular implementation of Google’s Map/Reduce algorithm, Hadoop.

Appendix

Many new databases have cropped up in response to the need to scale at Big Data levels, or to take advantage of a “schema-free” model, or to support more recent initiatives such as the Semantic Web. Here we contextualize Cassandra against a variety of the more popular nonrelational databases, examining document oriented databases, distributed hashtables, and graph databases, to better understand Cassandra’s offerings.

Glossary

It can be difficult to understand something that’s really new, and Cassandra has many terms that might be unfamiliar to developers or DBAs coming from the relational application development world, so I’ve included this glossary to make it easier to read the rest of the book. If you’re stuck on a certain concept, you can flip to the glossary to help clarify things such as Merkle trees, vector clocks, hinted handoffs, read repairs, and other exotic terms.


ACID is an acronym for Atomic, Consistent, Isolated, Durable, which are the gauges we can use to assess that a transaction has executed properly and that it was successful:

Atomic

Atomic means “all or nothing”; that is, when a statement is executed, every update within the transaction must succeed in order to be called successful. There is no partial failure where one update was successful and another related update failed. The common example here is with monetary transfers at an ATM: the transfer requires subtracting money from one account and adding it to another account. This operation cannot be subdivided; they must both succeed.

Consistent

Consistent means that data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together. For example, if a transaction attempts to delete a Customer and her Order history, it cannot leave Order rows that reference the deleted customer’s primary key; this is an inconsistent state that would cause errors if someone tried to read those Order records.

Isolated

Isolated means that transactions executing concurrently will not become entangled with each other; they each execute in their own space. That is, if two different transactions attempt to modify the same data at the same time, then one of them will have to wait for the other to complete.

Durable

Once a transaction has succeeded, the changes will not be lost. This doesn’t imply another transaction won’t later modify the same data; it just means that writers can be confident that the changes are available for the next transaction to work with as necessary.

Cassandra in 50 Words or Less

“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.” That’s exactly 50 words.

Page 15

Once you start to scale many other data stores (MySQL, Bigtable), some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves. Cassandra, however, is decentralized, meaning that every node is identical; no Cassandra node performs certain organizing operations distinct from any other node. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.

The fact that Cassandra is decentralized means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same. This is sometimes referred to as “server symmetry.” Because they are all doing the same thing, by definition there can’t be a special host that is coordinating activities, as with the master/slave setup that you see in MySQL, Bigtable, and so many others.

Page 17 

Eventual consistency is one of several consistency models available to architects. Let’s take a look at these models so we can understand the trade-offs:

Strict consistency

This is sometimes called sequential consistency, and is the most stringent level of consistency. It requires that any read will always return the most recently written value. That sounds perfect, and it’s exactly what I’m looking for. I’ll take it! However, upon closer examination, what do we find? What precisely is meant by “most recently written”? Most recently to whom? In one single-processor machine, this is no problem to observe, as the sequence of operations is known to the one clock.

But in a system executing across a variety of geographically dispersed data centers, it becomes much more slippery. Achieving this implies some sort of global clock that is capable of timestamping all operations, regardless of the location of the data or the user requesting it or how many (possibly disparate) services are required to determine the response.

Causal consistency

This is a slightly weaker form of strict consistency. It does away with the fantasy of the single global clock that can magically synchronize all operations without creating an unbearable bottleneck. Instead of relying on timestamps, causal consistency instead takes a more semantic approach, attempting to determine the cause of events to create some consistency in their order. It means that writes that are potentially related must be read in sequence. If two different, unrelated operations suddenly write to the same field, then those writes are inferred not to be causally related. But if one write occurs after another, we might infer that they are causally related. Causal consistency dictates that causal writes must be read in sequence.

Weak (eventual) consistency

Eventual consistency means on the surface that all updates will propagate throughout all of the replicas in a distributed system, but that this may take some time. Eventually, all replicas will be consistent.



No comments:

Post a Comment