Jan. 10, 2022
Introduction
It is hard for me to write a very good article about Bigtable, so I like to copy the lecture note and then read word by word. I will write down what I learn each time I read the lecture notes.
Here is the link of lecture notes.
Bigtable
A NoSQL wide-column single-table database
Paul Krzyzanowski
November 3, 2021
Goal: How can we build an ultra-high performance, low-latency storage service for large-scale structured and semi-structured data?
Introduction
Traditional relational databases present a view of multiple tables, each containing rows and named columns. Queries, mostly performed in SQL (Structured Query Language) allow one to extract specific columns from a row where certain conditions are met (e.g., a column has a specific value). Moreover, one can perform queries across multiple tables (this is the “relational” part of a relational database). For example, a table of students may include a student’s name, ID number, and contact information. A table of grades may include a student’s ID number, course number, and grade. We can construct a query that extracts a grades by name by searching for the ID number in the student table and then matching that ID number in the grade table.
With traditional relational databases, we expect ACID guarantees: that transactions will be atomic, consistent, isolated, and durable. This includes operations that access or modify multiple fields, multiple rows, and multiple tables. The CAP theorem proved that is not possible to guarantee consistency while providing high availability and network partition tolerance. Partitions are unavailable in distributed systems, so the design choice is between high availability and consistency. ACID databases choose consistency, which usually involves locking and waiting: a transaction needs to update all replicas and cannot have some concurrent transactions work with old data while others access new data. In distributed architectures, we often choose to give up consistency in order to provide high availability. This makes ACID databases unattractive for highly distributed environments and led to the emergence of alternate data stores that are targeted to high availability and high performance. These data stores are often referred to as NoSQL databases.
Another aspect of conventional database systems is that they often do not do well with tables containing huge amounts of columns or even fields that contain huge amounts of data. Adding additional fields to a column (hence changing the schema of the database) can be a time-consuming task.
Here, we will look at the structure and capabilities of Bigtable. It is not a relational database; it is just a table but one that is designed to support efficient lookups and handle data on a huge scale. It is also designed as a wide-column store. This means that each row of a table can support a huge number of columns and the specific column names may vary from row to row.
Bigtable
Bigtable is a distributed storage system that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second.
Bigtable was developed at Google in has been in use since 2005 in dozens of Google services. An open source version, HBase, was created by the Apache project on top of the Hadoop core (using the Hadoop Distributed File System, HDFS, instead of GFS and using Apache Zookeeper instead of Google’s Chubby).
Bigtable is designed with semi-structured data storage in mind. It is a large map where every item is indexed by a row key, column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Clients can look up a row key and then iterate over all of its columns and over versions of data wihtin each column. Every read or write of data to a row is atomic, regardless of how many diferent columns are read or written within that row.
It is easy enough to picture a simple table but let us examine a few characteristics of Bigtable and what makes it special:
- map
- A map is an associative array; a data structure that allows one to look up a value to a corresponding key quickly. Bigtable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
- persistant
- The data is stored persistently on disk.
- distributed
- Bigtable’s data is distributed among many independent machines. At Google, Bigtable was built on top of GFS (Google File System)1. The Apache open source version of Bigtable, HBase, is built on top of HDFS (Hadoop Distributed File System) or Amazon S3. The table is broken up among along rows, so a sequence of adjacent rows will be managed by the same server. A row itself is never distributed.
- sparse
- The table is sparse, meaning that different rows in a table may use vastly different columns (there could be millions), with many – or even most – of the columns empty for a particular row.
- sorted
- In most databases or object stores, data is not sorted. A key is hashed to find its a position in a table. Bigtable, on the other hand, sorts its data by keys. This helps keep related data close together, usually on the same machine — assuming that one structures keys in such a way that sorting brings the data together. For example, if domain names are used as keys in a Bigtable, it makes sense to store them in reverse order to ensure that related domains are close together. For example:
- edu.rutgers.cs
- edu.rugtgers.nb
- edu.rutgers.www
users:homer
or sysinfo:cpu
. Since you can have an unlimited number of columns that can be added dynamically, Bigtable provides iterators to allow clients to discover and iterate over all columns within a column family.Columns and column families
Let’s look at a sample slice of a table that stores web pages (this example is from Google’s paper on Bigtable). The row key is the page URL. For example, com.cnn.www
.
Various attributes of the page are stored in column families. A contents
column family contains page contents (there are no columns within this column family). A language
column family contains the language identifier for the page. Finally, an anchor
column family contains the text of various anchors from other web pages. The column name is the URL of the page making the reference. These three column families underscore a few points.
A column may be a single short value, as seen in the language
column family. This is our classic database view of columns. In Bigtable, however, there is no type associated with the column. It is just a bunch of bytes.
The data in a column family may also be large, as in the contents
column family.
The anchor
column family illustrates the extra hierarchy created by having columns within a column family. It also illustrates the fact that columns can be created dynamically (one for each external anchor), unlike column families.
Finally, it illustrates the sparse aspect of Bigtable. In this example, the list of columns within the anchor
column family will likely vary tremendously for each URL. Each column name within anchor
is the name of the URL that contains a link to the URL indicated by the row key. The value of the column is the text on the page that contains the link. For example, cnnsi.com
contains a link with the text CNN
that links to www.cnn.com
. The interesting feature of a wide-column store such as Bigtable is that the name of the column is itself a form of a value. A client can iterate through all the column names within anchor
to get a list of URLs that contain links to a specific web page (the row key).
In all, we may have a huge number (e.g., hundreds of thousands or millions) of columns overall but the column family within each row will often have only a tiny fraction of them populated. While the number of column families will typically be small in a table (at most hundreds), the number of columns is unlimited.
Rows and partitioning
A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable. Because the table is always sorted alphabetically by row, reads of short ranges of rows are efficient: one typically communicates with one or a small number of machines. Hence, a key to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in reverse order).
What is a tablet? A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable.
- A set of rows
- A set of consecutive rows
- A set of consecutive rows of a table
- unit of distribution - What is the unit of distribution?
- load balancing
- load balancing within Bigtable
- What is a tablet? Think about, and talk about rows, a set of rows - consecutive rows
- Read again
Timestamps
Each column family cell can contain multiple versions of content. For example, in the earlier example, we may have several timestamped versions of page contents associated with a URL. Each version is identified by a 64-bit timestamp that either represents real time or is a value assigned by the client. Reading column data retrieves the most recent version if no timestamp is specified or the latest version that is earlier than a specified timestamp.
A table is configured with per-column-family settings for garbage collection of old versions. A column family can be defined to keep only the latest n versions or to keep only the versions written since some time t.
Implementation
Bigtable comprises a client library (linked with the user’s code), a master server that coordinates activity, and many tablet servers. Tablet servers can be added or removed dynamically. If the master dies, another master can take over.
The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation). As such, a tablet server is responsible for tablets but the tablets do not necessarily live on that node since any node can access any data within GFS.
Each tablet server manages a set of tablets (typically 10–1,000 tablets per server). It handles read/write requests to the tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google’s SSTable, which is a persistent, ordered, immutable map from keys to values. Rows of data are always kept sorted by the row key.
Study notes:
- Tablet server
- Each tablet server manages a set of tablets
- how many tablets - 10 - 1,000 tablets per server
- SSTable - internal file format for storing data - MemTable and SSTable
- SSTable - a persistent, ordered, immutable map from keys to values
- Rows of data are always kept sorted by the row key
Bigtable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains software for scheduling jobs, monitoring health, and dealing with failures.
Chubby
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. It uses a Paxos distributed consensus algorithm to keep the replicas consistently synchronized. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.
In Bigtable, Chubby is used to:
- ensure there is only one active master
- store the bootstrap location of Bigtable data
- discover tablet servers
- store Bigtable schema information
- store access control lists
Startup and growth
A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a table is split at around 100 to 200 MB.
Locating rows within a Bigtable is managed in a three-level hierarchy. The root (top-level) tablet stores the location of all Metadata tablets in a special Metadata table. This root tablet is simply the first tablet in the set of tablets that comprise the Metadata table. Each Metadata table contains the location of user data tablets. This table is keyed by node IDs and each row identifies a tablet’s table ID and end row. For efficiency, the client library caches tablet locations.
A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this directory to discover new tablet servers.
When the master starts, it:
- Grabs a unique master lock in Chubby (to prevent multiple masters from starting)
- Scans the servers directory in Chubby to find live tablet servers
- Communicates with each tablet server to discover what tablets are assigned to each server. This is important because the master might be recovering for a failed master and tablets have already been allocated to tablet servers.
- Scans the Metadata table to learn the full set of tablets
- Builds a set of unassigned tablet servers. These are eligible for tablet assignment and the master will choosing a tablet server and send it a tablet load request.
Fault tolerance and replication
Some of the fault tolerance for Bigtable is provided by Google and Chubby. GFS, for example, provides configurable levels of replication of file data and Chubby’s cell of replicated servers minimizes its downtime.
A master is responsible for detecting when a specific tablet server is not functioning. It does this by asking the tablet server for status of its lock (recall that Chubby grants locks). If the tablet server cannot be reached or has lost its lock, the master attempts to grab that server’s lock. If it succeeds, then it surmises the tablet server is dead or cannot contact Chubby. In this case, the master moves the tablets that were previously assigned to that server into an unassigned state.
When a master’s Chubby lease expires, it kills itself. This does not change the assignment of tablets to servers, however. Google’s cluster management system periodically checks for the liveness of a master. If it detects a non-responding master, it starts one up, which grabs a lock from Chubby. The new master contacts Chubby to find all the live servers goes through the startup phase described earlier.
A Bigtable can be configured for replication onto multiple Bigtable clusters in different data centers to ensure availability. Data propagation is asynchronous and results an eventually consistent model.
References
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, Google, Inc. OSDI 2006: The definitive paper on Bigtable.
Google, Overview of Bigtable, Google Cloud Documentation. This describes the current Google Cloud offering of Bigtable. Note that GFS has been replaced with Colossus (a newer distributed file system) and tablet servers are now called Bigtable nodes.
Ilya Grigorik, SSTable and Log Structured Storage: LevelDB, igvita.com, February 26, 2012: _a description of the SSTable (Sorted String Table) used in Bigtable.
Cloud Bigtable: A publicly-available version of Bigtable, part of the Google Cloud Platform
Google, Bigtable for Cassandra users, Google Cloud – Cloud Architecture Center.
Robin Harris, Google’s Bigtable Distributed Storage System](http://storagemojo.com/2006/09/07/googles-bigtable-distributed-storage-system-pt-i/), StorageMojo.com
Apache HBase: An open-source project based on the design of Bigtable.
Understanding HBase and Bigtable, Jumoojw.com
This is an updated version to one that was originally published in November 2011.
No comments:
Post a Comment