Tuesday, June 8, 2021

Three papers to read | GFS | MapReduce | Bigtable

In 2003, Google published a paper titled "The Google File System". This scalable distributed file system, abbreviated as GFS, uses a cluster of commodity hardware to store huge amounts of data. The file system handles data replication between nodes so that losing a storage server has no effect on data availability. It is also optimized for streaming reads, so that data can be read for processing later on.

Shortly afterward, Google published another paper, titled "MapReduce: Simplified Data Processing on Large Clusters". MapReduce was the missing piece of the GFS architecture: it makes use of the vast number of CPUs that each commodity server in the GFS cluster provides. MapReduce plus GFS forms the backbone for processing massive amounts of data, including Google's entire search index.
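To make the programming model concrete: a MapReduce job is expressed as two user-supplied functions, map and reduce, while the framework takes care of splitting the input, scheduling work across the cluster, and shuffling intermediate results. Below is a minimal word-count sketch written against the open-source Hadoop MapReduce API (the open-source counterpart of Google's internal system; class names such as WordCount and TokenMapper are just illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: emit (word, 1) for every word in the input split
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce: sum the counts emitted for each word
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on the distributed file system
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory for the reduced counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}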

What is missing, though, is the ability to access data randomly and in close to real time (meaning good enough to drive a web service, for example). Another drawback of the GFS design is that it is good with a few very, very large files, but not as good with millions of tiny files, because the master node keeps all file metadata in memory, so its memory use ultimately grows with the number of files. The more files, the higher the pressure on the memory of the master.

So, Google was trying to find a solution that could drive interactive applications, such as Mail or Analytics, while making use of the same infrastructure and relying on GFS for replication and data availability. The data stored should be composed of much smaller entities, and the system would transparently take care of aggregating the small records into very large storage files and offer some sort of indexing that allows the user to retrieve data with a minimal number of disk seeks. Finally, it should be able to store the entire web crawl and work with MapReduce to build the entire search index in a timely manner.

Being aware of the shortcomings of RDBMSes at scale (see "Seek Versus Transfer" on page 315 for a discussion of one fundamental issue), the engineers approached this problem differently: forfeit relational features and use a simple API that has basic create, read, update, and delete (or CRUD) operations, plus a scan function to iterate over larger key ranges or entire tables. The culmination of these efforts was published in 2006 in a paper titled "Bigtable: A Distributed Storage System for Structured Data", two excerpts from which follow:

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

…a sparse, distributed, persistent multi-dimensional sorted map.
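That last phrase can be read almost literally: setting aside the distributed and persistent parts, the logical data model is a set of nested sorted maps, keyed by row key, then column, then timestamp. Here is a rough, purely in-memory sketch of that logical structure in Java (the row and column names are made up; this is not how Bigtable or HBase is implemented, only how the data model looks to a client):

import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SortedMapModel {

  // row key -> ( column "family:qualifier" -> ( timestamp, newest first -> value ) )
  // The map is sparse: a row only holds the columns that actually have values.
  private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> table =
      new TreeMap<>();

  public void put(String row, String column, long timestamp, byte[] value) {
    table.computeIfAbsent(row, r -> new TreeMap<>())
         .computeIfAbsent(column, c -> new TreeMap<>(Collections.<Long>reverseOrder()))
         .put(timestamp, value);
  }

  // Return the newest version of a cell, or null if the cell does not exist.
  public byte[] getLatest(String row, String column) {
    NavigableMap<String, NavigableMap<Long, byte[]>> columns = table.get(row);
    if (columns == null) return null;
    NavigableMap<Long, byte[]> versions = columns.get(column);
    return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
  }

  public static void main(String[] args) {
    SortedMapModel t = new SortedMapModel();
    t.put("com.example.www", "contents:html", System.currentTimeMillis(), "<html>...</html>".getBytes());
    System.out.println(new String(t.getLatest("com.example.www", "contents:html")));
  }
}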

It is highly recommended that everyone interested in HBase read that paper. It describes a lot of the reasoning behind the design of Bigtable and, ultimately, HBase. We will, however, go through the basic concepts, since they apply directly to the rest of this book. HBase implements the Bigtable storage architecture very faithfully, so we can explain everything using HBase. Appendix F provides an overview of where the two systems differ.
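Because HBase follows Bigtable so closely, the simple CRUD-plus-scan API described above maps directly onto the HBase client API. Here is a rough sketch using the current Java client (the table name "webtable" and column family "contents" are made-up examples, the client API has changed in detail since the edition quoted here, and error handling is omitted):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudAndScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("webtable"))) {

      // create/update: both are a Put keyed by row, column family, and qualifier
      Put put = new Put(Bytes.toBytes("com.example.www"));
      put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
      table.put(put);

      // read: a Get fetches a single row, optionally narrowed to specific columns
      Result result = table.get(new Get(Bytes.toBytes("com.example.www")));
      byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
      System.out.println(Bytes.toString(html));

      // scan: iterate over a row-key range in sorted order
      Scan scan = new Scan().withStartRow(Bytes.toBytes("com.")).withStopRow(Bytes.toBytes("com/"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }

      // delete: remove a whole row (or individual columns/versions)
      table.delete(new Delete(Bytes.toBytes("com.example.www")));
    }
  }
}

Note how every operation is addressed by row key, which is exactly the sorted-map access pattern from the Bigtable quote above.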

From pages 16-17 of the book HBase: The Definitive Guide -> Chapter 1, Building Blocks, Backdrop


