In 2003, Google published a paper titled “The Google File System”. This scalable distributed
file system, abbreviated as GFS, uses a cluster of commodity hardware to store
huge amounts of data. The filesystem handles data replication between nodes so that
losing a storage server has no effect on data availability. It is also optimized
for streaming reads, so that data can be read for processing later on.
Shortly afterward, Google published another paper, titled “MapReduce: Simplified
Data Processing on Large Clusters”. MapReduce was the missing piece in the
GFS architecture, since it makes use of the vast number of CPUs each commodity server
in the GFS cluster provides. MapReduce plus GFS forms the backbone for processing
massive amounts of data, including the entire search index Google owns.
What is missing, though, is the ability to access data randomly and in close to real time
(meaning good enough to drive a web service, for example). Another drawback of the
GFS design is that it works well with a few very, very large files, but not as well with
millions of tiny files, because the data the master node retains in memory grows with
the number of files: the more files, the higher the pressure on the
memory of the master.
So, Google was trying to find a solution that could drive interactive applications, such
as Mail or Analytics, while making use of the same infrastructure and relying on GFS
for replication and data availability. The data stored should be composed of much
smaller entities, and the system would transparently take care of aggregating the small
records into very large storage files and offer some sort of indexing that allows the user
to retrieve data with a minimal number of disk seeks. Finally, it should be able to store
the entire web crawl and work with MapReduce to build the entire search index in a
timely manner.
Being aware of the shortcomings of RDBMSes at scale (see “Seek Versus Transfer”
on page 315 for a discussion of one fundamental issue), the engineers approached
this problem differently: forfeit relational features and use a simple API that has basic
create, read, update, and delete (or CRUD) operations, plus a scan function to iterate
over larger key ranges or entire tables. The culmination of these efforts was published
in 2006 in a paper titled “Bigtable: A Distributed Storage System for Structured Data”,
two excerpts from which follow:
Bigtable is a distributed storage system for managing structured data that is designed to
scale to a very large size: petabytes of data across thousands of commodity servers.
…a sparse, distributed, persistent multi-dimensional sorted map.
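To make the two quoted ideas a little more concrete, here is a minimal, single-process sketch of such a sorted map offering the CRUD-plus-scan style of API described above. It is not Bigtable's or HBase's actual client API; the class and method names (ToySortedTable, put, get, delete, scan) are made up purely for illustration, everything lives in memory, and there is no distribution, persistence, column-family handling, or failure recovery.

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy sketch of a sparse, multi-dimensional sorted map keyed by
// (row, column, timestamp). Illustrative only; not HBase's client API.
public class ToySortedTable {

    // row -> column -> timestamp -> value; only cells that exist take space (sparse).
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    // Create/update: write a new version of a cell under the given timestamp.
    public void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Read: return the newest version of a cell, or null if it does not exist.
    public byte[] get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) return null;
        NavigableMap<Long, byte[]> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.lastEntry().getValue();
    }

    // Delete: remove an entire row and all of its cells.
    public void delete(String row) {
        rows.remove(row);
    }

    // Scan: iterate rows in sorted key order within [startRow, stopRow).
    public Iterable<Map.Entry<String, NavigableMap<String, NavigableMap<Long, byte[]>>>> scan(
            String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false).entrySet();
    }
}

Keeping the rows in sorted key order is what makes the scan over a key range cheap; on disk, the same property is what lets the real systems answer range queries with a minimal number of seeks.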
It is highly recommended that everyone interested in HBase read that paper. It describes
a lot of the reasoning behind the design of Bigtable and, ultimately, HBase. We will, however,
go through the basic concepts, since they apply directly to the rest of this book.
HBase implements the Bigtable storage architecture very faithfully, so we can
explain everything using HBase. Appendix F provides an overview of where the two
systems differ.
From pages 16–17 of HBase: The Definitive Guide, Chapter 1, “Building Blocks”, “Backdrop”.