Oct. 29, 2021
Here is the link.
2.3 Architecture
A GFS cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients, as shown
in Figure 1. Each of these is typically a commodity Linux
machine running a user-level server process. It is easy to run
both a chunkserver and a client on the same machine, as long
as machine resources permit and the lower reliability caused
by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. Each chunk is
identified by an immutable and globally unique 64 bit chunk
handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and
read or write chunk data specified by a chunk handle and
byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though
users can designate different replication levels for different
regions of the file namespace.
My notes:
- Each chunk has an immutable, globally unique 64-bit chunk handle, assigned by the master at chunk creation.
- Each chunk is replicated on multiple chunkservers.
- By default, three replicas are stored; users can designate different replication levels for different regions of the file namespace.
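To make this concrete, here is a minimal Go sketch of a chunkserver storing each chunk replica as a plain local file keyed by its 64-bit handle and serving reads by (chunk handle, byte range), as Section 2.3 describes. The directory layout, file naming, and function signatures are my own assumptions for illustration, not anything the paper specifies.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// ChunkHandle is the immutable, globally unique 64-bit identifier
// assigned by the master when the chunk is created.
type ChunkHandle uint64

// Chunkserver stores each chunk replica as a plain Linux file on local disk.
type Chunkserver struct {
    dataDir string
}

// chunkPath maps a handle to a local file name (hypothetical naming scheme).
func (cs *Chunkserver) chunkPath(h ChunkHandle) string {
    return filepath.Join(cs.dataDir, fmt.Sprintf("%016x.chunk", uint64(h)))
}

// ReadChunk serves chunk data specified by a chunk handle and byte range.
func (cs *Chunkserver) ReadChunk(h ChunkHandle, offset int64, length int) ([]byte, error) {
    f, err := os.Open(cs.chunkPath(h))
    if err != nil {
        return nil, err
    }
    defer f.Close()

    buf := make([]byte, length)
    n, err := f.ReadAt(buf, offset)
    if err != nil && n == 0 {
        return nil, err
    }
    return buf[:n], nil // may be short if the chunk is not full
}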
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease
management, garbage collection of orphaned chunks, and
chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat
messages to give it instructions and collect its state.
My notes:
- HeartBeat messages - master <-> chunkserver
- master maintains all file system metadata - namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
- Also system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
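The paper does not define the HeartBeat message format, so the following is only a rough Go sketch of what such an exchange might carry (instructions one way, chunkserver state the other); all field names are invented.

package main

// HeartBeatRequest is what the master might send to a chunkserver:
// instructions to carry out, e.g. delete a garbage chunk or migrate a replica.
type HeartBeatRequest struct {
    Instructions []string
}

// HeartBeatResponse is what the chunkserver might report back:
// which chunk replicas it currently holds and how much disk is free.
type HeartBeatResponse struct {
    ChunkHandles  []uint64 // 64-bit chunk handles of replicas held by this server
    DiskFreeBytes int64
}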
GFS client code linked into each application implements
the file system API and communicates with the master and
chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to
the chunkservers. We do not provide the POSIX API and
therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data.
Client caches offer little benefit because most applications
stream through huge files or have working sets too large
to be cached. Not having them simplifies the client and
the overall system by eliminating cache coherence issues.
(Clients do cache metadata, however.) Chunkservers need
not cache file data because chunks are stored as local files
and so Linux’s buffer cache already keeps frequently accessed
data in memory.
2.4 Single Master
Having a single master vastly simplifies our design and
enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However,
we must minimize its involvement in reads and writes so
that it does not become a bottleneck. Clients never read
and write file data through the master. Instead, a client asks
the master which chunkservers it should contact. It caches
this information for a limited time and interacts with the
chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client
translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends
the master a request containing the file name and chunk
index. The master replies with the corresponding chunk
handle and locations of the replicas. The client caches this
information using the file name and chunk index as the key.
The client then sends a request to one of the replicas,
most likely the closest one. The request specifies the chunk
handle and a byte range within that chunk. Further reads
of the same chunk require no more client-master interaction
until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the
same request and the master can also include the information for chunks immediately following those requested. This
extra information sidesteps several future client-master interactions at practically no extra cost.
2.5 Chunk Size
Chunk size is one of the key design parameters. We have
chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain
Linux file on a chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal
fragmentation, perhaps the greatest objection against such
a large chunk size.
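The "extended only as needed" point can be shown with a small Go sketch: writing inside a chunk replica grows the local file only as far as data has actually been written, rather than preallocating the full 64 MB. The file naming is the same hypothetical scheme as in the earlier chunkserver sketch.

package main

import (
    "fmt"
    "os"
)

const chunkSize = 64 << 20 // 64 MB

// writeChunkData writes a byte range inside one chunk replica's local file.
// Lazy space allocation: the file is created empty and only grows to
// offset+len(data); it is never preallocated to the full chunk size.
func writeChunkData(path string, offset int64, data []byte) error {
    if offset+int64(len(data)) > chunkSize {
        return fmt.Errorf("write crosses chunk boundary")
    }
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()
    // WriteAt past the current end of file simply extends the file.
    _, err = f.WriteAt(data, offset)
    return err
}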
A large chunk size offers several important advantages.
First, it reduces clients’ need to interact with the master
because reads and writes on the same chunk require only
one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files
sequentially. Even for small random reads, the client can
comfortably cache all the chunk location information for a
multi-TB working set. Second, since on a large chunk, a
client is more likely to perform many operations on a given
chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended
period of time. Third, it reduces the size of the metadata
stored on the master. This allows us to keep the metadata
in memory, which in turn brings other advantages that we
will discuss in Section 2.6.1.
My notes:
A large chunk size offers several important advantages:
- First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information.
- The reduction is especially significant for our workloads because applications mostly read and write large files sequentially.
- Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set (see the back-of-the-envelope sketch after these notes).
- The first point spans three sentences: the first states the advantage, and the next two add detail from different angles (sequential workloads, then small random reads).
- Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time.
- On the second point: a persistent TCP connection is one that is kept open and reused for many requests, rather than being opened and torn down for each operation.
- Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
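A back-of-the-envelope check, in Go, of the multi-TB claim in the first point. The per-entry size is an assumption (the paper gives no figure for a client cache entry), but even a generous guess keeps the cache tiny.

package main

import "fmt"

func main() {
    const chunkSize = 64 << 20        // 64 MB chunks
    const workingSet = int64(4) << 40 // assume a 4 TB working set
    const bytesPerEntry = 256         // assumed size of one cached (handle + replica locations) entry

    chunks := workingSet / chunkSize
    cacheBytes := chunks * bytesPerEntry

    // 4 TB / 64 MB = 65,536 chunks; at 256 bytes each that is only 16 MiB
    // of chunk location information -- easily cached on the client.
    fmt.Printf("chunks: %d, cache: %d MiB\n", chunks, cacheBytes>>20)
}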
On the other hand, a large chunk size, even with lazy space
allocation, has its disadvantages. A small file consists of a
small number of chunks, perhaps just one. The chunkservers
storing those chunks may become hot spots if many clients
are accessing the same file. In practice, hot spots have not
been a major issue because our applications mostly read
large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used
by a batch-queue system: an executable was written to GFS
as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this
executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables
with a higher replication factor and by making the batch-queue system stagger application start times. A potential
long-term solution is to allow clients to read data from other
clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file
and chunk namespaces, the mapping from files to chunks,
and the locations of each chunk’s replicas. All metadata is
kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by
logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using
a log allows us to update the master state simply, reliably,
and without risking inconsistencies in the event of a master
crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its
chunks at master startup and whenever a chunkserver joins
the cluster.
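A Go sketch of these three metadata types and of what is persisted: mutations to the first two are recorded in the operation log (written to local disk and replicated remotely), while chunk locations are rebuilt from the chunkservers rather than logged. The record shape and operation names are assumptions; the paper does not give a log format.

package main

// MasterMetadata holds the three major metadata types, all kept in memory.
type MasterMetadata struct {
    // Persisted via the operation log:
    Namespace    map[string]bool     // file and chunk namespaces (simplified to a set of paths)
    FileToChunks map[string][]uint64 // file name -> ordered chunk handles

    // NOT persisted: rebuilt by asking chunkservers at startup and whenever
    // a chunkserver joins, then kept current via HeartBeat messages.
    ChunkLocations map[uint64][]string // chunk handle -> chunkserver addresses
}

// LogRecord is a hypothetical operation-log entry; only mutations to the
// namespace and the file-to-chunk mapping are ever logged.
type LogRecord struct {
    Op     string   // e.g. "create", "delete", "add_chunk" (names are invented)
    Path   string   // file affected
    Chunks []uint64 // e.g. the new chunk handle(s) for "add_chunk"
}

// apply replays one record against the in-memory state, which is how the
// master recovers its first two metadata types after a crash.
func (m *MasterMetadata) apply(r LogRecord) {
    switch r.Op {
    case "create":
        m.Namespace[r.Path] = true
    case "delete":
        delete(m.Namespace, r.Path)
        delete(m.FileToChunks, r.Path)
    case "add_chunk":
        m.FileToChunks[r.Path] = append(m.FileToChunks[r.Path], r.Chunks...)
    }
}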
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are
fast. Furthermore, it is easy and efficient for the master to
periodically scan through its entire state in the background.
This periodic scanning is used to implement chunk garbage
collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss
these activities further.
One potential concern for this memory-only approach is
that the number of chunks and hence the capacity of the
whole system is limited by how much memory the master
has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB
chunk. Most chunks are full because most files contain many
chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than
64 bytes per file because it stores file names compactly using prefix compression.
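To see why the memory requirement is modest, here is the arithmetic under the paper's figures (less than 64 bytes of metadata per 64 MB chunk); the 1 PB total is just an assumed example size, not a number from the paper.

package main

import "fmt"

func main() {
    const chunkSize = 64 << 20       // 64 MB per chunk
    const perChunkMeta = 64          // upper bound from the paper: < 64 bytes per chunk
    const totalData = int64(1) << 50 // assume 1 PB of file data as an example

    chunks := totalData / chunkSize
    metaBytes := chunks * perChunkMeta

    // 1 PB / 64 MB = 16,777,216 chunks; at 64 bytes each that is about
    // 1 GiB of chunk metadata -- easily held in the master's memory.
    fmt.Printf("chunks: %d, chunk metadata: %d GiB\n", chunks, metaBytes>>30)
}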
If necessary to support even larger file systems, the cost
of adding extra memory to the master is a small price to pay
for the simplicity, reliability, performance, and flexibility we
gain by storing the metadata in memory.
2.6.2 Chunk Locations
The master does not keep a persistent record of which
chunkservers have a replica of a given chunk. It simply polls
chunkservers for that information at startup. The master
can keep itself up-to-date thereafter because it controls all
chunk placement and monitors chunkserver status with regular HeartBeat messages.
Remember four things:
- the master does not keep a persistent record of which chunkservers have a replica of a given chunk
- the master polls chunkservers for that information at startup
- the master monitors chunkserver status with regular HeartBeat messages
- the master controls all chunk placement
We initially attempted to keep chunk location information
persistently at the master, but we decided that it was much
simpler to request the data from chunkservers at startup,
and periodically thereafter. This eliminated the problem of
keeping the master and chunkservers in sync as chunkservers
join and leave the cluster, change names, fail, restart, and
so on. In a cluster with hundreds of servers, these events
happen all too often.
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks
it does or does not have on its own disks. There is no point
in trying to maintain a consistent view of this information
on the master because errors on a chunkserver may cause
chunks to vanish spontaneously (e.g., a disk may go bad
and be disabled) or an operator may rename a chunkserver.
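A Go sketch of how the master might rebuild its chunk-location map by polling chunkservers, as Section 2.6.2 describes for startup and for servers joining the cluster. The reportChunks RPC is a hypothetical stand-in; only the overall flow comes from the paper.

package main

// reportChunks stands in for the RPC by which a chunkserver tells the master
// which chunk replicas it currently holds; name and signature are assumptions.
func reportChunks(addr string) ([]uint64, error) {
    // ... network call elided ...
    return nil, nil
}

// rebuildChunkLocations runs at master startup (and the same exchange happens
// when a chunkserver joins). Nothing is read from the master's disk: the
// chunkservers have the final word over which chunks are on their own disks.
func rebuildChunkLocations(chunkservers []string) map[uint64][]string {
    locations := make(map[uint64][]string) // chunk handle -> chunkserver addresses
    for _, addr := range chunkservers {
        handles, err := reportChunks(addr)
        if err != nil {
            continue // an unreachable server simply contributes no replicas for now
        }
        for _, h := range handles {
            locations[h] = append(locations[h], addr)
        }
    }
    return locations
}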