Oct. 8, 2021
My notes:
- Learn the three architecture options: shared-nothing, shared-disk, and shared-memory
- Definition of shared-nothing? Spend a minute trying to recall it before reading on.
- Definition: A shared-nothing architecture is simply a distributed architecture. This means the server nodes do not share either memory or disk and have their own copy of the operating system, hence they share “nothing” other than the network to communicate with other nodes
- Google the following: SAN (storage area network) and its block-based protocol
- Shared-disk requires disks to be globally accessible by the nodes, which requires a storage area network (SAN) that uses a block-based protocol.
- Block-based protocol: learn three key facts about it (20 minutes)
- SAN (storage area network): get an overview (20 minutes)
- SAN: focus on the block-based protocol (10 minutes)
The Case for Shared Nothing
Ricardo Jimenez-Peris
Shared-nothing has become the dominant parallel architecture for big data systems, such as MapReduce and Spark, analytics platforms, NoSQL databases, and search engines [Özsu & Valduriez 2020]. The reason is simple: it is the only architecture that can provide scalability at reasonable cost, typically within a cluster of servers. In the context of cluster computing, scalability can be further characterized by the terms scale-up versus scale-out. Scale-up (also called vertical scaling) refers to adding more power (processors, memory, IO devices) to a server and is thus limited by the maximum size of the server, e.g., 32 processors. Scale-out (also called horizontal scaling) refers to adding more servers, called “scale-out servers”, in a loosely coupled fashion, so the system can scale almost indefinitely.
In a parallel database system, the real challenge is to scale linearly with increasing workloads (see our blog post on Scalability), including more users and more data. For instance, if you double the size of your cluster, you would expect to support a workload that is twice as big. For an OLAP workload, this may mean halving the response time of a large analytical query, whereas for an OLTP workload it may mean doubling the system throughput (e.g., the number of transactions per minute). Note that these are very different objectives, which can be achieved with different architectures and techniques. The difficulty of implementing basic functions (concurrency control, fault tolerance, availability, database design and tuning, load balancing, etc.) also varies from one architecture to another.
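To make those two objectives concrete, here is a minimal sketch (with made-up numbers and hypothetical function names) of what linear scalability promises: OLTP throughput grows in proportion to the node count, while the latency of a large OLAP query shrinks in proportion:

```python
# Illustrative only: what "linear scalability" would promise.
# All numbers and function names are hypothetical.

def expected_throughput(per_node_tpm: float, nodes: int) -> float:
    """Linear scale-out for OLTP: throughput grows with the node count."""
    return per_node_tpm * nodes

def expected_response_time(single_node_seconds: float, nodes: int) -> float:
    """Linear speed-up for OLAP: a large query's latency shrinks with nodes."""
    return single_node_seconds / nodes

# Doubling a 4-node cluster to 8 nodes:
print(expected_throughput(10_000, 4), expected_throughput(10_000, 8))      # 40000 -> 80000 tx/min
print(expected_response_time(120.0, 4), expected_response_time(120.0, 8))  # 30.0 -> 15.0 s
```

In practice, coordination overhead and data skew keep real systems below these ideal lines, which is precisely what the architectural choices below try to minimize.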
Stonebraker proposed the term “shared-nothing” to better contrast with two other popular architectures, shared-memory and shared-disk. A shared-nothing architecture is simply a distributed architecture: the server nodes share neither memory nor disk and each has its own copy of the operating system, hence they share “nothing” other than the network used to communicate with other nodes [Valduriez 2018]. In the 1980s, shared-nothing was just emerging (with pioneers like Teradata and Tandem's NonStop SQL) and shared-memory was the dominant architecture.
In a shared-memory architecture, any processor (P) has access to any memory module (M) or disk unit through some interconnect, and all the processors are under the control of a single operating system. One major advantage is the simplicity of the programming model, which is based on shared virtual memory. Metadata (e.g., the directory) and control data (e.g., lock tables) can be shared by all processors, which means that writing database software is not very different from writing it for single-processor computers. In particular, load balancing is easy, since it can be achieved at runtime by allocating each new task to the processor with the least load. However, the major problems with shared-memory are limited scalability and availability.
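That runtime load balancing can be sketched in a few lines (a toy illustration with hypothetical names, not any particular system's scheduler): because every processor sees the same memory, a dispatcher can hand each new task to whichever processor currently has the least load:

```python
import heapq

# Sketch of least-loaded dispatch in a shared-memory system: since every
# processor sees the same memory, any task can run anywhere, so the
# dispatcher simply picks the processor with the smallest current load.

class Dispatcher:
    def __init__(self, num_processors: int):
        # Min-heap of (current_load, processor_id) pairs.
        self.heap = [(0, p) for p in range(num_processors)]

    def assign(self, task_cost: int) -> int:
        load, proc = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + task_cost, proc))
        return proc

d = Dispatcher(num_processors=4)
for cost in [5, 3, 8, 2, 7]:
    print(f"task(cost={cost}) -> processor {d.assign(cost)}")
```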
With increasingly fast processors (even with larger caches), conflicting accesses to the shared memory increase rapidly and degrade performance. Furthermore, since the memory space is shared by all processors, a memory fault may affect most of them, thereby hurting data availability. Depending on whether physical memory is shared, two approaches are possible: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). UMA is the architecture of multicore processors, while NUMA is used for tightly coupled multiprocessors. Interestingly, today's servers are NUMA, which introduces many difficulties for database managers that were built for UMA. In particular, since any thread can access memory attached to any CPU, a high fraction of accesses go to remote NUMA memory, saturating the bandwidth to remote memory, which is significantly lower than the bandwidth to local memory. Most database servers simply do not scale up linearly on NUMA, and a lot of work is needed to make them NUMA-aware.
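The usual NUMA-aware remedy is to partition the data so that each worker thread touches only its own partition, keeping memory accesses local. The sketch below shows only that partitioning structure (Python cannot pin memory to NUMA nodes; a real system would allocate each partition on its worker's local memory node, e.g., via libnuma):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the NUMA-aware design principle: partition the data so each
# worker scans only its own partition. On real NUMA hardware each partition
# would live in its worker's local memory, avoiding saturation of the
# slower remote-memory links; here we only illustrate the structure.

def scan_partition(partition: list[int]) -> int:
    # Each worker reads only "its" data, so accesses stay partition-local.
    return sum(partition)

data = list(range(1_000_000))
num_workers = 4
partitions = [data[i::num_workers] for i in range(num_workers)]

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    print(sum(pool.map(scan_partition, partitions)))
```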
In a shared-disk architecture, any processor has access to any disk unit, but exclusive (non-shared) access to its main memory. Each processor-memory node is under the control of its own copy of the operating system and communicates with other nodes through the network. Each node can thus access database pages on the shared disk and cache them in its own memory. Shared-disk requires disks to be globally accessible by the nodes, which requires a storage area network (SAN) that uses a block-based protocol. Since different processors can access the same page in conflicting update modes, global cache consistency is needed. This is typically achieved using a distributed lock manager, which is complex to implement and introduces significant contention. Shared-disk has three main advantages: simple and cheap administration, high availability, and good load balancing. Database administrators do not need to deal with complex data partitioning, and the failure of a node affects only its cached data, while the data on disk remains available to the other nodes. Furthermore, load balancing is easy because any request can be processed by any node. The main disadvantages are cost (because of the SAN) and limited scalability, caused by the potential bottleneck and overhead of cache coherence protocols for large databases. For OLTP workloads, shared-disk has remained the preferred option, as it makes load balancing easy and efficient. However, its scalability is heavily limited by the distributed locking, which causes severe contention beyond a few nodes.
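The toy sketch below (a single-process stand-in with hypothetical names, not a real distributed lock manager) shows why this component becomes a bottleneck: every node must acquire a global per-page lock before updating a cached page, so conflicting updates serialize through the lock manager:

```python
import threading

# Toy global lock manager for a shared-disk system (single-process stand-in
# for what would be a distributed component). Every node goes through it
# before touching a page, which is exactly why it becomes a contention point.

class GlobalLockManager:
    def __init__(self):
        self._guard = threading.Lock()
        self._page_locks: dict[int, threading.Lock] = {}

    def lock_page(self, page_id: int) -> threading.Lock:
        with self._guard:
            lock = self._page_locks.setdefault(page_id, threading.Lock())
        lock.acquire()            # all nodes serialize here, per page
        return lock

lock_mgr = GlobalLockManager()
shared_disk = {7: 0}              # page_id -> value, stands in for the SAN

def node_update():
    lock = lock_mgr.lock_page(7)
    try:
        shared_disk[7] += 1       # read page, update, write back
    finally:
        lock.release()            # other nodes may now revalidate their caches

threads = [threading.Thread(target=node_update) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_disk[7])             # 8
```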
Today, a shared-nothing architecture is cost-effective, as servers can be off-the-shelf components (multicore processor, memory, and disk) connected by a regular Ethernet network or a low-latency network such as InfiniBand or Myrinet. To maximize performance (at additional cost), nodes can also be NUMA multiprocessors, leading to a hierarchical parallel database system [Bouganim et al. 1996]. By favoring the smooth incremental growth of the system through the addition of new nodes, shared-nothing provides excellent scalability, unlike shared-memory and shared-disk. However, it requires careful partitioning of the data across the disk nodes. Furthermore, the addition of new nodes typically requires reorganizing and repartitioning the database to rebalance the load. Finally, fault tolerance is more difficult than with shared-disk, since a failed node makes its data on disk unavailable, which requires data replication. Due to this scalability advantage, shared-nothing was first adopted for OLAP workloads, in particular data warehousing, since read-only queries are easier to parallelize. For the same reason, it has been adopted for big data systems, which are typically read-intensive.
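Here is a minimal sketch (hypothetical code, using plain modulo hashing for simplicity) of the partitioning a shared-nothing system depends on, and of why adding a node forces repartitioning: the key-to-node mapping changes, so data has to move:

```python
import hashlib

# Sketch of hash partitioning in a shared-nothing system. Each key is owned
# by exactly one node; adding a node changes the mapping, which is why the
# database must be reorganized (repartitioned) when the cluster grows.

def owner(key: str, num_nodes: int) -> int:
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"customer:{i}" for i in range(10)]

before = {k: owner(k, 4) for k in keys}   # 4-node cluster
after = {k: owner(k, 5) for k in keys}    # after adding a 5th node

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)}/{len(keys)} keys must move to a new node")
```

With plain modulo hashing, most keys move when the cluster grows; real systems limit this movement with techniques such as consistent hashing or range partitioning with splits.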
A final question is: can shared-nothing also support big write-intensive transactional workloads? NoSQL systems (see our blog post on NoSQL), in particular key-value systems, have excellent horizontal scalability, i.e., scaling over a cluster of nodes. However, since they did not manage to scale transaction management, they gave up transactional consistency, choosing to focus instead on scalability. It is, however, possible to provide both scalable data management and scalable transaction management (see our blog post on the CAP theorem). Yet it is a hard problem that only a few systems in the NewSQL category have managed to solve (see our blog post on NewSQL).
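To give a flavor of why transactions are hard to scale over shared-nothing, here is a toy in-memory sketch (hypothetical names, no durability or failure handling) of two-phase commit, a classical protocol for committing a transaction atomically across partitions; the coordinator's synchronous round-trips to every participant are exactly the cost that key-value systems avoided by dropping multi-key transactions:

```python
# Toy two-phase commit across shared-nothing partitions (in-memory sketch).
# Writes are new values, not deltas. A real implementation would also need
# durable prepare/commit records and recovery logic.

class Participant:
    """One shared-nothing partition holding part of the database."""
    def __init__(self):
        self.store: dict[str, int] = {}
        self.staged: dict[str, int] = {}

    def prepare(self, writes: dict[str, int]) -> bool:
        self.staged = writes       # would be logged durably in a real system
        return True                # vote yes (could vote no on a conflict)

    def commit(self):
        self.store.update(self.staged)
        self.staged = {}

    def abort(self):
        self.staged = {}

def two_phase_commit(txn: dict[str, dict[str, int]], nodes: dict[str, Participant]) -> str:
    # Phase 1: every participant must vote yes.
    if all(nodes[n].prepare(w) for n, w in txn.items()):
        for n in txn:              # Phase 2: commit everywhere
            nodes[n].commit()
        return "committed"
    for n in txn:
        nodes[n].abort()
    return "aborted"

nodes = {"node0": Participant(), "node1": Participant()}
nodes["node0"].store = {"alice": 100}
nodes["node1"].store = {"bob": 100}

# Transfer 50 from alice (on node0) to bob (on node1): new balances.
txn = {"node0": {"alice": 50}, "node1": {"bob": 150}}
print(two_phase_commit(txn, nodes))
print(nodes["node0"].store, nodes["node1"].store)
```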