Wednesday, May 19, 2021

System design: Distributed database consistency | Highly Available Transactions: Virtues and Limitations

May 19, 2021

Here is the link. 

I like to use Google Blogger as my reader; taking notes here helps me read technical papers more closely.


Highly Available Transactions: Virtues and Limitations

ABSTRACT 

To minimize network latency and remain online during server failures and network partitions, many modern distributed data storage systems eschew transactional functionality, which provides strong semantic guarantees for groups of multiple operations over multiple data items. In this work, we consider the problem of providing Highly Available Transactions (HATs): transactional guarantees that do not suffer unavailability during system partitions or incur high network latency. We introduce a taxonomy of highly available systems and analyze existing ACID isolation and distributed data consistency guarantees to identify which can and cannot be achieved in HAT systems. This unifies the literature on weak transactional isolation, replica consistency, and highly available systems. We analytically and experimentally quantify the availability and performance benefits of HATs—often two to three orders of magnitude over wide-area networks—and discuss their necessary semantic compromises. 

1 Introduction 

The last decade has seen a shift in the design of popular large scale database systems, from the use of transactional RDBMSs [14, 38, 37] to the widespread adoption of loosely consistent distributed key-value stores [21, 23, 30]. Core to this shift was the 2000 introduction of Brewer’s CAP Theorem, which stated that a highly available system cannot provide “strong” consistency guarantees in the presence of network partitions [16]. As formally proven [35], the CAP Theorem pertains to a data consistency model called linearizability, or the ability to read the most recent write to a data item that is replicated across servers [41]. However, despite its narrow scope, the CAP Theorem is often misconstrued as a broad result regarding the ability to provide ACID database properties with high availability [9, 16, 22]; this misunderstanding has led to substantial confusion regarding replica consistency, transactional isolation, and high availability. The recent resurgence of transactional systems suggests that programmers value transactional semantics, but most existing transactional data stores do not provide availability in the presence of partitions [19, 22, 42, 25, 48, 59, 61].

CAP theorem - data store, ACID, network partitions, consistency guarantees, Brewer's CAP theorem in 2000, a data consistency model called linearizability, ability to read the most recent write to a data item that is replicated across servers, replication, data item, most recent write, read

Indeed, serializable transactions - the gold standard of traditional ACID databases - are not achievable with high availability in the presence of network partitions [28]. However, database systems have a long tradition of providing weaker isolation and consistency guarantees [2, 12, 37, 38, 43]. Today’s ACID and NewSQL databases often employ weak isolation models due to concurrency and performance benefits; weak isolation is overwhelmingly the default setting in these stores and is often the only option offered (Section 3). While weak isolation levels do not provide serializability for general-purpose transactions, they are apparently strong enough to deliver acceptable behavior to many application programmers and are substantially stronger than the semantics provided by current highly available systems. This raises a natural question: which semantics can be provided with high availability?

To date, the relationship between ACID semantics and high availability has not been well explored. We have a strong understanding of weak isolation in the single-server context from which it originated [2, 12, 38] and many papers offer techniques for providing distributed serializability [14, 25, 27, 42, 61] or snapshot isolation [43, 59]. Additionally, the distributed computing and parallel hardware literature contains many consistency models for single operations on replicated objects [23, 41, 48, 49, 60]. However, the literature lends few clues for providing semantic guarantees for multiple operations operating on multiple data items in a highly available distributed environment.

weak isolation, single-server context, distributed serializability, snapshot isolation, ...

Our main contributions in this paper are as follows. We relate the many previously proposed database isolation and data consistency models to the goal of high availability, which guarantees a response from each non-failing server in the presence of arbitrary network partitions between them. We classify which among the wide array of models are achievable with high availability, denoting them as Highly Available Transactions (HATs). In doing so, we demonstrate that although many implementations of HAT semantics are not highly available, this is an artifact of the implementations rather than an inherent property of the semantics. Our investigation shows that, besides serializability, Snapshot Isolation and Repeatable Read isolation are not HAT-compliant, while most other isolation levels are achievable with high availability. We also demonstrate that many weak replica consistency models from distributed systems are both HAT-compliant and simultaneously achievable with several ACID properties.

database isolation and data consistency models, Snapshot isolation and repeatable read isolation are not HAT-compliant. 

Our investigation is based on both impossibility results and several constructive, proof-of-concept algorithms. For example, Snapshot Isolation and Repeatable Read isolation are not HAT-compliant because they require detecting conflicts between concurrent updates (as needed for preventing Lost Updates or Write Skew phenomena), which we show is unavailable. However, Read Committed isolation, transactional atomicity (Section 5.1.2), and many other consistency models from database and distributed systems are achievable via algorithms that rely on multi-versioning and limited client-side caching. For several guarantees, such as causal consistency with phantom prevention and ANSI Repeatable Read, we consider a modified form of high availability in which clients “stick to” (i.e., have affinity with) at least one server—a property which is often implicit in the distributed systems literature [41, 48, 49] but which requires explicit consideration in a client-server replicated database context. This sticky availability is widely employed [48, 64] but is a less restrictive model (and therefore more easily achievable) than traditional high availability. 

Note to self: Google "causal consistency with phantom prevention" and ANSI Repeatable Read later...

At a high level, the virtues of HATs are guaranteed responses from any replica, low latency, and a range of semantic guarantees including several whose usefulness is widely accepted such as Read Committed. However, highly available systems are fundamentally unable to prevent concurrent updates to shared data items and cannot provide recency guarantees for reads. To understand when these virtues and limitations are relevant in practice, we survey both practitioner accounts and academic literature, perform experimental analysis on modern cloud infrastructure, and analyze representative applications for their semantic requirements. Our experiences with a HAT prototype running across multiple georeplicated datacenters indicate that HATs offer a one to three order of magnitude latency decrease compared to traditional distributed serializability protocols, and they can provide acceptable semantics for a wide range of programs, especially those with monotonic logic and commutative updates [4, 58]. HAT systems can also enforce arbitrary foreign key constraints for multi-item updates and, in some cases, provide limited uniqueness guarantees. However, HATs can fall short for applications with concurrency-sensitive operations, requiring unavailable, synchronous coordination.

Finally, we recognize that the large variety of ACID isolation levels and distributed consistency models (and therefore those in our taxonomy) can be confusing; the subtle distinctions between models may appear to be of academic concern. Accordingly, we offer the following pragmatic takeaways: 

1. The default (and sometimes strongest) configurations of most widely deployed database systems expose a range of anomalies that can compromise application-level consistency. 

2. Many of these “weak isolation” models are achievable without sacrificing high availability if implemented correctly. However, none of the achievable models prevents concurrent modifications. 

3. In addition to providing a guaranteed response and horizontal scale-out, these highly available HAT models allow latencies one to three orders of magnitude lower on current infrastructure. 

4. For correct behavior, applications may require a combination of HAT and (ideally sparing use of) non-HAT isolation levels; future database designers should plan accordingly.

2 Why High Availability? 

Why does high availability matter? Peter Deutsch starts his classic list of “Fallacies of Distributed Computing” with two concerns fundamental to distributed database systems: “1.) The network is reliable. 2.) Latency is zero” [32]. In a distributed setting, network failures may prevent database servers from communicating, and, in the absence of failures, communication is slowed by factors like physical distance, network congestion, and routing. As we will see (Section 4), highly available system designs mitigate the effects of network partitions and latency. In this section, we draw on a range of evidence that indicates that partitions occur with frequency in real-world deployments and latencies between data centers are substantial, often on the order of several hundreds of milliseconds.

2.1 Network Partitions at Scale 

According to James Hamilton, Vice President and Distinguished Engineer on the Amazon Web Services team, “network partitions should be rare but net gear continues to cause more issues than it should” [39]. Anecdotal evidence confirms Hamilton’s assertion. In April 2011, a network misconfiguration led to a twelve-hour series of outages across the Amazon EC2 and RDS services [7]. Subsequent misconfigurations and partial failures such as another EC2 outage in October 2012 have led to full site disruptions for popular web services like Reddit, Foursquare, and Heroku [33]. At global scale, hardware failures—like the 2011 outages in Internet backbones in North America and Europe due to a router bug [57]—and misconfigurations like the BGP faults in 2008 [51] and 2010 [52] can cause widespread partitioning behavior.

Many of our discussions with practitioners—especially those operating on public cloud infrastructure—as well as reports from large scale operators like Google [29] confirm that partition management is an important consideration for service operators today. System designs that do not account for partition behavior may prove difficult to operate at scale: for example, less than one year after its announcement, Yahoo!’s PNUTS developers explicitly added support for weaker, highly available operation. The engineers explained that “strict adherence [to strong consistency] leads to difficult situations under network partitioning or server failures...in many circumstances, applications need a relaxed approach” [53].

Several recent studies rigorously quantify partition behavior. A 2011 study of several Microsoft datacenters observed over 13,300 network failures with end-user impact, with an estimated median 59,000 packets lost per failure. The study found a mean of 40.8 network link failures per day (95th percentile: 136), with a median time to repair of around five minutes (and up to one week). Perhaps surprisingly, provisioning redundant networks only reduces the impact of failures by up to 40%, meaning network providers cannot easily curtail partition behavior [36]. A 2010 study of over 200 wide-area routers found an average of 16.2–302.0 failures per link per year with an average annual downtime of 24–497 minutes per link (95th percentile at least 34 hours) [63]. In HP’s managed enterprise networks, WAN, LAN, and connectivity problems account for 28.1% of all customer support tickets while 39% of tickets relate to network hardware. The median incident duration for highest priority tickets ranges from 114–188 minutes and up to a full day for all tickets [62]. Other studies confirm these results, showing median time between connectivity failures over a WAN network of approximately 3000 seconds with a median time to repair between 2 and 1000 seconds [50] as well as frequent path routing failures on the Internet [45]. A recent, informal report by Kingsbury and Bailis catalogs a host of additional practitioner reports [44]. Not surprisingly, isolating, quantifying, and accounting for these network failures is an area of active research in the networking community [47].

These studies indicate that network partitions do occur within and across modern datacenters. We observe that these partitions must be met with either unavailability at some servers or, as we will discuss, relaxed semantic guarantees.

2.2 Latency and Planet Earth 

Even with fault-free networks, distributed systems face the challenge of network communication latency, Deutsch’s second “Fallacy.” In this section, we quantify round-trip latencies, which are often large—hundreds of milliseconds in a geo-replicated, multi-data-center context. Fundamentally, the speed at which two servers can communicate is (according to modern physics) bounded by the speed of light. In the best case, two servers on opposite sides of the Earth—communicating via a hypothetical link through the planet’s core—require a minimum 85.1ms round-trip time (RTT; 133.7ms if sent at surface level). As services are replicated to multiple, geographically distinct sites, this cost of communication increases. 

In real deployments, messages travel slower than the speed of light due to routing, congestion, and server-side overheads. To illustrate the difference between intra-datacenter, inter-datacenter, and inter-planetary networks, we performed a measurement study of network behavior on Amazon’s EC2, a widely used public compute cloud. We measured one week of ping times (i.e., round-trip times, or RTTs) between all seven EC2 geographic “regions,” across three “availability zones” (closely co-located datacenters), and within a single “availability zone” (datacenter), at a granularity of 1s. We summarize the results of our network measurement study in Table 1. On average, intra-datacenter communication (Table 1a) is between 1.82 and 6.38 times faster than across geographically co-located datacenters (Table 1b) and between 40 and 647 times faster than across geographically distributed datacenters (Table 1c). The cost of wide-area communication exceeds the speed of light: for example, while a speed-of-light RTT from São Paulo to Singapore is 106.7ms, ping packets incur an average 362.8ms RTT (95th percentile: 649ms). As shown in Figure 1, the distribution of latencies varies between links, but the trend is clear: remote communication has a substantial cost. Quantifying and minimizing communication delays is also an active area of research in the networking community [66]. 
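As a sanity check on these numbers, both bounds follow directly from the speed of light and the Earth's dimensions. A quick sketch (my own, using approximate values for the Earth's equatorial diameter and circumference) reproduces the paper's figures:

```python
# Speed-of-light RTT bounds from Section 2.2. Assumed constants:
# equatorial diameter ~12,756 km, circumference ~40,075 km,
# c ~299,792 km/s in vacuum. Real links are slower due to routing,
# congestion, and server-side overheads.

C_KM_PER_S = 299_792.458
EARTH_DIAMETER_KM = 12_756        # straight line through the core
EARTH_CIRCUMFERENCE_KM = 40_075

def rtt_ms(one_way_km: float) -> float:
    """Round-trip time in milliseconds at the speed of light."""
    return 2 * one_way_km / C_KM_PER_S * 1000

print(f"through the core:  {rtt_ms(EARTH_DIAMETER_KM):.1f} ms")           # ~85.1 ms
print(f"along the surface: {rtt_ms(EARTH_CIRCUMFERENCE_KM / 2):.1f} ms")  # ~133.7 ms
```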

3 ACID in the Wild 

The previous section demonstrated that distributed systems must address partitions and latency: what does this mean for distributed databases? Database researchers and designers have long realized that serializability is not achievable in a highly available system [28], meaning that, in environments like those in Section 2, database designs face a choice between availability and strong semantics. However, even in a single-node database, the coordination penalties associated with serializability can be severe and are manifested in the form of decreased concurrency (and, subsequently, performance degradation, scalability limitations, and, often, aborts due to deadlock or contention) [38]. Accordingly, to increase concurrency, database systems offer a range of ACID properties weaker than serializability: the host of so-called weak isolation models describe varying restrictions on the space of schedules that are allowable by the system [2, 5, 12]. None of these weak isolation models guarantees serializability, but, as we see below, their benefits are often considered to outweigh costs of possible consistency anomalies that might arise from their use.

To understand the prevalence of weak isolation, we recently surveyed the default and maximum isolation guarantees provided by 18 databases, often claiming to provide “ACID” or “NewSQL” functionality [9]. As shown in Table 2, only three out of 18 databases provided serializability by default, and eight did not provide serializability as an option at all. This is particularly surprising when we consider the widespread deployment of many of these nonserializable databases, like Oracle 11g, which are known to power major businesses and product functionality. Given that these weak transactional models are frequently used, our inability to provide serializability in arbitrary HATs appears non-fatal for practical applications. If application writers and database vendors have already decided that the benefits of weak isolation outweigh potential application inconsistencies, then, in a highly available environment that prohibits serializability, similar decisions may be tenable.

It has been unknown which of these guarantees can be provided with high availability, or are HAT-compliant. Existing algorithms for providing weak isolation are often designed for a single-node context and are, to the best of our knowledge, unavailable due to reliance on concurrency control mechanisms like locking that are not resilient to partial failure (Section 6.1). Moreover, we are not aware of any prior literature that provides guidance as to the relationship between weak isolation and high availability: prior work has examined the relationship between serializability and high availability [28] and weak isolation in general [2, 12, 38] but not weak isolation and high availability together. A primary goal in the remainder of this paper is to understand which models are HAT-compliant.

4 High Availability 

To understand which guarantees can be provided with high availability, we must first define what high availability means. In this section, we will formulate a model that captures a range of availability models, including high availability, availability with stickiness, and transactional availability.

Informally, highly available algorithms ensure “always on” operation and, as a side effect, guarantee low latency. If users of a highly available system are able to contact a (set of) server(s) in a system, they are guaranteed a response; this means servers will not need to synchronously communicate with others. If servers are partitioned from one another, they do not need to stall in order to provide clients a “safe” response to operations. This lack of fast path coordination also means that a highly available system provides low latency [1]; in a wide-area setting, clients of a highly available system need not wait for cross-datacenter communication. To properly describe whether a transactional system is highly available, we need to describe what servers a client must contact as well as what kinds of responses a server can provide, especially given the possibility of aborts.

Traditionally, a system provides high availability if every user that can contact a correct (non-failing) server eventually receives a response from that server, even in the presence of arbitrary, indefinitely long network partitions between servers [35]. As in a standard distributed database, designated servers might perform operations for different data items. A server that can handle an operation for a given data item is called a replica for that item.

4.1 Sticky Availability 

In addition to high availability, which allows operations on any replica, distributed algorithms often assume a model in which clients always contact the same logical replica(s) across subsequent operations, whereby each of the client’s prior operations (but not necessarily other clients’ operations) are reflected in the database state that they observe. As we will discuss in Section 5, clients can ensure continuity between operations (e.g., reading their prior updates to a data item) by maintaining affinity or “stickiness” with a server or set of servers [64]. In a fully replicated system, where all servers are replicas for all data items, stickiness is simple: a client can maintain stickiness by contacting the same server for each of its requests. However, to stay “sticky” in a partially-replicated system, where servers are replicas for subsets of the set of data items (which we consider in this paper), a client must maintain stickiness with a single logical copy of the database, which may consist of multiple physical servers. We say that a system provides sticky availability if, whenever a client’s transaction is executed against a copy of database state that reflects all of the client’s prior operations, it eventually receives a response, even in the presence of indefinitely long partitions (where “reflects” is dependent on semantics). A client may choose to become sticky available by acting as a server itself; for example, a client might cache its reads and writes [11, 60, 67]. Any guarantee achievable in a highly available system is achievable in a sticky high availability system but not vice-versa. 

4.2 Transactional Availability 

Until now, we have considered single-object, single-operation availability. This is standard in the distributed systems literature (e.g., distributed register models such as linearizability all concern single objects [41]), yet the database literature largely focuses on transactions: groups of multiple operations over multiple objects. Accordingly, by themselves, traditional definitions of high availability are insufficient to describe availability guarantees for transactions. Additionally, given the choice of commit and abort responses—which signal transaction success or failure to a client—we must take care in defining transactional availability.

We say that a transaction has replica availability if it can contact at least one replica for every item it attempts to access; this may result in “lower availability” than a non-transactional availability requirement (e.g., single-item availability). Additionally, given the possibility of system-initiated aborts, we need to ensure useful forward progress: a system can trivially guarantee clients a response by always aborting all transactions. However, this is an unsatisfactory system because nothing good (transaction commit) ever happens; we should require a liveness property [54].

A system cannot guarantee that every transaction will commit—transactions may choose to abort themselves—but we need to make sure that the system will not indefinitely abort transactions of its own volition. We call a transaction abort due to a transaction’s own choosing (e.g., as an operation of the transaction itself or due to a would-be violation of a declared integrity constraint) an internal abort and an abort due to system implementation or operation an external abort. We say that a system provides transactional availability if, given replica availability for every data item in a transaction, the transaction eventually commits (possibly after multiple client retries) or internally aborts [9]. A system provides sticky transactional availability if, given sticky availability, a transaction eventually commits or internally aborts.
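One way I picture this contract is as a client retry loop: external (system-initiated) aborts are retryable, while internal aborts are final. The exception names below are my own, for illustration only, not from any real API:

```python
# Sketch of the transactional availability liveness contract.
# Assumption: the system signals why a transaction aborted; the
# exception classes here are hypothetical.

class InternalAbort(Exception):
    """The transaction aborted by its own choosing (e.g., a declared
    integrity constraint would be violated)."""

class ExternalAbort(Exception):
    """A system-initiated abort; the client may retry."""

def run_transaction(txn_body):
    # Transactional availability says: given replica availability for
    # every accessed item, this loop eventually exits with a commit
    # result or an InternalAbort -- the system cannot spin us forever.
    while True:
        try:
            return txn_body()
        except ExternalAbort:
            continue  # system abort: retry
        # InternalAbort propagates: the transaction itself chose to stop.
```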

5 Highly Available Transactions 

HAT systems provide transactions with transactional availability or sticky transactional availability. They offer latency and availability benefits over traditional distributed databases, yet they cannot achieve all possible semantics. In this section, we describe ACID, distributed replica consistency, and session consistency levels which can be achieved with high availability (Read Committed isolation, variants of Repeatable Read, atomic reads, and many session guarantees), those with sticky availability (read your writes, PRAM and causal consistency). We also discuss properties that cannot be provided in a HAT system (those preventing Lost Update and Write Skew or guaranteeing recency). We present a full summary of these results in Section 5.3. 

As Brewer states, “systems and database communities are separate but overlapping (with distinct vocabulary)” [16]. With this challenge in mind, we build on existing properties and definitions from the database and distributed systems literature, providing a brief, informal explanation and example for each guarantee. The database isolation guarantees require particular care, since different DBMSs often use the same terminology for different mechanisms and may provide additional guarantees in addition to our implementation-agnostic definitions. We draw largely on Adya’s dissertation [2] and somewhat on its predecessor work: the ANSI SQL specification [5] and Berenson et al.’s subsequent critique [12].

For brevity, we provide an informal presentation of each guarantee here (accompanied by appropriate references) but give a full set of formal definitions in our extended Technical Report [8]. In our examples, we exclusively consider read and write operations, denoting a write of value v to data item d as wd(v) and a read from data item d returning v as rd(v). We assume that all data items have the null value, ⊥, at database initialization, and, unless otherwise specified, all transactions in the examples commit.

5.1 Achievable HAT Semantics 

To begin, we present well-known semantics that can be achieved in HAT systems. In this section, our primary goal is feasibility, not performance. As a result, we offer proof-of-concept highly available algorithms that are not necessarily optimal or even efficient: the challenge is to prove the existence of algorithms that provide high availability. However, we briefly study a subset of their performance implications in Section 6.

5.1.1 ACID Isolation Guarantees 

To begin, Adya captures Read Uncommitted isolation as PL-1. In this model, writes to each object are totally ordered, corresponding to the order in which they are installed in the database. In a distributed database, different replicas may receive writes to their local copies of data at different times but should handle concurrent updates (i.e., overwrites) in accordance with the total order for each item. PL-1 requires that writes to different objects be ordered consistently across transactions, prohibiting Adya’s phenomenon G0 (also called “Dirty Writes” [12]). If we build a graph of transactions with an edge from one transaction to another when the former overwrites the latter’s write to the same object, then, under Read Uncommitted, the graph should not contain cycles [2]. Consider the following example: 

T1 ∶ wx(1) wy(1) 

T2 ∶ wx(2) wy(2) 

In this example, under Read Uncommitted, it is impossible for the database to order T1’s wx(1) before T2’s wx(2) but order T2’s wy(2) before T1’s wy(1). Read Uncommitted is easily achieved by marking each of a transaction’s writes with the same timestamp (unique across transactions; e.g., combining a client’s ID with a sequence number) and applying a “last writer wins” conflict reconciliation policy at each replica. Later properties will strengthen Read Uncommitted.

I can fully understand the example. Good teaching. 
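To make the last-writer-wins construction concrete, here is my minimal sketch of a replica applying transaction-tagged timestamps (the class and names are mine, not the paper's):

```python
# Read Uncommitted (PL-1) via last-writer-wins. Assumption: every
# transaction tags all of its writes with one unique timestamp, e.g.
# (sequence_number, client_id), so writes to each item are totally
# ordered the same way on every replica -- no G0 cycles possible.

class LWWReplica:
    def __init__(self):
        self.store = {}  # item -> (timestamp, value)

    def apply_write(self, item, value, ts):
        """Install a write; the higher timestamp wins."""
        current = self.store.get(item)
        if current is None or ts > current[0]:
            self.store[item] = (ts, value)

    def read(self, item):
        entry = self.store.get(item)
        return entry[1] if entry else None

# T1 and T2 from the example: the same timestamp wins for *both*
# x and y, so the per-item orders cannot be mutually inconsistent.
r = LWWReplica()
t1, t2 = (1, "client-a"), (2, "client-b")
for item, value, ts in [("x", 1, t1), ("y", 1, t1), ("x", 2, t2), ("y", 2, t2)]:
    r.apply_write(item, value, ts)
print(r.read("x"), r.read("y"))  # -> 2 2
```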

Read Committed isolation is particularly important in practice as it is the default isolation level of many DBMSs (Section 3). Centralized implementations differ, with some based on long-duration exclusive locks and short-duration read locks [38] and others based on multiple versions. These implementations often provide recency and monotonicity properties beyond what is implied by the name “Read Committed” and what is captured by the implementation-agnostic definition: under Read Committed, transactions should not access uncommitted or intermediate versions of data items. This prohibits both “Dirty Writes”, as above, and also “Dirty Reads” phenomena. This isolation is Adya’s PL-2 and is formalized by prohibiting Adya’s G1{a-c} (or ANSI’s P1, or “broad” P1 [2.2] from Berenson et al.). For instance, in the example below, T3 should never see a = 1, and, if T2 aborts, T3 should not read a = 3: 

T1 ∶ wx(1) wx(2) 

T2 ∶ wx(3) 

T3 ∶ rx(a) 

It is fairly easy for a HAT system to prevent “Dirty Reads”: if each client never writes uncommitted data to shared copies of data, then transactions will never read each other’s dirty data. As a simple solution, clients can buffer their writes until they commit, or, alternatively, can send them to servers, which will not deliver their value to other readers until notified that the writes have been committed. Unlike a lock-based implementation, this implementation does not provide recency or monotonicity guarantees, but it satisfies the implementation-agnostic definition.

Concepts: recency or monotonicity guarantees, the implementation-agnostic definition
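A toy version of the write-buffering approach might look like this; the class and the dict standing in for a server are my illustrative assumptions, not the paper's implementation:

```python
# Read Committed via client-side write buffering: writes stay local
# until commit, so no other transaction can observe dirty data.

class BufferingClient:
    def __init__(self, server):
        self.server = server   # shared committed store: item -> value
        self.buffer = {}       # uncommitted writes, visible only locally

    def write(self, item, value):
        self.buffer[item] = value            # invisible to other clients

    def read(self, item):
        # See our own in-flight writes, else the last committed value.
        if item in self.buffer:
            return self.buffer[item]
        return self.server.get(item)

    def commit(self):
        self.server.update(self.buffer)      # writes become visible
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()                  # dirty writes simply vanish

server = {}
t2, t3 = BufferingClient(server), BufferingClient(server)
t2.write("x", 3)
print(t3.read("x"))  # -> None: T2's uncommitted wx(3) is invisible
t2.abort()
print(t3.read("x"))  # -> None: the aborted write was never readable
```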

Several different properties have been labeled Repeatable Read isolation. As we will show in Section 5.2.1, some of these are not achievable in a HAT system. However, the ANSI standardized implementation-agnostic definition [5] is achievable and directly captures the spirit of the term: if a transaction reads the same data more than once, it sees the same value each time (preventing “Fuzzy Read,” or P2). In this paper, to disambiguate between other definitions of “Repeatable Read,” we will call this property “cut isolation,” since each transaction reads from a non-changing cut, or snapshot, over the data items. If this property holds over reads from discrete data items, we call it Item Cut Isolation, and, if we also expect a cut over predicate-based reads (e.g., SELECT WHERE; preventing Phantoms [38], or Berenson et al.’s P3/A3), we have the stronger property of Predicate Cut-Isolation. In the example below, under both levels of cut isolation, T3 must read a = 1:

T1 ∶ wx(1) 

T2 ∶ wx(2) 

T3 ∶ rx(1) rx(a) 

It is possible to satisfy Item Cut Isolation with high availability by having transactions store a copy of any read data at the client such that the values they read for each item never change unless they overwrite them themselves. These stored values can be discarded at the end of each transaction; the same effect can alternatively be accomplished on (sticky) replicas via multi-versioning. Predicate Cut Isolation is also achievable in HAT systems via similar caching middleware or multi-versioning that tracks entire logical ranges of predicates in addition to item-based reads.
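Here is a small sketch of the client-side caching idea for Item Cut Isolation (structure and names are mine):

```python
# Item Cut Isolation: the first read of each item is pinned for the
# life of the transaction, so repeated reads never return new values.

class CutIsolationTxn:
    def __init__(self, replica_read):
        self.replica_read = replica_read  # function: item -> value
        self.read_cache = {}              # item -> value fixed at first read
        self.local_writes = {}

    def read(self, item):
        if item in self.local_writes:     # our own overwrite is visible
            return self.local_writes[item]
        if item not in self.read_cache:   # first read pins the value
            self.read_cache[item] = self.replica_read(item)
        return self.read_cache[item]

    def write(self, item, value):
        self.local_writes[item] = value

    def end(self):
        self.read_cache.clear()           # cache discarded at txn end

store = {"x": 1}
t3 = CutIsolationTxn(store.get)
print(t3.read("x"))   # -> 1
store["x"] = 2        # a concurrent wx(2) arrives at the replica
print(t3.read("x"))   # -> still 1: no Fuzzy Read
```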

5.1.2 ACID Atomicity Guarantees 

Atomicity, informally guaranteeing that either all or none of a transaction’s effects succeed, is core to ACID guarantees. Although, at least by the ACID acronym, atomicity is not an “isolation” property, atomicity properties also restrict the updates visible to other transactions. Accordingly, here, we consider the isolation effects of atomicity, which we call Monotonic Atomic View (MAV) isolation. Under MAV, once some of the effects of a transaction Ti are observed by another transaction Tj, thereafter, all effects of Ti are observed by Tj. That is, if a transaction Tj reads a version of an object that transaction Ti wrote, then a later read by Tj cannot return a value whose later version is installed by Ti. Together with item cut isolation, MAV prevents Read Skew anomalies (Berenson et al.’s A5A) and is useful in several contexts such as maintaining foreign key constraints, consistent global secondary indexing, and maintenance of derived data. In the example below, under MAV, because T2 has read T1’s write to y, T2 must observe b = c = 1 (or later versions for each key):

T1 ∶ wx(1) wy(1) wz(1) 

T2 ∶ rx(a) ry(1) rx(b) rz(c) 

T2 can also observe a = ⊥, a = 1, or a later version of x. In the hierarchy of existing isolation properties, we place MAV below Adya’s PL-2L (as it does not necessarily enforce transitive read-write dependencies) but above Read Committed (PL-2). Notably, MAV disallows reading intermediate writes (Adya’s G1b): observing all effects of a transaction implicitly requires observing the final (committed) effects of the transaction as well.

Perplexingly, discussions of MAV are absent from existing treatments of weak isolation. This is perhaps again due to the single-node context in which prior work was developed: on a single server (or a fully replicated database), MAV is achievable via lightweight locking and/or local concurrency control over data items [26, 43]. In contrast, in a distributed environment, MAV over arbitrary groups of non-co-located items is considerably more difficult to achieve with high availability.

As a straw man, replicas can store all versions ever written to each data item. Replicas can gossip information about versions they have observed and construct a lower bound on the versions that can be found on every replica (which can be represented by either a list of versions, or, more realistically, a vector clock). At the start of each transaction, clients can choose a read timestamp that is lower than or equal to this global lower bound, and, during transaction execution, replicas return the latest version of each item that is not greater than the client’s chosen timestamp. If this lower bound is advanced along transactional boundaries, clients will observe MAV. This algorithm has several variants in the literature [20, 67], and older versions can be asynchronously garbage collected.
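A toy version of this straw man, assuming gossip has already established the global lower bound (vector clocks and garbage collection elided; all names are mine):

```python
# Straw-man MAV: replicas keep every version; a transaction reads
# "as of" a timestamp known to be present on every replica.

import bisect

class VersionedReplica:
    def __init__(self):
        self.versions = {}  # item -> sorted list of (commit_ts, value)

    def install(self, item, value, commit_ts):
        bisect.insort(self.versions.setdefault(item, []), (commit_ts, value))

    def read_at(self, item, ts):
        """Latest version with commit timestamp <= ts."""
        vs = self.versions.get(item, [])
        i = bisect.bisect_right(vs, (ts, float("inf")))
        return vs[i - 1][1] if i else None

# Assume gossip says all replicas hold every write with commit_ts <= 10.
global_lower_bound = 10
replica = VersionedReplica()
replica.install("x", 1, commit_ts=7)    # T1's writes, fully replicated
replica.install("y", 1, commit_ts=7)
replica.install("x", 9, commit_ts=12)   # newer txn, not yet everywhere

read_ts = global_lower_bound            # chosen at transaction start
# Either all of T1's effects are visible or none are: MAV.
print(replica.read_at("x", read_ts), replica.read_at("y", read_ts))  # -> 1 1
```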

We have developed a more efficient MAV algorithm, which we sketch here and provide greater detail in our extended Technical Report [8]. We begin with our Read Committed algorithm, but replicas wait to reveal new writes to readers until all of the replicas for the final writes in the transaction have received their respective writes (are pending stable). Clients include additional metadata with each write: a single timestamp for all writes in the transaction (e.g., as in Read Uncommitted) and a list of items written to in the transaction. When a client reads, the return value’s timestamp and list of items form a lower bound on the versions that the client should read for the other items. When a client reads, it attaches a timestamp to its request representing the current lower bound for that item. Replicas use this timestamp to respond with either a write matching the timestamp or a pending stable write with a higher timestamp. Servers keep two sets of writes for each data item: the write with the highest timestamp that is pending stable and a set of writes that are not yet pending stable. This is entirely master-less and operations never block due to replica coordination.

Concepts: MAV (Monotonic Atomic View) algorithm, Read Committed algorithm, replicas, master-less
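My rough reading of the pending-stable scheme, compressed into a sketch; this is an interpretation of the prose above with invented names, not the authors' actual algorithm:

```python
# Efficient-MAV sketch: a server answers a read with either the write
# matching the client's per-item lower bound or its newest "pending
# stable" write (one whose whole transaction has reached all replicas).

class MAVServer:
    def __init__(self):
        self.good = {}     # item -> (ts, value, txn_items), pending stable
        self.pending = {}  # item -> {ts: (value, txn_items)}, not yet stable

    def write(self, item, value, ts, txn_items):
        self.pending.setdefault(item, {})[ts] = (value, txn_items)

    def mark_stable(self, item, ts):
        """Called once all replicas for the transaction's writes have them."""
        value, txn_items = self.pending[item].pop(ts)
        if item not in self.good or ts > self.good[item][0]:
            self.good[item] = (ts, value, txn_items)

    def read(self, item, required_ts=0):
        """Return a write at the client's lower bound, or a newer stable one."""
        entry = self.pending.get(item, {}).get(required_ts)
        if entry:
            return (required_ts,) + entry
        return self.good.get(item)

s = MAVServer()
s.write("x", 1, ts=7, txn_items=["x", "y"])
s.write("y", 1, ts=7, txn_items=["x", "y"])
s.mark_stable("x", 7)
s.mark_stable("y", 7)
ts, _, items = s.read("y")             # reading y reveals the txn metadata...
print(s.read("x", required_ts=ts))     # ...so x must be read at ts >= 7
```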

5.1.3 Session Guarantees 

A useful class of safety guarantees refers to real-time or client-centric ordering within a session, “an abstraction for the sequence of...operations performed during the execution of an application” [60]. These “session guarantees” have been explored in the distributed systems literature [60, 64] and sometimes in the database literature [27]. For us, a session describes a context that should persist between transactions: for example, on a social networking site, all of a user’s transactions submitted between “log in” and “log out” operations might form a session.

Several session guarantees can be made with high availability: 

Monotonic reads requires that, within a session, subsequent reads to a given object “never return any previous values”; reads from each item progress according to a total order (e.g., the order from Read Uncommitted). 

Monotonic writes requires that each session’s writes become visible in the order they were submitted. Any order on transactions (as in Read Uncommitted isolation) should also be consistent with any precedence that a global observer would see.

Writes Follow Reads requires that, if a session observes an effect of transaction T1 and subsequently commits transaction T2, then another session can only observe effects of T2 if it can also observe T1’s effects (or later values that supersede T1’s); this corresponds to Lamport’s “happens-before” relation [46]. Any order on transactions should respect this transitive order.

The above guarantees can be achieved by forcing servers to wait to reveal new writes (say, by buffering them in separate local storage) until each write’s respective dependencies are visible on all replicas. This mechanism effectively ensures that all clients read from a globally agreed upon lower bound on the versions written. This is highly available because a client will never block due to inability to find a server with a sufficiently up-to-date version of a data item. However, it does not imply that transactions will read their own writes or, in the presence of partitions, make forward progress through the version history. The problem is that under non-sticky availability, a system must handle the possibility that, under a partition, an unfortunate client will be forced to issue its next requests against a partitioned, out-of-date server.
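A sketch of this buffer-until-stable mechanism, with an assumed gossip-provided predicate for "visible on all replicas" (names are mine):

```python
# Session guarantees via buffering: a server hides each new write until
# the write's dependencies are visible on every replica, so reads are
# served from a globally agreed-upon lower bound.

class BufferedServer:
    def __init__(self):
        self.visible = {}   # item -> (ts, value): safe to reveal
        self.hidden = []    # [(ts, item, value, deps)] awaiting stability

    def receive(self, ts, item, value, deps):
        self.hidden.append((ts, item, value, deps))

    def reveal_stable(self, visible_everywhere):
        """Reveal writes whose dependencies are visible on all replicas;
        visible_everywhere(dep) is assumed to come from gossip."""
        still_hidden = []
        for ts, item, value, deps in self.hidden:
            if all(visible_everywhere(d) for d in deps):
                if item not in self.visible or ts > self.visible[item][0]:
                    self.visible[item] = (ts, value)
            else:
                still_hidden.append((ts, item, value, deps))
        self.hidden = still_hidden

    def read(self, item):
        entry = self.visible.get(item)
        return entry[1] if entry else None

s = BufferedServer()
s.receive(ts=2, item="x", value="v2", deps=[("x", 1)])
s.reveal_stable(lambda dep: False)  # dependency not everywhere yet
print(s.read("x"))                  # -> None: the write stays hidden
s.reveal_stable(lambda dep: True)
print(s.read("x"))                  # -> v2
```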

A solution to this conundrum is to forgo high availability and settle for sticky availability. Sticky availability permits three additional guarantees, which we first define and then prove are unachievable in a generic highly available system: 

Read your writes requires that whenever a client reads a given data item after updating it, the read returns the updated value (or a value that overwrote the previously written value).

PRAM (Pipelined Random Access Memory) provides the illusion of serializing each of the operations (both reads and writes) within each session and is the combination of monotonic reads, monotonic writes, and read your writes [41].

Causal consistency [3] is the combination of all of the session guarantees [17] (alternatively, PRAM with writes-follow-reads) and is also referred to by Adya as PL-2L isolation [2]. 

Read your writes is not achievable in a highly available system. Consider a client that executes the following two transactions: 

T1 ∶ wx(1) 

T2 ∶ rx(a) 

If the client executes T1 against a server that is partitioned from the rest of the other servers, then, for transactional availability, the server must allow T1 to commit. If the same client subsequently executes T2 against the same (partitioned) server in the same session, then it will be able to read its writes. However, if the network topology changes and the client can only execute T2 on a different replica that is partitioned from the replica that executed T1, then the system will have to either stall indefinitely to allow the client to read her writes (violating transactional availability) or will have to sacrifice read your writes guarantees. However, if the client remains sticky with the server that executed T1, then we can disallow this scenario. Accordingly, read your writes, and, by proxy, causal consistency and PRAM require stickiness. Read your writes is provided by default in a sticky system. Causality and PRAM guarantees can be accomplished with well-known variants [3, 11, 48, 60, 67] of the prior session guarantee algorithms we presented earlier: only reveal new writes to clients when their (respective, model-specific) dependencies have been revealed.
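Following the remark in Section 4.1 that a client can act as its own sticky server by caching its reads and writes, a minimal own-write cache might look like this (my names, not the paper's):

```python
# Read-your-writes by client-side caching: prefer our own (newer)
# writes over a possibly stale replica after a partition reroutes us.

class StickyClient:
    def __init__(self):
        self.own_writes = {}   # item -> (seq, value)
        self.seq = 0

    def write(self, replica, item, value):
        self.seq += 1
        self.own_writes[item] = (self.seq, value)
        replica[item] = (self.seq, value)   # may not reach other replicas

    def read(self, replica, item):
        mine = self.own_writes.get(item)
        theirs = replica.get(item)
        # Use the replica's value only if it is at least as new as ours.
        if mine and (theirs is None or theirs[0] < mine[0]):
            return mine[1]
        return theirs[1] if theirs else None

fresh, stale = {}, {}                # two replicas split by a partition
c = StickyClient()
c.write(fresh, "x", 1)               # T1 commits on one side
print(c.read(stale, "x"))            # -> 1: own-write cache masks staleness
```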

5.1.4 Additional HAT Guarantees 

In this section, we briefly discuss two additional kinds of guarantees that are achievable in HAT systems.

Consistency: A HAT system can make limited application-level consistency guarantees. It can often execute commutative and logically monotonic [4] operations without the risk of invalidating application-level integrity constraints and can maintain limited criteria like foreign key constraints (via MAV). We do not describe the entire space of application-level consistency properties that are achievable (see Section 7) but we specifically evaluate TPC-C transaction semantics with HAT guarantees in Section 6.

Convergence: Under arbitrary (but not infinite) delays, HAT systems can ensure convergence, or eventual consistency: in the absence of new mutations to a data item, all servers should eventually agree on the value for each item [49, 64]. This is typically accomplished by any number of anti-entropy protocols, which periodically update neighboring servers with the latest value for each data item [31]. Establishing a final convergent value is related to determining a total order on transaction updates to each item, as in Read Uncommitted.
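A pairwise anti-entropy step under last-writer-wins fits in a few lines (assuming per-item timestamps as in Read Uncommitted; the function is my sketch):

```python
# Anti-entropy: two servers exchange (timestamp, value) pairs and keep
# the per-item maximum, so quiescent items eventually agree everywhere.

def merge(store_a, store_b):
    """One pairwise anti-entropy step between two stores
    mapping item -> (timestamp, value)."""
    for item in set(store_a) | set(store_b):
        candidates = [s[item] for s in (store_a, store_b) if item in s]
        newest = max(candidates)          # last writer wins
        store_a[item] = store_b[item] = newest

s1 = {"x": (3, "v3")}
s2 = {"x": (5, "v5"), "y": (1, "v1")}
merge(s1, s2)
print(s1 == s2, s1["x"])  # -> True (5, 'v5'): both stores converged
```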

5.2 Unachievable HAT Semantics 

While there are infinitely many HAT models (Section 7), at this point, we have largely exhausted the range of achievable, previously defined (and useful) semantics that are available to HAT systems. Before summarizing our possibility results, we will present impossibility results for HATs, also defined in terms of previously identified isolation and consistency anomalies. Most notably, it is impossible to prevent Lost Update or Write Skew in a HAT system.

5.2.1 Unachievable ACID Isolation 

In this section, we demonstrate that preventing Lost Update and Write Skew - and therefore providing Snapshot Isolation, Repeatable Read, and one-copy serializability - inherently requires foregoing high availability guarantees. 

Berenson et al. define Lost Update as when one transaction T1 reads a given data item, a second transaction T2 updates the same data item, then T1 modifies the data item based on its original read of the data item, “missing” or “losing” T2’s newer update. Consider a database containing only the following transactions: 

T1 ∶ rx(a) wx(a + 2) 

T2 ∶ wx(2)

If T1 reads a = 1 but T2’s write to x precedes T1’s write operation, then the database will end up with a = 3, a state that could not have resulted in a serial execution due to T2’s “Lost Update.” 

It is impossible to prevent Lost Update in a highly available environment. Consider two clients who submit the following T1 and T2 on opposite sides of a network partition:

T1 ∶ rx(100) wx(100 + 20 = 120) 

T2 ∶ rx(100) wx(100 + 30 = 130) 

Regardless of whether x = 120 or x = 130 is chosen by a replica, the database state could not have arisen from a serial execution of T1 and T2. To prevent this, either T1 or T2 should not have committed. Each client’s respective server might try to detect that another write occurred, but this requires knowing the version of the latest write to x. In our example, this reduces to a requirement for linearizability, which is, via Gilbert and Lynch’s proof of the CAP Theorem, provably at odds with high availability [35].
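To see concretely why prevention requires knowing the latest version: the standard single-system remedy is a versioned compare-and-set, which is exactly the linearizable check a partitioned replica cannot perform (a sketch with hypothetical names):

```python
# Lost Update prevention needs a check against the globally latest
# version. Under a partition, neither side can run this check.

class VersionedCell:
    def __init__(self, value):
        self.version, self.value = 0, value

    def compare_and_set(self, expected_version, new_value):
        """Commit only if nobody wrote since our read."""
        if self.version != expected_version:
            return False                 # concurrent update detected: abort
        self.version += 1
        self.value = new_value
        return True

x = VersionedCell(100)
v1 = x.version                           # T1 reads rx(100)
v2 = x.version                           # T2 reads rx(100)
print(x.compare_and_set(v1, 120))        # T1 commits -> True
print(x.compare_and_set(v2, 130))        # T2 is detected and aborted -> False
```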

Write Skew is a generalization of Lost Update to multiple keys. It occurs when one transaction T1 reads a given data item x, a second transaction T2 reads a different data item y, then T1 writes to y and commits and T2 writes to x and commits. As an example of Write Skew, consider the following two transactions: 

T1 ∶ ry(0) wx(1) 

T2 ∶ rx(0) wy(1)

As Berenson et al. describe, if there was an integrity constraint between x and y such that only one of x or y should have value 1 at any given time, then this write skew would violate the constraint (which is preserved in serializable executions). Write skew is a somewhat esoteric anomaly—for example, it does not appear in TPC-C [34]—but, as a generalization of Lost Update, it is also unavailable to HAT systems.



