Sunday, August 9, 2020

USENIG.org: TAO: Facebook’s Distributed Data Store for the Social Graph

August 9, 2020

Introduction

It is the first time I came cross the original conference paper. I like to read the paper as well. 

TAO: Facebook’s Distributed Data Store for the Social Graph

Here is the link of paper. Total pages are 12 pages. I like to spend 30 minutes to 60 minutes to read the paper. 

Statement: 

TAO, a read-optimized graph data store we have built to handle a demanding Facebook workload

read-optimized graph data store - 

Web server -> TAO data store -> Database, memcache as a lookaside cache - what is lookaside cache

MySQL - persistent storage - graph-aware cache - how to define graph-aware? 

CAP - AP - favors availability and per-machine efficiency over strong consistency. 

TAO can sustain a billion reads per second on a changing data set of many petabytes

efficient and available read-mostly access to a changing graph 

objects and associations, a data model and API that we use to access the graph

TAO - a distributed system that implements this API 

Facebook has more than a billion active users who record their relationships, share their interests, upload text, images, and video, and curate semantic information about their data [2]. The personalized experience of social applications comes from timely, efficient, and scalable access to this flood of data, the social graph. In this paper we introduce TAO, a read-optimized graph data store we have built to handle a demanding Facebook workload. Before TAO, Facebook’s web servers directly accessed MySQL to read or write the social graph, aggressively using memcache [21] as a lookaside cache. TAO implements a graph abstraction directly, allowing it to avoid some of the fundamental shortcomings of a lookaside cache architecture. TAO continues to use MySQL for persistent storage, but mediates access to the database and uses its own graph-aware cache. TAO is deployed at Facebook as a single geographically distributed instance. It has a minimal API and explicitly favors availability and per-machine efficiency over strong consistency; its novelty is its scale: TAO can sustain a billion reads per second on a changing data set of many petabytes. Overall, this paper makes three contributions. We motivate (§ 2) and characterize (§ 7) a challenging workload: efficient and available read-mostly access to a changing graph. We describe objects and associations, a data model and API that we use to access the graph (§ 3). Lastly, we detail TAO, a geographically distributed system that implements this API (§§ 4–6), and evaluate its performance on our workload (§ 8).

Let me try to read the 2.1 Serving the graph from Memcache

Facebook was originally built by storing the social graph in MySQL, querying it from PHP, and caching results in memcache [21]. This lookaside cache architecture is well suited to Facebook’s rapid iteration cycles, since all of the data mapping and cache-invalidation computations are in client code that is deployed frequently. Over time a PHP abstraction was developed that allowed developers to read and write the objects (nodes) and associations (edges) in the graph, and direct access to MySQL was deprecated for data types that fit the model. TAO is a service we constructed that directly implements the objects and associations model. We were motivated by encapsulation failures in the PHP API, by the opportunity to access the graph easily from non-PHP services, and by several fundamental problems with the lookaside cache architecture:

TAO is a service - objects and associations model 

Inefficient edge lists: A key-value cache is not a good semantic fit for lists of edges; queries must always fetch the entire edge list and changes to a single edge require the entire list to be reloaded. Basic list support in a lookaside cache would only address the first problem; something much more complicated is required to coordinate concurrent incremental updates to cached lists.

coordinate concurrent incremental updates to cached lists?  - how to coordinate? 

Distributed control logic: In a lookaside cache architecture the control logic is run on clients that don’t communicate with each other. This increases the number of failure modes, and makes it difficult to avoid thundering herds. Nishtala et al. provide an in-depth discussion of the problems and present leases, a general solution [21]. For objects and associations the fixed API allows us to move the control logic into the cache itself, where the problem can be solved more efficiently.

What is thundering herds? - 

Expensive read-after-write consistency: Facebook uses asynchronous master/slave replication for MySQL, which poses a problem for caches in data centers using a replica. Writes are forwarded to the master, but some time will elapse before they are reflected in the local replica. Nishtala et al.’s remote markers [21] track keys that are known to be stale, forwarding reads for those keys to the master region. By restricting the data model to objects and associations we can update the replica’s cache at write time, then use graph semantics to interpret cache maintenance messages from concurrent updates. This provides (in the absence of multiple failures) read-after-write consistency for all clients that share a cache, without requiring inter-regional communication.

asynchronous master/ slave replication for MySQL - caches in data centers using a replica - 

Writes are forwarded to the master, but some time ... 

update replica's cache at write time, then ...

read-after-write consistency - 


2.2 TAO’s Goal TAO provides basic access to the nodes and edges of a constantly changing graph in data centers across multiple regions. It is optimized heavily for reads, and explicitly favors efficiency and availability over consistency. A system like TAO is likely to be useful for any application domain that needs to efficiently generate fine grained customized content from highly interconnected data. The application should not expect the data to be stale in the common case, but should be able to tolerate it. Many social networks fit in this category


3 TAO Data Model and API 

Facebook focuses on people, actions, and relationships. We model these entities and connections as nodes and edges in a graph. This representation is very flexible; it directly models real-life objects, and can also be used to store an application’s internal implementation-specific data. TAO’s goal is not to support a complete set of graph queries, but to provide sufficient expressiveness to handle most application needs while allowing a scalable and efficient implementation. 


Consider the social networking example in Figure 1a, in which Alice used her mobile phone to record her visit to a famous landmark with Bob. She ‘checked in’ to the Golden Gate Bridge and ‘tagged’ Bob to indicate that he is with her. Cathy added a comment that David has ‘liked.’ The social graph includes the users (Alice, Bob, Cathy, and David), their relationships, their actions (checking in, commenting, and liking), and a physical location (the Golden Gate Bridge). Facebook’s application servers would query this event’s underlying nodes and edges every time it is rendered. Fine-grained privacy controls mean that each user may see a different view of the checkin: the individual nodes and edges that encode the activity can be reused for all of these views, but the aggregated content and the results of privacy checks cannot.

Social graph - users - Alice, Bob, Cathy, and David

their relationships, their actions - checking in, commenting, and liking

a physical location (the Golden Gate Bridge)

3.1 Objects and Associations

TAO objects are typed nodes, and TAO associations are typed directed edges between objects. Objects are identified by a 64-bit integer (id) that is unique across all objects, regardless of object type (otype). Associations are identified by the source object (id1), association type (atype) and destination object (id2). At most one association of a given type can exist between any two objects. Both objects and associations may contain data as key→value pairs. A per-type schema lists the possible keys, the value type, and a default value. Each association has a 32-bit time field, which plays a central role in queries1. 

Object: (id) → (otype, (key  value)∗) 

Assoc.: (id1, atype, id2) → (time, (key  value)∗) 

Figure 1b shows how TAO objects and associations might encode the example, with some data and times omitted for clarity. The example’s users are represented by objects, as are the checkin, the landmark, and Cathy’s comment. Associations capture the users’ friendships, authorship of the checkin and comment, and the binding between the checkin and its location and comments.

Actions may be encoded either as objects or associations. Both Cathy’s comment and David’s ‘like’ represent actions taken by a user, but only the comment results in a new object. Associations naturally model actions that can happen at most once or record state transitions, such as the acceptance of an event invitation, while repeatable actions are better represented as objects. 

Although associations are directed, it is common for an association to be tightly coupled with an inverse edge. In this example all of the associations have an inverse except for the link of type COMMENT. No inverse edge is required here since the application does not traverse from the comment to the CHECKIN object. Once the checkin’s id is known, rendering Figure 1a only requires traversing outbound associations. Discovering the checkin object, however, requires the inbound edges or that an id is stored in another Facebook system. 

The schemas for object and association types describe only the data contained in instances. They do not impose any restrictions on the edge types that can connect to a particular node type, or the node types that can terminate an edge type. The same atype is used to represent authorship of the checkin object and the comment object in Figure 1, for example. Self-edges are allowed.

3.2 Object API 

TAO’s object API provides operations to allocate a new object and id, and to retrieve, update, or delete the object associated with an id. A notable omission is a compare-and-set functionality, whose usefulness is substantially reduced by TAO’s eventual consistency semantics. The update operation can be applied to a subset of the fields. 

3.3 Association API 

Many edges in the social graph are bidirectional, either symmetrically like the example’s FRIEND relationship or asymmetrically like AUTHORED and AUTHORED BY. Bidirectional edges are modeled as two separate associations. TAO provides support for keeping associations in sync with their inverses, by allowing association types to be configured with an inverse type. For such associations, creations, updates, and deletions are automatically coupled with an operation on the inverse association. Symmetric bidirectional types are their own inverses. The association write operations are: 

• assoc add(id1, atype, id2, time, (k→v)*) – Adds or overwrites the association (id1, atype,id2), and its inverse (id1, inv(atype), id2) if defined. 

• assoc delete(id1, atype, id2) – Deletes the association (id1, atype, id2) and the inverse if it exists. 

• assoc change type(id1, atype, id2, newtype) – Changes the association (id1, atype, id2) to (id1, newtype, id2), if (id1, atype, id2) exists.


3.4 Association Query API

The starting point for any TAO association query is an originating object and an association type. This is the natural result of searching for a specific type of information about a particular object. Consider the example in Figure 1. In order to display the CHECKIN object, the application needs to enumerate all tagged users and the most recently added comments. 

A characteristic of the social graph is that most of the data is old, but many of the queries are for the newest subset. This creation-time locality arises whenever an application focuses on recent items. If the Alice in Figure 1 is a famous celebrity then there might be thousands of comments attached to her checkin, but only the most recent ones will be rendered by default. 


TAO’s association queries are organized around association lists. We define an association list to be the list of all associations with a particular id1 and atype, arranged in descending order by the time field: 

Association List: (id1, atype) → [anew ...aold] 

For example, the list (i, COMMENT) has edges to the example’s comments about i, most recent first. 

TAO’s queries on associations lists: 

• assoc get(id1, atype, id2set, high?, low?) – returns all of the associations (id1, atype, id2) and their time and data, where id2 ∈ id2set and high ≥ time ≥ low (if specified). The optional time bounds are to improve cacheability for large association lists (see § 5). 

• assoc count(id1, atype) – returns the size of the association list for (id1, atype), which is the number of edges of type atype that originate at id1. 

• assoc range(id1, atype, pos, limit) – returns elements of the (id1, atype) association list with index i ∈ [pos,pos+limit). 

• assoc time range(id1, atype, high, low, limit) – returns elements from the (id1, atype) association list, starting with the first association where time ≤ high, returning only edges where time ≥ low. 


TAO enforces a per-atype upper bound (typically 6,000) on the actual limit used for an association query. To enumerate the elements of a longer association list the client must issue multiple queries, using pos or high to specify a starting point. 

For the example shown in Figure 1 we can map some possible queries to the TAO API as follows: 

• “50 most recent comments on Alice’s checkin” ⇒ assoc range(632, COMMENT, 0, 50) • “How many checkins at the GG Bridge?” ⇒ assoc count(534, CHECKIN)

4 TAO Architecture 

In this section we describe the units that make up TAO, and the multiple layers of aggregation that allow it to scale across data centers and geographic regions. TAO is separated into two caching layers and a storage layer. 

4.1 Storage Layer 

Objects and associations were stored in MySQL at Facebook even before TAO was built; it was the backing store for the original PHP implementation of the API. This made it the natural choice for TAO’s persistent storage. 

The TAO API is mapped to a small set of simple SQL queries, but it could also be mapped efficiently to range scans in a non-SQL data storage system such as LevelDB [3] by explicitly maintaining the required indexes. When evaluating the suitability of a backing store for TAO, however, it is important to consider the data accesses that don’t use the API. These include backups, bulk import and deletion of data, bulk migrations from one data format to another, replica creation, asynchronous replication, consistency monitoring tools, and operational debugging. An alternate store would also have to provide atomic write transactions, efficient granular writes, and few latency outliers. 

Database - backups, bulk import and deletion of data, bulk migration from one data format to another, replica creation, asynchronous replication, consistency monitoring tools, and operational debugging. 



Given that TAO needs to handle a far larger volume of data than can be stored on a single MySQL server, we divide data into logical shards. Each shard is contained in a logical database. Database servers are responsible for one or more shards. In practice, the number of shards far exceeds the number of servers; we tune the shard to server mapping to balance load across different hosts. By default all object types are stored in one table, and all association types in another. 

Each object id contains an embedded shard id that identifies its hosting shard. Objects are bound to a shard for their entire lifetime. An association is stored on the shard of its id1, so that every association query can be served from a single server. Two ids are unlikely to map to the same server unless they were explicitly colocated at creation time.

Actionable Items

I finished reading at 11:23 PM. So it took me more than two hours. 

I do think that it is better for me to understand the paper; Best way is to listen the presentation again after paper reading. 



No comments:

Post a Comment