Saturday, August 15, 2020

Tech titan Barry Diller discusses the future of tech, media, and the internet economy

 Here is the link. 

Barry Diller, Chairman & Senior Executive, IAC & Expedia, sat down with CNBC Reporter Julia Boorstin to talk the future of tech, media, and net neutrality at Internet Association’s 2017 Virtuous Circle Conference.

Friday, August 14, 2020

Barry Diller: Guard against arrogance by forgetting your success

 Here is the link. 

— "I remember thinking, I see the people around me getting really arrogant. And I learned how awful a poison arrogance is. And so one of the ways you guard against arrogance is to forget success. So I just always tried to brush it clean. Sometimes successfully." — •Yahoo Finance Editor-in-Chief Andy Serwer sits down with Barry Diller, IAC chairman and senior executive. Diller offers his take on the different media mergers and the evolution of the industry, while praising Netflix. He calls for more tech regulation, but suggests that the public has overreacted over Facebook. The media mogul looks back on his time with Fox and praises Rupert Murdoch for his vision and courage. Diller slams President Donald Trump, calling his win of the 2016 election, an "accident."

What Barry Diller Knows About Travel

 Here is the link. 

WSJ's Scott McCartney joins Tanya Rivero to discuss his interview with IAC and Expedia chairman Barry Diller about where travel is headed next. Photo: Getty

Barry Diller On President Donald Trump: Hopefully Will Be Over Soon | CNBC

 Here is the link. 

Barry Diller on Finding Good Leaders

 Here is the link. 

We are honored to have Barry Diller, chairman of IAC and Expedia, and Bobby Kotick, CEO of Activision Blizzard, among our small group of luminary LPs whose financial capital and engagement power the next wave of Village Global founders. They joined us in Beverly Hills, California, for a discussion about leadership, lessons learned from their entrepreneurial journeys, and the future of media, gaming and entertainment.

MGM Resorts shares rise after $1 billion investment from Barry Diller’s IAC

 Here is the article. 

What I learned from the billionaire investor: 

  1. Barry Diller’s IAC/InterActive said on Monday it has bought a 12% stake in MGM Resorts International for about $1 billion, sending the casino operator’s shares soaring 14%.
  2. MGM: revenue fell 91% in the latest quarter, and the stock has sunk more than 35% this year.
  3. Online gaming is part of a $450 billion global industry.
  4. Share of IAC - ? 

  • The investment comes at a time when the gambling industry has been ravaged by government restrictions on movement due to the Covid-19 pandemic, as well as fears about public gatherings.

    MGM reported a 91% fall in revenue in the latest reported quarter and has slashed its dividend to weather the impact of the health crisis on its financials. Shares of the company have sunk more than 35% this year.
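A quick sanity check on the numbers above. This is a back-of-the-envelope sketch using the article's rounded figures (not IAC's actual deal terms): paying about $1 billion for a 12% stake implies a valuation of roughly $8.3 billion for MGM's equity.

```python
# Back-of-the-envelope check of the article's figures (all approximate).
stake_cost = 1_000_000_000   # IAC paid about $1 billion...
stake_fraction = 0.12        # ...for a 12% stake

# Implied MGM equity value at the purchase price
implied_market_cap = stake_cost / stake_fraction
print(round(implied_market_cap / 1e9, 1))  # → 8.3 (billion dollars)
```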

MGM’s online gaming business, which currently constitutes a tiny portion of its revenue, was what initially attracted Diller, the billionaire told shareholders in a letter.

“MGM presented a ‘once in a decade’ opportunity for IAC to own a meaningful piece of a preeminent brand in a large category with great potential to move online.”

Diller added he has followed the online gaming space for a while, looking for an opportunity to enter the $450 billion global industry.

IAC has a history of building businesses and later splitting them into separate companies — travel site Expedia Group and ticket booking site Ticketmaster are some examples.

Shares of IAC, which recently completed separation of Match Group into a separate company, fell about 1% after the news.

Leetcode discuss: 76. Minimum Window Substring

 Here is discussion post. 

C# using extra map to help track sliding window

August 14, 2020
76. Minimum Window Substring
It is risky to try a new idea on this hard-level algorithm. I want the coding experience to feel routine instead of memorized from past practice. This time I chose to write the normal hashmap maintenance first, rather than starting, as in my past practice, with the inner while loop that keeps sliding the left pointer right. I used the idea of cloning a hashmap for the pattern string, but I forgot the inner while loop and the hidden case where it falls back to the outer one.

To make sure the code works and is bug-free, there is a list of things to check. Although the idea works fine, I have to learn how to avoid bugs in my first draft.

It is better to write a double loop for this sliding window, because the left pointer sometimes needs to slide more than once. For this use case, it is better to put a while loop inside an outer while loop with the condition index < length. That is more explicit than a hidden loop; otherwise I would need a visited-index hashset to make sure each right-pointer position is not processed more than once.

Hard level algorithm
This algorithm is hard level. It is better for me to write down my practice lessons first; then I will be better prepared to design a solution that considers all the edge cases.

I took the advice to solve all the easy-level algorithms first in 2018 and leave the hard-level ones for later. I do believe there are more opportunities to learn from easy-level algorithms. That way I was able to practice more than 200 easy- and medium-level algorithms first, and have solved nearly 500 algorithms in total.

public class Solution {
    /// <summary>
    /// August 14, 2020
    /// The idea is to write a simple solution, using extra space to make the calculation easy;
    /// make the problem a computer science data structure, and avoid calculation if I can.
    /// </summary>
    /// <param name="source">e.g. "acbd"</param>
    /// <param name="pattern">e.g. "abc"</param>
    /// <returns>the minimum window of source containing every char of pattern</returns>
    public string MinWindow(string source, string pattern)
    {
        if (source == null || source.Length == 0 || pattern == null || pattern.Length == 0)
            return string.Empty;

        // Using the sliding window technique, the time complexity is O(N).
        // Variable design: sliding window two pointers: left, index;
        // a hashmap containing each distinct pattern char and its required count;
        // a copy of that hashmap to track the current window's progress, giving
        // O(1) time to check if the window contains all distinct chars with the required counts.
        var patternMap = new Dictionary<char, int>();
        foreach (var item in pattern)
        {
            if (!patternMap.ContainsKey(item))
                patternMap.Add(item, 0);

            patternMap[item]++;
        }   // "abc" -> 'a'=1,'b'=1,'c'=1

        var copyPatternMap = new Dictionary<char, int>();
        foreach (var pair in patternMap)
        {
            copyPatternMap.Add(pair.Key, pair.Value);
        } // 'a'=1,'b'=1,'c'=1

        var length = source.Length;

        var left = 0;
        var index = 0;
        var minLength = length + 1;
        var startMin = -1;
        var map = new Dictionary<char, int>(); // char counts of the current window

        while (index < length)
        {
            var current = source[index]; // 'a'
            if (!map.ContainsKey(current))
                map.Add(current, 0);

            map[current]++; // 'a'=1

            if (copyPatternMap.ContainsKey(current))
            {
                copyPatternMap[current]--;
                if (copyPatternMap[current] == 0)
                    copyPatternMap.Remove(current); // this char's need is satisfied
            }

            // Empty copy map: the window has all distinct chars with the minimum count for each
            while (copyPatternMap.Count == 0)
            {
                var currentWindow = index - left + 1;
                if (currentWindow < minLength)
                {
                    minLength = currentWindow;
                    startMin  = left;
                }

                // move left pointer by one
                var leftChar = source[left];

                // remove from map
                map[leftChar]--;
                var leftCount = map[leftChar];
                if (leftCount == 0)
                    map.Remove(leftChar);

                // determine if leftChar should be added back to copyPatternMap
                var needToAdd = patternMap.ContainsKey(leftChar) &&
                                leftCount < patternMap[leftChar];
                if (needToAdd)
                    copyPatternMap.Add(leftChar, patternMap[leftChar] - leftCount);

                left++;
            }

            index++;
        }

        return minLength > length ? string.Empty : source.Substring(startMin, minLength);
    }
}

Can I be analyst of airline industry?

 August 14, 2020


It is the story of a beginner investor. I wanted to speculate on the low point around second-quarter earnings day for the Canadian airline stock, so I wondered if I should put all of my $70,000 in cash into it and capture the roughly 10% rebound that followed the 5% loss on earnings day. What I would rather do is look into how I can become an analyst of the airline industry during this coronavirus shutdown of the US/Canada border: how to track the daily volume of airline tickets sold, and so on. I want to go back to the basics instead of speculating. 

Being an analyst of airline industry

I am trying to figure out the basic things I need to work on as an analyst. I should be good at reading financial reports, and many other skills will help. 

Cisco, Exxon Mobil share losses contribute to Dow's 150-point drop

 Here is the link. 

Shares of Cisco and Exxon Mobil are trading lower Thursday afternoon, sending the Dow Jones Industrial Average into negative territory. Shares of Cisco CSCO, -0.43% and Exxon Mobil XOM, -0.14% are contributing to the blue-chip gauge's intraday decline, as the Dow DJIA, 0.15% was most recently trading 158 points, or 0.6%, lower. Cisco's shares are down $5.60, or 11.6%, while those of Exxon Mobil are off $1.02, or 2.3%, combining for an approximately 45-point drag on the Dow. Walgreens Boots WBA, 1.48%, Goldman Sachs GS, -0.14%, and American Express AXP, 0.36% are also contributing significantly to the decline. A $1 move in any of the benchmark's 30 components results in a 6.86-point swing.
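The last sentence of the quote can be checked directly. The Dow is price-weighted: index points moved = dollar change divided by the Dow divisor, and the article's 6.86 points per $1 implies a divisor of about 1/6.86 ≈ 0.146. A small sketch using the article's figures:

```python
# The Dow is price-weighted: index points = dollar change / divisor.
# Per the article, a $1 move in any component is ~6.86 index points.
points_per_dollar = 6.86          # implies a Dow divisor of ~1/6.86 ≈ 0.146

cisco_drop = 5.60                 # Cisco shares down $5.60
exxon_drop = 1.02                 # Exxon Mobil shares down $1.02

combined_drag = (cisco_drop + exxon_drop) * points_per_dollar
print(round(combined_drag))       # → 45, the "approximately 45-point drag"
```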

I grew up and being independent

 August 14, 2020


It is a tough situation in my personal life. I have to deal with a middle-age crisis, and I have not been able to be financially independent. Things do not always go my way as expected. 

Being independent

It is tough being single. I started to figure out what makes my life a struggle and keeps me poor. 

I tried to ask for a one-sixth share of the home value since my mom passed away. I paid for the whole unit back in 1999. I have siblings in China; they are a doctor, a high school teacher, and a college teacher. They have all benefited from the housing boom in China. I could not get anything; instead I was laughed at and cursed. No one was willing to step out and say what the law is and what we should follow. 

This is a serious issue in China, and also for me personally. I am doing personal finance research, taking risks to invest, and also staying frugal, living as a minimalist. 

I have to be independent, and also save my time for more responsible people. Sometimes people like to take advantage of others, and if I do not keep records, they will come back and I will be a victim a second time. 

Being single with no children is so challenging. I have to figure out how to get back on my feet. I did a quick search on "who is my family as a single person?". 

Wednesday, August 12, 2020

Two-phase commit protocol

 Here is the wiki article. 

Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

I am studying Google Spanner, a distributed database, so I first have to learn the basic concept of the two-phase commit protocol. 

In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving either process, network node, communication, etc. failures), and is thus widely used.[1][2][3] However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed to remedy an outcome. To accommodate recovery from failure (automatic in most cases) the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures. Many protocol variants exist that primarily differ in logging strategies and recovery mechanisms. Though usually intended to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to many possible failure scenarios to be considered and supported by the protocol.

In a "normal execution" of any single distributed transaction (i.e., when no failure occurs, which is typically the most frequent situation), the protocol consists of two phases:

  1. The commit-request phase (or voting phase), in which a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes": commit (if the transaction participant's local portion execution has ended properly), or "No": abort (if a problem has been detected with the local portion), and
  2. The commit phase, in which, based on voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the participants. The participants then follow with the needed actions (commit or abort) with their local transactional resources (also called recoverable resources; e.g., database data) and their respective portions in the transaction's other output (if applicable).

Note that the two-phase commit (2PC) protocol should not be confused with the two-phase locking (2PL) protocol, a concurrency control protocol.


The protocol works in the following manner: one node is a designated coordinator, which is the master site, and the rest of the nodes in the network are designated the participants. The protocol assumes that there is stable storage at each node with a write-ahead log, that no node crashes forever, that the data in the write-ahead log is never lost or corrupted in a crash, and that any two nodes can communicate with each other. The last assumption is not too restrictive, as network communication can typically be rerouted. The first two assumptions are much stronger; if a node is totally destroyed then data can be lost.

The protocol is initiated by the coordinator after the last step of the transaction has been reached. The participants then respond with an agreement message or an abort message depending on whether the transaction has been processed successfully at the participant.
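The two phases above can be sketched in a few lines. This is a toy, single-process model (the class and method names are mine): the write-ahead logging and timeout/recovery machinery the wiki text describes are omitted, leaving only the voting phase and the commit phase.

```python
# Toy model of two-phase commit: one coordinator, several participants.
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "working"

    def prepare(self):
        # Phase 1 (commit-request / voting): vote Yes only if the local
        # portion of the transaction has ended properly.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        # Phase 2 (commit): apply the coordinator's decision locally.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted Yes; otherwise abort everywhere.
    decision = all(votes)
    for p in participants:
        p.finish(decision)
    return decision


# All participants vote Yes -> the transaction commits.
ok = two_phase_commit([Participant("db1"), Participant("db2")])
# One participant votes No -> everyone aborts.
bad = two_phase_commit([Participant("db1"), Participant("db2", will_commit=False)])
```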

Cloud Spanner: TrueTime and external consistency

 Here is the link. 

How to find good content for system design?

 August 12, 2020


It is hard for me to push myself to develop a strong interest in learning system design. I have spent over three days studying Facebook TAO, and now I am also starting to watch a good, high-quality video about Spanner, the distributed database built by Google. So I would like to talk about how I found the video and the talk, and the research effort Google put into building Spanner. 

How to find good content for system design? 

Life is tough for me since I have to work full time and learn to invest in the stock market as a beginner. I also have to push myself to practice data structures and algorithms, and then work on system design in a rush. I want to find some good study material and get inspired to work hard on the learning process. 

The ultimate goal of learning system design is to develop a talent for understanding the challenges of what the world calls large distributed systems. I also want to work on the basics first, so that I have a better foundation to think for myself. 

System design: OSDI12 - Spanner: Google’s Globally-Distributed Database

 Here is the link for 30 minutes tech talk. 

Please watch the first 15 minutes of the video 10 times. Write down what I learn and what I should look into. 

Example: Social network

Research contributions of the Spanner design:

Sharded thousands of ways, replicated three ways, 10 thousand data centers 


Feature: Lock-free distributed read transaction

Property: External consistency of distributed transaction 

- First system at global scale

Implementation: integration of concurrency control, replication, and 2PC

- correctness and performance

Enabling technology: TrueTime

- interval-based global time

Facebook and memcached - Tech Talk

 Here is the link. 

The FB evolution of memcache 

I do believe that this presentation gave me so much encouragement to prepare system design. Move fast, and also make things work. 

Move from TCP to UDP, so that memory goes to cached data instead of to keeping connections open. 

mcproxy: further reduction in packet I/O and connections. 

Dynamically adjust the timeout and the number of keys we send out in UDP ......

reusing mcproxy

- mcproxy supports connection sharing from all processes on a server

- TCP for sets/ delete

2 millisecond cache call, instead of a 4 millisecond database call 

mcproxy maintain 

West coast 

The next Era: server optimizations

Storage memory optimization

System CPU time optimization 

Problem: Excessive system calls - multiple "write" per response

Threading and kernel optimization

System Design Interview - Distributed Cache

 Here is the link. 

Topics mentioned in the video:

- Functional (put, get) and non-functional (high scalability, high availability, high performance) requirements
- Least recently used (LRU) cache
- Dedicated cache cluster, co-located cache
- MOD hashing, consistent hashing
- Cache client
- Static vs dynamic cache server list configuration
- Leader-follower (master-slave) data replication
- Cache consistency, data expiration, local and remote cache, security, monitoring and logging
- Memcached, Redis

Inspired by interview questions from Adobe, Amazon, eBay, Google, and Yahoo.

System design: Building Scalable Caching Systems via McRouter - @Scale 2014 - Data

 Here is the link. 

Rajesh Nishtala, Engineer at Facebook and Ricky Ramirez, Engineer at Reddit Modern large scale web infrastructures rely heavily on distributed caching (e.g memcached) to process user requests. The problems that McRouter addresses are not specific to Facebook, but distributed caching systems in general. As a result, Instagram and Reddit have also adopted McRouter as the primary communication layer to their cache tiers.

Consistent hashing

Connection pooling

Server pools

Automatic failover

Cold cache warmup

Broadcast operations

Replicated data sets

Shadow production traffic

Composable routing 
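Consistent hashing, the first item in this list, can be sketched with a simple hash ring. This is illustrative only: McRouter's actual hash function and virtual-node counts differ, and the class name is mine.

```python
# A small consistent-hash ring: each server gets many virtual nodes on
# the ring, and a key maps to the first virtual node clockwise from it.
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                h = self._hash(f"{server}#{i}")
                self.ring.append((h, server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_server(self, key):
        # Walk clockwise to the first virtual node at or after the key,
        # wrapping around the ring if needed.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing(["mc1", "mc2", "mc3"])
server = ring.get_server("user:42")  # always the same server for this key
```

When a server is added or removed, only the keys whose nearest virtual node changed are remapped, which is why cache tiers prefer this over MOD hashing.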

Fan-out (software)

 Here is the link. 

In message-oriented middleware solutions, fan-out is a messaging pattern used to model an information exchange that implies the delivery (or spreading) of a message to one or multiple destinations possibly in parallel, and not halting the process that executes the messaging to wait for any response to that message.[1][2][3]

In software construction, the fan-out of a class or method is the number of other classes used by that class or the number of other methods called by that method.[4]

System design: Scaling Memcache at Facebook

 Here is the link. 

I spent nearly one hour reading the paper, but I think it is better for me to get some coaching from a talk. I quickly found two. 

This one should be a great one for me to learn how Facebook scales memcache. 

handling updates

Memcache needs to be invalidated after a DB write

Prefer deletes to sets

- idempotent

- demand-filled

It is up to the web application to specify which keys to invalidate after a database update

Problems with look-aside caching

stale sets - a stale value can be set into the cache and stay permanently, until it is deleted

Extend the memcache protocol with "leases"

- Return and attach a lease-id with every miss

- lease-id is invalidated inside the server on a delete

- Disallow a set if the lease-id is invalid at the server
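The three lease rules in these notes can be modeled in a toy cache. The class and method names below are mine, not memcached's actual wire protocol; the point is only to show how a lease stops a stale set from landing after a concurrent delete.

```python
# Toy lease mechanism: every miss hands out a lease id, a delete
# invalidates outstanding leases, and a set is accepted only while
# its lease id is still valid.
import itertools


class LeasingCache:
    def __init__(self):
        self.data = {}
        self.leases = {}            # key -> currently valid lease id
        self._ids = itertools.count(1)

    def get(self, key):
        if key in self.data:
            return self.data[key], None
        lease_id = next(self._ids)  # attach a lease id to every miss
        self.leases[key] = lease_id
        return None, lease_id

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)  # invalidate outstanding leases

    def set(self, key, value, lease_id):
        # Disallow the set if the lease is no longer valid: this is what
        # prevents a stale value from being written permanently.
        if self.leases.get(key) != lease_id:
            return False
        del self.leases[key]
        self.data[key] = value
        return True


cache = LeasingCache()
_, lease = cache.get("k")                        # miss: got a lease
cache.delete("k")                                # concurrent DB update invalidates it
stale_write = cache.set("k", "old-value", lease)  # rejected -> False
```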

Need even more read capacity

Incast congestion

memcache - web server 

Key Largo portfolio: August 12, 2020

 August 12, 2020


It is a busy day. I have to work on paper reading (the memcache distributed challenge), and I also purchased 100 shares of CLVS stock. 

Key Largo portfolio

I have to learn how to manage the portfolio. As a beginner, I am learning how to manage myself. I hold 100% equities, and I have put 50% of the stocks in airline, hotel, and other distressed names. 

AC stock: My lesson as a beginner

 August 12, 2020


As a beginner investing in the Canadian airline stock, it is important for me to recognize the timing: how to prepare for the purchase, and how to prepare for the time to sell. I would like to write down my lesson today. 

My last purchase

I purchased 300 shares of AC.TO stock on July 31, 2020, and sold all 300 shares on August 10, 2020 at a price of 16.3. My gain was $336.00. 

Since then, AC.TO has continued to go up, with 11.66% gains. 

I should think more carefully about my purchases and sales. I should purchase more, around $25,000 Canadian dollars' worth, and I need to wait until there is a 10% gain. 

Clovis (CLVS) Q2 Earnings Lag, Revenues Show Coronavirus Impact

 Here is the link. 

I decided to purchase 100 shares of CLVS stock. 

Scaling Memcache at Facebook

 Here is the paper. 

My learning goal is to understand the TAO data store, and also to review TCP/UDP networking. I want to prepare for system design, and it is so important for me to read one or two papers and understand one system design, such as the graph database TAO. 

I need to start somewhere as a researcher; system design is not too tough for me to build a strong interest in. I should not treat it as a stepping stone to get a job at Facebook. I should treat it as my career, and prepare myself to be a better researcher, and also a better engineer. 

Compared to crafting coding skills, system design is also a very challenging and interesting area to work on. 

We structure our paper to emphasize the themes that emerge at three different deployment scales. Our read heavy workload and wide fan-out is the primary concern when we have one cluster of servers. As it becomes necessary to scale to multiple front-end clusters, we address data replication between these clusters. Finally, we describe mechanisms to provide a consistent user experience as we spread clusters around the world. Operational complexity and fault tolerance is important at all scales. We present salient data that supports our design decisions and refer the reader to work by Atikoglu et al. [8] for a more detailed analysis of our workload. At a high-level, Figure 2 illustrates this final architecture in which we organize co-located clusters into a region and designate a master region that provides a data stream to keep non-master regions up-to-date.


3.1 Reducing Latency

We provision hundreds of memcached servers in a cluster to reduce load on databases and other services. Items are distributed across the memcached servers through consistent hashing [22]. Thus web servers have to routinely communicate with many memcached servers to satisfy a user request. As a result, all web servers communicate with every memcached server in a short period of time. This all-to-all communication pattern can cause incast congestion [30] or allow a single server to become the bottleneck for many web servers. Data replication often alleviates the single-server bottleneck but leads to significant memory inefficiencies in the common case.

We reduce latency mainly by focusing on the memcache client, which runs on each web server. This client serves a range of functions, including serialization, compression, request routing, error handling, and request batching. Clients maintain a map of all available servers, which is updated through an auxiliary configuration system.

Parallel requests and batching: 

We structure our web application code to minimize the number of network round trips necessary to respond to page requests. We construct a directed acyclic graph (DAG) representing the dependencies between data. A web server uses this DAG to maximize the number of items that can be fetched concurrently. On average these batches consist of 24 keys per request.

Client-server communication: 

Memcached servers do not communicate with each other. When appropriate, we embed the complexity of the system into a stateless client rather than in the memcached servers. This greatly simplifies memcached and allows us to focus on making it highly performant for a more limited use case. Keeping the clients stateless enables rapid iteration in the software and simplifies our deployment process. Client logic is provided as two components: a library that can be embedded into applications or as a standalone proxy named mcrouter. This proxy presents a memcached server interface and routes the requests/replies to/from other servers.

A standalone proxy named mcrouter - note the name: the earlier tech talk called this proxy mcproxy.

Clients use UDP and TCP to communicate with memcached servers. We rely on UDP for get requests to reduce latency and overhead. Since UDP is connectionless, each thread in the web server is allowed to communicate with memcached servers directly, bypassing mcrouter, without establishing and maintaining a connection, thereby reducing the overhead. The UDP implementation detects packets that are dropped or received out of order (using sequence numbers) and treats them as errors on the client side. It does not provide any mechanism to try to recover from them. In our infrastructure, we find this decision to be practical. Under peak load, memcache clients observe that 0.25% of get requests are discarded. About 80% of these drops are due to late or dropped packets, while the remainder are due to out of order delivery. Clients treat get errors as cache misses, but web servers will skip inserting entries into memcached after querying for data to avoid putting additional load on a possibly overloaded network or server.

For reliability, clients perform set and delete operations over TCP through an instance of mcrouter running on the same machine as the web server. For operations where we need to confirm a state change (updates and deletes) TCP alleviates the need to add a retry mechanism to our UDP implementation.

Web servers rely on a high degree of parallelism and over-subscription to achieve high throughput. The high memory demands of open TCP connections makes it prohibitively expensive to have an open connection between every web thread and memcached server without some form of connection coalescing via mcrouter. Coalescing these connections improves the efficiency of the server by reducing the network, CPU and memory resources needed by high throughput TCP connections. Figure 3 shows the average, median, and 95th percentile latencies of web servers in production getting keys over UDP and through mcrouter via TCP. In all cases, the standard deviation from these averages was less than 1%. As the data show, relying on UDP can lead to a 20% reduction in latency to serve requests.


Incast congestion: Memcache clients implement flow control mechanisms to limit incast congestion. When a client requests a large number of keys, the responses can overwhelm components such as rack and cluster switches if those responses arrive all at once. Clients therefore use a sliding window mechanism [11] to control the number of outstanding requests. When the client receives a response, the next request can be sent. Similar to TCP’s congestion control, the size of this sliding window grows slowly upon a successful request and shrinks when a request goes unanswered. The window applies to all memcache requests independently of destination; whereas TCP windows apply only to a single stream.
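The client-side sliding window this paragraph describes might look roughly like the sketch below. The class name, constants, and growth/shrink rules are illustrative (my own choices in the spirit of the paper's description), not the actual memcache client code.

```python
# Client-side flow control against incast congestion: cap outstanding
# requests at the current window; grow the window slowly on success and
# shrink it when a request goes unanswered, like TCP congestion control.
class FlowController:
    def __init__(self, initial=2, maximum=64):
        self.window = initial
        self.maximum = maximum
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.window

    def on_send(self):
        self.outstanding += 1

    def on_response(self):
        self.outstanding -= 1
        # grow slowly upon a successful request
        self.window = min(self.window + 1, self.maximum)

    def on_timeout(self):
        self.outstanding -= 1
        # shrink when a request goes unanswered
        self.window = max(self.window // 2, 1)
```

Unlike a TCP window, which applies to a single stream, this window applies to all of the client's memcache requests regardless of destination.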


3.2.3 Replication Within Pools 

Within some pools, we use replication to improve the latency and efficiency of memcached servers. We choose to replicate a category of keys within a pool when (1) the application routinely fetches many keys simultaneously, (2) the entire data set fits in one or two memcached servers and (3) the request rate is much higher than what a single server can manage. We favor replication in this instance over further dividing the key space. Consider a memcached server holding 100 items and capable of responding to 500k requests per second. Each request asks for 100 keys. The difference in memcached overhead for retrieving 100 keys per request instead of 1 key is small. To scale the system to process 1M requests/sec, suppose that we add a second server and split the key space equally between the two. Clients now need to split each request for 100 keys into two parallel requests for ∼50 keys. Consequently, both servers still have to process 1M requests per second. However, if we replicate all 100 keys to multiple servers, a client’s request for 100 keys can be sent to any replica. This reduces the load per server to 500k requests per second. Each client chooses replicas based on its own IP address. This approach requires delivering invalidations to all replicas to maintain consistency.

Why Jumia Technologies Stock Plunged Today

 Here is the article. 

I plan to read the article when I take a five-minute break. 

TAO — Facebook’s Distributed database for Social Graph

 Here is the article. 

I think that the article is very helpful for me to understand tech talk TAO. Compared to reading the paper, it is better for me to read the writing from the author. 

It is my disadvantage since I do not put my hands on large distributed system and I need to rely on different resources to help me to advance my interest. 

The content below is from the article. It is well written and a pleasant read. 

I will be covering the architecture and key design principles outlined in the paper that came out of Facebook on graph databases. This is an attempt to summarize the architecture of a highly scalable graph database that can support objects and their associations, for a read-heavy workload consisting of billions of transactions per second. In Facebook's case, reads comprise more than 99% of the requests and writes are less than one percent.


Facebook has billions of users, and most of them consume content far more often than they create it, so the workload is obviously read heavy. Facebook initially implemented a distributed lookaside cache using memcached, which this paper references a lot. In this setup, the lookaside cache serves all the reads, while writes go to the database. A good cache-hit rate ensures good performance and avoids overloading the database. The following figure shows how a memcached-based lookaside cache is used at Facebook for optimizing reads.
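The lookaside (cache-aside) pattern itself is simple enough to sketch. Below, a dict stands in for memcached and another for the database; the names are my own illustration, not Facebook's code.

```python
# Minimal cache-aside sketch: reads check the cache first and backfill it
# on a miss; writes go to the database and invalidate the cached copy.
cache = {}
database = {"user:1": "alice"}

def read(key):
    if key in cache:                 # cache hit: serve from the cache
        return cache[key]
    value = database.get(key)        # cache miss: go to the database
    cache[key] = value               # populate the cache for next time
    return value

def write(key, value):
    database[key] = value            # writes go to the database...
    cache.pop(key, None)             # ...and invalidate the stale entry
```

Invalidating rather than updating the cache on writes is the usual choice here, since the next read repopulates the entry with fresh data.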

Microsoft: The Obvious Acquisition Is Dropbox, Not TikTok

Here is the article. 

From a business perspective, Dropbox as a cloud storage platform may have been redundant for Microsoft, given its existing OneDrive offering. Over the past few quarters, however, Dropbox has expanded deeper into digital collaboration, having introduced features like Spaces (content management within teams), electronic signing, and a vault for secure document storage, among other features. In other words, Dropbox now brings more to the table than it did only a couple of years ago.

But maybe even more important, an acquisition of Dropbox would bring 600 million registered users to Microsoft - granted, some of whom probably overlap with Microsoft's existing user base. Because the tech giant might have quite a bit of cross-selling opportunities to explore, it may not take much for Microsoft to justify a bid for a user base of this size. For example, merely producing $1 extra in net earnings per year from each "potential high value target" user that Dropbox does not currently monetize could unlock more value than DBX is worth in the market today (here, I am simply applying an earnings multiple of 25x to $335 million in annual net income).
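The parenthetical math is easy to check as a back-of-the-envelope calculation. The figures are the article's; reading $335 million as $1 from each of 335 million unmonetized users is my interpretation.

```python
# Quick check of the article's valuation sketch.
unmonetized_users = 335_000_000        # implied "potential high value target" users
extra_net_income = unmonetized_users * 1   # $1 of extra net earnings per user/year
earnings_multiple = 25
implied_value = extra_net_income * earnings_multiple
print(implied_value)  # 8375000000, i.e. roughly $8.4B
```

That is in the ballpark of Dropbox's market capitalization at the time, which is the author's point.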


Cassandra: Distributed key-value store optimized for write-heavy workloads

 Here is the article. 

I'd like to quickly go over the content written by Ame, and then write down my highlights here:

  1. Cassandra is a popular distributed key-value store, built initially at Facebook on commodity servers to let users search through their inbox messages.
  2. TAO, which I covered here, was optimized for reads.
  3. Cassandra is optimized for write-heavy workloads while maintaining good read performance.
  4. Google's BigTable vs. Cassandra - what do I remember?
  5. Data model - a multi-dimensional key-value map.

Cassandra is a popular distributed key-value store, built initially at Facebook on commodity servers to let users search through their inbox messages. While TAO, which I covered here, was optimized for reads, Cassandra is optimized for write-heavy workloads while maintaining good read performance. Some of the key design objectives of Cassandra seem to be:

  1. While writes are on the order of billions and are optimized using append-only logs, reads can be very geographically diverse. Hence cross-data-center replication is necessary and important for good read efficiency.
  2. Scaling the platform by adding commodity servers as more users join.
  3. Fault tolerance should be the norm.
  4. Giving clients of the database a simple interface and more control over the type of schema, availability, performance, etc.

Google’s BigTable serves a somewhat similar overall purpose. The key difference is that BigTable was built on top of GFS, which provides durability of data, whereas durability in Cassandra is achieved via explicit replication mechanisms.

Data Model

Cassandra, like BigTable, provides a fairly simple data model. Keys are small, tens of bytes in size, and in some structured format of the client’s liking. Each key maps to sets of columns called column families. In this way, the Cassandra data model can be viewed as a multi-dimensional key-value map.

In Cassandra, any row mutation is atomic: when updating a key and its corresponding columns/column families, either all of the data is written or none of it is. Columns can be sorted by time, e.g. the latest messages can appear at the top.

The simple (and fairly self-explanatory) API for accessing Cassandra is:

insert(table, key, rowMutation)

get(table, key, columnName)

delete(table, key, columnName)
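To make the data model concrete, here is a toy in-memory sketch of that API. This is purely my own illustration of the "multi-dimensional key-value map" idea, not Cassandra's implementation; single-column names stand in for the columns/column-families structure.

```python
# Toy multi-dimensional map: table -> key -> column name -> value.
# A row mutation is a dict of columns applied to the row in one update.
tables = {}

def insert(table, key, row_mutation):
    row = tables.setdefault(table, {}).setdefault(key, {})
    row.update(row_mutation)   # all columns in the mutation land together

def get(table, key, column_name):
    return tables.get(table, {}).get(key, {}).get(column_name)

def delete(table, key, column_name):
    tables.get(table, {}).get(key, {}).pop(column_name, None)
```

For example, `insert("inbox", "user1", {"msg:1": "hi", "msg:2": "hello"})` followed by `get("inbox", "user1", "msg:2")` returns `"hello"`.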

Request Flow for a Key

In Cassandra, for both reads and writes, an end-user request can land on any node. That node knows which replicas “own” the key’s range and routes the request to those replicas.

In the case of a write operation, replication is important for durability of the data. Hence the node where the request initially landed waits to establish a quorum among the replicas’ responses.

For reads, this requirement can be relaxed depending on the needs of the application. A client app that is not too worried about strong consistency can be answered as soon as the first replica responds; otherwise, the node waits to establish a quorum.


Any large distributed storage system needs to worry about persistence/durability of data, fault tolerance, partitioning of the data, addition and removal of nodes, and scaling. Let’s talk through some of the important aspects. One key fact to keep in mind is that all nodes in Cassandra are aware of each other via gossip-based protocols.

Partitioning — Mapping keys to Nodes/Servers

Cassandra needs to be able to scale by adding more servers and also needs to adjust to node failures without compromising performance. Hence Cassandra uses consistent hashing for mapping keys to servers/nodes. The advantage of consistent hashing is that when a node is added or removed from the system, the existing key-to-server mappings are mostly NOT impacted, unlike in traditional hashing methods. Only the neighboring nodes are affected, and keys are redistributed among those neighbors.

In this scheme, each node is assigned some keyrange. Assume that the entire keyrange can be mapped onto the circumference of a circle. When a key arrives, it gets mapped to some point on the circumference, and the algorithm walks clockwise until it finds the next node mapped onto the circle. That node is now the coordinator for this key. If the node goes down for some reason, the next clockwise node on the circumference becomes the coordinator for the related keyrange. Since different nodes on the circle will carry different load, Cassandra allows lightly loaded nodes to move closer to heavily loaded ones.
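The clockwise-walk scheme described above can be sketched with a sorted list of hash points. This is a minimal illustration under my own assumptions (MD5 as the hash function, one point per node); real Cassandra deployments add refinements such as moving nodes for load balancing.

```python
import bisect
import hashlib

# Nodes and keys hash onto the same circle; a key is owned by the first
# node found walking clockwise from the key's point.
def _point(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((_point(n), n) for n in nodes)

    def coordinator(self, key):
        i = bisect.bisect(self.points, (_point(key),))
        return self.points[i % len(self.points)][1]  # wrap around the circle

    def remove(self, node):
        # Only keys that mapped to `node` move, to its clockwise neighbor;
        # every other key keeps its coordinator.
        self.points = [(p, n) for p, n in self.points if n != node]
```

Removing a node only reassigns that node's keyrange, which is exactly the property that distinguishes consistent hashing from a plain `hash(key) % n_servers` scheme.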