Aug. 17, 2021
I like to take some notes while I relax and learn a few things. The notes are linked here. I looked up the LADIS 2009 workshop website; the link to the slides is here.
Numbers everyone should know
- L1 cache reference 0.5 ns
- Branch mispredict 5 ns
- L2 cache reference 7 ns
- Mutex lock/unlock 25 ns
- Main memory reference 100 ns
- Compress 1K bytes with Zippy 3,000 ns
- Send 2K bytes over 1 Gbps network 20,000 ns
- Read 1 MB sequentially from memory 250,000 ns
- Round trip within same datacenter 500,000 ns
- Disk seek 10,000,000 ns
- Read 1 MB sequentially from disk 20,000,000 ns
- Send packet CA->Netherlands->CA 150,000,000 ns
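To make these numbers easier to plug into estimates, here is a minimal Python sketch that simply records the values above as constants; the two comparisons in the comments follow directly from the list.

```python
# The latency numbers above, in nanoseconds, as constants for quick estimates.
NS = {
    "l1_cache_ref": 0.5,
    "branch_mispredict": 5,
    "l2_cache_ref": 7,
    "mutex_lock_unlock": 25,
    "main_memory_ref": 100,
    "compress_1kb_zippy": 3_000,
    "send_2kb_1gbps": 20_000,
    "read_1mb_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
    "packet_ca_nl_ca": 150_000_000,
}

# One disk seek costs as much as ~100,000 main-memory references,
# and reading 1 MB from disk is ~80x slower than reading it from memory.
print(NS["disk_seek"] / NS["main_memory_ref"])      # 100000.0
print(NS["read_1mb_disk"] / NS["read_1mb_memory"])  # 80.0
```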
Designing efficient systems
Given a basic problem definition, how do you choose the "best" solution?
- Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
- without actually having to build it!
Architectural view of the storage hierarchy
One server: DRAM: 16 GB, 100 ns, 20 GB/s; Disk: 2 TB, 10 ms, 200 MB/s
Local rack (80 servers, behind a rack switch): DRAM: 1 TB, 300 us, 100 MB/s; Disk: 160 TB, 11 ms, 100 MB/s
Cluster (30+ racks): DRAM: 30 TB, 500 us, 10 MB/s; Disk: 4.80 PB, 12 ms, 10 MB/s
(Each entry lists capacity, access latency, and bandwidth.)
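One way to read this hierarchy is to estimate how long it takes to fetch 1 MB at each level: one access latency plus 1 MB divided by that level's bandwidth. A small Python sketch using the figures above (it ignores contention and queueing):

```python
# Rough time to fetch 1 MB at each level: latency + 1 MB / bandwidth.
levels = {
    # name: (latency in seconds, bandwidth in bytes/second)
    "local DRAM":   (100e-9, 20e9),
    "local disk":   (10e-3,  200e6),
    "rack DRAM":    (300e-6, 100e6),
    "rack disk":    (11e-3,  100e6),
    "cluster DRAM": (500e-6, 10e6),
    "cluster disk": (12e-3,  10e6),
}

for name, (latency, bandwidth) in levels.items():
    total_ms = (latency + 1e6 / bandwidth) * 1e3
    print(f"{name:13s} ~{total_ms:6.2f} ms to read 1 MB")
# Local DRAM comes out around 0.05 ms, cluster disk around 112 ms:
# roughly three orders of magnitude apart.
```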
Back of the envelope calculations
How long to generate image results page (30 thumbnails)?
Design 1: Read serially, thumbnail 256 KB images on the fly
30 seeks * 10 ms/seek + 30 * 256 KB / 30 MB/s = 560 ms
Design 2: Issue reads in parallel:
10 ms/seek + 256 KB read / 30 MB/s = 18 ms
(Ignores variance, so really more like 30-60 ms, probably)
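The same arithmetic as a small Python sketch; the only inputs are the figures assumed above (10 ms per seek, 30 MB/s sequential read, 30 images of 256 KB each).

```python
# Back-of-the-envelope estimate for the two thumbnail designs above.
SEEK_MS = 10           # one disk seek
READ_MBPS = 30         # sequential read rate
N_IMAGES = 30
IMAGE_MB = 256 / 1024  # 256 KB per image

# Design 1: read the 30 images one after another.
serial_ms = N_IMAGES * SEEK_MS + N_IMAGES * IMAGE_MB / READ_MBPS * 1000
print(f"serial:   {serial_ms:.0f} ms")    # ~550 ms (the slide rounds to 560 ms)

# Design 2: issue all reads in parallel; latency is one seek plus one read.
parallel_ms = SEEK_MS + IMAGE_MB / READ_MBPS * 1000
print(f"parallel: {parallel_ms:.0f} ms")  # ~18 ms, ignoring variance
```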
Lots of variations:
- caching (single images? whole sets of thumbnails?)
- pre-computing thumbnails
- ...
Know your basic building blocks
Core language libraries, basic data structures, protocol buffers, GFS, BigTable, indexing systems, MySQL, MapReduce, ...
Not just their interfaces, but understand their implementations (at least at a high level)
If you don't know what's going on, you can't do decent back-of-the-envelope calculations!
MapReduce
- A simple programming model that applies to many large-scale computing problems
- Hide messy details in MapReduce runtime library:
- automatic parallelization
- load balancing
- network and disk transfer optimizations
- handling of machine failures
- robustness
- improvements to core library benefit all users of library!
- Typical structure of a MapReduce computation:
- Read a lot of data
- Map: extract something you care about from each record
- Shuffle and Sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results
The outline stays the same; only map and reduce change to fit the problem.
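As a concrete illustration, here is word count written in that shape. This is a toy, single-process Python sketch of the model, not Google's actual MapReduce library; the map and reduce functions are the only problem-specific parts.

```python
from collections import defaultdict

def map_fn(record):
    # Map: extract what we care about from each record -> (word, 1) pairs.
    for word in record.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: aggregate all values seen for one key.
    return key, sum(values)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                  # read a lot of data
        for key, value in map_fn(record):   # map
            groups[key].append(value)       # shuffle: group by key
    return [reduce_fn(k, v) for k, v in sorted(groups.items())]  # sort + reduce

print(map_reduce(["to be or not to be", "to see or not to see"],
                 map_fn, reduce_fn))
# [('be', 2), ('not', 2), ('or', 2), ('see', 2), ('to', 4)]
```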
BigTable: Motivation
- Lots of (semi-) structured data at Google
- URLs:
- contents, crawl metadata, links, anchors, pagerank, ...
- Per-user data:
- User preference settings, recent queries/search results, ...
- Geographic locations:
- Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is large
- Billions of URLs, many versions per page (~20K/version)
- Hundreds of millions of users, thousands of q/sec
- 100TB+ of satellite image data
Basic data model
- Distributed multi-dimensional sparse map (row, column, timestamp) -> cell contents
Rows are ordered lexicographically
Good match for most of our applications
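To make the model concrete, here is a minimal Python sketch of a map keyed by (row, column, timestamp). The row and column names are made-up examples in the style of a web table; this is illustrative only, not Bigtable's actual API.

```python
# Illustrative sketch of the (row, column, timestamp) -> contents model.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # Return the cell contents with the highest timestamp, if any.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# Rows sort lexicographically, so pages of one site cluster together
# when keyed by reversed domain, e.g. "com.cnn.www".
put("com.cnn.www", "contents:", 1, "<html>... v1")
put("com.cnn.www", "contents:", 2, "<html>... v2")
put("com.cnn.www", "anchor:cnnsi.com", 2, "CNN")
print(get_latest("com.cnn.www", "contents:"))  # "<html>... v2"
```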
BigTable status
- Design/initial implementation started beginning of 2004
- Production use or active development for 100+ projects:
- Google Print
- My Search History
- Orkut
- Crawling/indexing pipeline
- Google Maps/Google Earth
- Blogger
- ...
- Currently ~500 BigTable clusters
- Largest cluster:
- 70+ PB data; sustained: 10M ops/sec; 30+ GB/s I/O
Current work: Spanner
- Storage & computation system that spans all our datacenters
- single global namespace
- Names are independent of location(s) of data
- Similarities to Bigtable: tables, families, locality groups, coprocessors, ...
- Differences: hierarchical directories instead of rows, fine-grained replication
- Fine-grained ACLs, replication configuration at the per-directory level
- support mix of strong and weak consistency across datacenters
- strong consistency implemented with Paxos across tablet replicas
- Full support for distributed transactions across directories/machines
- much more automated operation
- system automatically moves and adds replicas of data and computation based on constraints and usage patterns
- automated allocation of resources across entire fleet of machines
Activities in world-wide systems
- Challenge: automatic, dynamic world-wide placement of data & computation to minimize latency and/or cost, given constraints on:
- bandwidth
- packet loss
- power
- resource usage
- failure modes
- ...
- Users specify high-level desires:
- "99%ile latency for accessing this data should be <50ms"
- "Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia"
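One way to picture such high-level desires is as a small declarative policy object that the system then satisfies subject to the constraints listed above. The sketch below is purely hypothetical; the field names are invented for illustration and are not Spanner's actual configuration format.

```python
from dataclasses import dataclass, field

@dataclass
class PlacementPolicy:
    # Hypothetical, illustrative fields -- not a real Spanner config.
    latency_slo_ms: int                    # e.g. 99%ile access latency target
    min_replicas: dict = field(default_factory=dict)  # region -> replica count

policy = PlacementPolicy(
    latency_slo_ms=50,
    min_replicas={"EU": 2, "US": 2, "Asia": 1},
)
# The system, not the user, would then pick concrete placements that meet
# these constraints given bandwidth, power, failure modes, and usage patterns.
print(policy)
```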
Building applications on top of weakly consistent storage systems
- Many applications need state replicated across a wide area
- For reliability and availability
- Two main choices:
- consistent operations (e.g. use Paxos)
- often imposes additional latency for the common case
- inconsistent operations
- better performance/availability, but apps harder to write and reason about in this model
- Many apps need to use a mix of both of these:
- e.g. Gmail: marking a message as read is asynchronous, sending a message is a heavier-weight consistent operation
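A hedged sketch of how an application might pick a consistency level per operation, in the spirit of the Gmail example above. The API is invented here for illustration and is not a real Google storage interface.

```python
from enum import Enum

class Consistency(Enum):
    STRONG = "strong"      # e.g. agreed across replicas via Paxos
    EVENTUAL = "eventual"  # applied locally, replicated asynchronously

def write(key, value, consistency):
    if consistency is Consistency.STRONG:
        # Block until a quorum of replicas has accepted the write.
        print(f"strong write   {key!r} (waits on replica quorum)")
    else:
        # Acknowledge immediately; propagate in the background.
        print(f"eventual write {key!r} (async replication)")

# Marking a message as read tolerates staleness; sending mail does not.
write("msg/123/read", True, Consistency.EVENTUAL)
write("outbox/456", "message body", Consistency.STRONG)
```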
Building applications on top of weakly consistent storage systems (continued)
- Challenge: General model of consistency choices, explained and codified
- ideally would have one or more "knobs" controlling performance vs. consistency
- "knob" would provide easy-to-understand tradeoffs
- Challenge: Easy-to-use abstractions for resolving conflicting updates to multiple versions of a piece of state
- Useful for reconciling client state with servers after disconnected operation
- Also useful for reconciling replicated state in different data centers after repairing a network partition
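As a toy example of such an abstraction, the sketch below reconciles two divergent versions of per-key state using a last-writer-wins merge. The merge rule is just one simple choice made here for illustration, not a mechanism prescribed by the talk.

```python
def merge(version_a, version_b):
    """Reconcile two divergent replicas; each maps key -> (timestamp, value).
    Last-writer-wins per key: keep the entry with the larger timestamp."""
    merged = dict(version_a)
    for key, (ts, value) in version_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# E.g. two datacenters diverged while partitioned, then reconnect:
replica_us = {"labels/msg1": (10, ["inbox"]), "read/msg1": (12, True)}
replica_eu = {"labels/msg1": (11, ["inbox", "work"])}
print(merge(replica_us, replica_eu))
# {'labels/msg1': (11, ['inbox', 'work']), 'read/msg1': (12, True)}
```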
Further reading
- The Google File System. SOSP 2003.
- Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
- MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
- Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
- The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
- Failure Trends in a Large Disk Drive Population. FAST 2007.
- Large Language Models in Machine Translation. EMNLP 2007.
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. 2009.
- Pregel: A System for Large-Scale Graph Processing. PODC 2009.
- DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.
- Protocol Buffers. http://code.google.com/p/protobuf/