Here is the link.
Normally simple tasks like running a program or storing and retrieving data become much more complicated when we start to do them on collections of computers, rather than single machines. Distributed systems has become a key architectural concern, and affects everything a program would normally do—giving us enormous power, but at the cost of increased complexity as well.
Using a series of examples all set in a coffee shop, we’ll explore topics like distributed storage, computation, timing, messaging, and consensus. You'll leave with a good grasp of each of these problems, and a solid understanding of the ecosystem of open-source tools in the space.
Three characteristic
The computers operate concurrently
The computers fail independently
The computers do not share a global clock
Three topics
Storage
Computation
Messaging
Single-master storage
More read than write
Read replication
Distributed database, replicate two databases - use space buy time, short time to read
Eventually consistent database
Master, lead server
10:22
Sharding
Break data model - cannot join cross shards -
How to take joins away? Denormalize, read is slow. To add Index is not option.
Consistent hashing - technique used
Cassandra database -
Replication -> two copies - distributed system, computer fails - I solved the problem, be up, tolerant of hardware failure
Consistency - three copies of data.
A rule -
Consistency
R+W> N
N - number of replicas
Strongly consistent database
19:40/48:59
CAP Theorem
Simple way to understand - distributed database -
Consistency
Available - read works, write works
Partition tolerance
Shared writing project
Coffee shop closes
Synchronizing over the phone
Battery dies
Status report
Choose unavailable - Give you an answer, but it may be wrong
Give up consistency
Sometimes we can not have three
Distributed computation
One process -
MapReduce
Map
Counting words,
Shuffle
similar words near each other
Reduce - add up the number,
300 computers, replicated all
Hadoop
MapReduce API
MapReduce job management
Distributed Filesystem (HDFS)
Enormous ecosystem
Spark
Scatter/gather paradigm (similar to MapReduce)
More general data model (RDDs, DataSets)
More general programming model(transform/action)
Storage agnostic
Kakfa
Everything is stream
Focuses on real-time analysis, not batch jobs
Streams and streams only
Except streams are also tables (sometimes)
No cluster required
Messaging
Means of loosely coupling subsystems
Messages consumed by subscribers
Created by one or more producers
Organized into topics
Processed by brokers
Usually persistent over the short term
Messaging problems
What if a topic gets too big for one computer?
What if one computer is not reliable
Apache Kafka
Definitions
Message: an immutable array of bytes
Topic: a feed of messages
Producer
Consumer
Broker
Message queue
Topic partitioning
0
1
2
3 brokers - one partition
Kafka (Interesting Version)
So you'v got a message bus now
Yesterday you turned events into rows-in-place
Tomorrow maybe you'll just compute events
without streams
No comments:
Post a Comment