Sunday, August 11, 2019

Distributed Systems in One Lesson by Tim Berglund

Here is the link.


Normally simple tasks like running a program or storing and retrieving data become much more complicated when we start to do them on collections of computers rather than single machines. Distributed systems have become a key architectural concern and affect everything a program would normally do, giving us enormous power at the cost of increased complexity. Using a series of examples all set in a coffee shop, we'll explore topics like distributed storage, computation, timing, messaging, and consensus. You'll leave with a good grasp of each of these problems, and a solid understanding of the ecosystem of open-source tools in the space.

Three characteristics

The computers operate concurrently
The computers fail independently
The computers do not share a global clock

Three topics

Storage
Computation
Messaging

Single-master storage

More reads than writes

Read replication

Distributed database: replicate the data to follower databases - use space to buy time, so reads take less time
Eventually consistent database: a follower may briefly return stale data

Master, lead server
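
To make the read replication notes above concrete, here is a minimal in-memory sketch. The `Leader` and `Replica` classes and the explicit `replicate()` step are hypothetical names for illustration only; a real database ships changes to followers asynchronously.

```python
class Replica:
    """A read-only follower that holds a copy of the leader's data."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Leader:
    """The single master: every write goes here first."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers
        self.pending = []  # changes not yet shipped to the followers

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Stand-in for asynchronous replication, which lags behind writes.
        for key, value in self.pending:
            for follower in self.followers:
                follower.data[key] = value
        self.pending.clear()


followers = [Replica(), Replica()]
leader = Leader(followers)

leader.write("order:1", "latte")
stale = followers[0].read("order:1")  # read before replication catches up
leader.replicate()
fresh = followers[0].read("order:1")  # read after replication
print(stale, fresh)
```

The read that returns nothing before `replicate()` runs is the "eventually consistent" part: the followers converge on the leader's value, but not instantly.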

10:22

Sharding

Breaks the data model - cannot join across shards

How to take the joins away? Denormalize. Reads are slow, and adding an index is not an option.

Consistent hashing - technique used
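
A minimal sketch of consistent hashing, assuming MD5 as the hash function and 100 virtual tokens per node. Both are arbitrary choices for illustration, and the `HashRing` class is hypothetical, not how any particular database implements it. Each key belongs to the first node token found clockwise from the key's own hash.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns several tokens so that load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (token(f"{node}#{i}"), node))

    def node_for(self, key):
        # First token at or after the key's position, wrapping around the ring.
        idx = bisect.bisect_left(self.ring, (token(key),))
        return self.ring[idx % len(self.ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a node")
```

The point of the technique is visible in `moved`: adding a node relocates only the keys that the new node takes over, instead of reshuffling everything as `hash(key) % number_of_nodes` would.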

Cassandra database -

Replication -> two copies. In a distributed system a computer fails; with a second copy the problem is solved: stay up, tolerant of hardware failure

Consistency - three copies of data.

A rule -

Consistency

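R + W > N

N - number of replicas
R - number of replicas that must answer a read
W - number of replicas that must acknowledge a write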

Strongly consistent database
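
A small sketch of why the rule works, assuming each replica stores a (version, value) pair and a read returns the highest version it sees. The function name is hypothetical. It models the worst case, where the write lands on the first W replicas and the read consults the last R.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """Worst case: the write hits the first w replicas, the read asks the last r."""
    replicas = [(0, "old")] * n  # (version, value) stored on each replica
    for i in range(w):
        replicas[i] = (1, "new")
    consulted = replicas[n - r:]
    version, value = max(consulted)  # the highest version wins
    return value == "new"


for r, w in [(1, 1), (2, 2), (1, 3), (3, 1)]:
    print(f"N=3 R={r} W={w} strongly consistent: {read_sees_latest_write(3, r, w)}")
```

When R + W > N the read set and the write set must share at least one replica, so the read always sees the newest version. With R = 1 and W = 1 on three replicas they can miss each other, and the read may return old data.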

19:40/48:59
CAP Theorem

Simple way to understand - distributed database -
Consistency
Availability - reads work, writes work
Partition tolerance

Shared writing project
Coffee shop closes
Synchronizing over the phone
Battery dies
Status report

Choose to be unavailable, or give you an answer that may be wrong (give up consistency)
Sometimes we cannot have all three

Distributed computation

One process -

MapReduce

Map

Counting words

Shuffle

Same words moved near each other

Reduce - add up the numbers

300 computers, replicated all
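
The three word-count steps above can be sketched on a single machine as follows. The function names and sample lines are made up for illustration; a real MapReduce job runs each phase in parallel across many computers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: bring all pairs for the same word together."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: add up the numbers for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["one latte one mocha", "one espresso", "mocha latte latte"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Map and reduce are the parts a programmer writes. The shuffle in between is what the framework does for you, and it is where data moves across the network.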

Hadoop

MapReduce API
MapReduce job management
Distributed Filesystem (HDFS)
Enormous ecosystem

Spark

Scatter/gather paradigm (similar to MapReduce)
More general data model (RDDs, Datasets)
More general programming model (transform/action)
Storage agnostic

Kafka


Everything is a stream
Focuses on real-time analysis, not batch jobs
Streams and streams only
Except streams are also tables (sometimes)
No cluster required
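
One way to read "streams are also tables": replaying a stream of keyed updates and keeping only the latest value per key yields a table. The sketch below uses plain Python, not the Kafka Streams API, and the function name and sample events are hypothetical.

```python
def table_from_stream(events):
    """Fold a stream of (key, value) updates into a latest-value-per-key table."""
    table = {}
    for key, value in events:
        table[key] = value  # a later event for the same key replaces the earlier one
    return table


stream = [("alice", "latte"), ("bob", "mocha"), ("alice", "espresso")]
table = table_from_stream(stream)
print(table)
```

The stream keeps every change in order, while the table is a snapshot of where those changes currently stand.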

Messaging

Means of loosely coupling subsystems
Messages consumed by subscribers
Created by one or more producers
Organized into topics
Processed by brokers
Usually persistent over the short term

Messaging problems

What if a topic gets too big for one computer?
What if one computer is not reliable?


Apache Kafka

Definitions
Message: an immutable array of bytes
Topic: a feed of messages
Producer
Consumer
Broker

Message queue

Topic partitioning

0
1
2

3 brokers - one partition
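
A rough sketch of how keyed messages could be spread over the three partitions above. It assumes a CRC32 hash of the key modulo the partition count, which is a stand-in chosen for illustration and not the hash Kafka itself uses. The event data is invented.

```python
import zlib

NUM_PARTITIONS = 3


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition from the message key (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [
    ("alice", "ordered latte"),
    ("bob", "ordered mocha"),
    ("alice", "paid"),
    ("bob", "paid"),
    ("alice", "picked up"),
]

# Each partition is an append-only log; a message goes where its key hashes.
for key, value in events:
    partitions[partition_for(key)].append((key, value))

alice_log = [v for k, v in partitions[partition_for("alice")] if k == "alice"]
print(alice_log)
```

Messages with the same key always land in the same partition, so their relative order is preserved there. Ordering across different partitions is not guaranteed, which is the price of splitting a topic over several computers.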

Kafka (Interesting Version)

So you've got a message bus now
Yesterday you turned events into rows-in-place
Tomorrow maybe you'll just compute events

without streams





