Julia's coding blog - Practice makes perfect

Since January 2015, she has been practicing LeetCode questions; she trains herself to stay focused and develops "muscle" memory as she practices those questions one by one. In early 2015, Julia started working on LeetCode and opened her first blog. While working through LeetCode problems, she read a lot of code and learned a little from each author; she also opened a GitHub account, published her own code, and tried writing down some of her own reflections. She learns from her favorite sport, tennis: practicing 10,000 serves builds up good memory for a great serve. Just keep going. Hard work beats talent when talent fails to work hard.

Wednesday, August 7, 2019

Scaling Instagram Infrastructure

Here is the link to the talk:

https://www.youtube.com/watch?v=hnpzNAPiC0E

Instagram stack

Cassandra
PostgreSQL Django other services

memcache

RabbitMQ

Celery

-----
Data centers - different data centers
storage vs computing

storage: need to be consistent across data centers
computing: driven by user traffic, as needed basis

Scale out: storage

PostgreSQL user, media, friendship etc.

Django -> read -> replica DC1
-> write -> master DC2
replica DC3
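
The read/write split above can be sketched as a small router shaped like Django's database-router hooks (`db_for_read`, `db_for_write`). The class name, the alias names such as `replica_dc1` and `master_dc2`, and the `local_dc` parameter are assumptions made up for illustration; the talk notes do not show Instagram's actual configuration.

```python
class RegionRouter:
    """Send reads to the replica in the local data center, writes to the one master."""

    def __init__(self, local_dc, replicas, master):
        self.local_dc = local_dc    # hypothetical label, e.g. "DC1"
        self.replicas = replicas    # data center -> replica alias
        self.master = master        # alias of the single writable master

    def db_for_read(self, model=None, **hints):
        # Fall back to the master when this data center has no replica.
        return self.replicas.get(self.local_dc, self.master)

    def db_for_write(self, model=None, **hints):
        # Every write goes to the master, wherever the request was served.
        return self.master


router = RegionRouter(
    local_dc="DC1",
    replicas={"DC1": "replica_dc1", "DC3": "replica_dc3"},
    master="master_dc2",
)
print(router.db_for_read(), router.db_for_write())
```

Reads stay inside the local data center, while a write from DC1 has to cross regions to reach the master, which is where the latency concern comes from.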

Latency -

Cassandra: user feeds, activities, etc.

replica - replica - replica

Consistency levels:
Write - 2
Read - 1
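
To see what these levels imply, here is a minimal sketch of the usual quorum rule (a read is guaranteed to see the latest write only when R + W > N). It assumes three replicas, as the "replica - replica - replica" line suggests; the function name is made up for illustration.

```python
def read_sees_latest_write(n_replicas, write_level, read_level):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_level + write_level > n_replicas


# Write 2, read 1 on 3 replicas: a read may land on the one replica
# that has not received the write yet.
print(read_sees_latest_write(3, write_level=2, read_level=1))

# Raising the read level to 2 would force an overlap.
print(read_sees_latest_write(3, write_level=2, read_level=2))
```

With write 2 and read 1 the rule does not hold, so this setup trades strict read-after-write consistency for cheaper reads.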

Computing

Django, RabbitMQ, global load balancer -> asynchronous tasks

DC1 DC2

memcache

- high performance key-value store in memory
millions of reads/writes per second
sensitive to network conditions

cross region operation is prohibitive

NO global consistency

Let's see what problems came up

Counters

select count(*) from user_like_media where media_id = 12345;

100s of ms

memcache - database

select count from media_likes where media_id = 12345;

10s of µs
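
The two queries above can be tried side by side with a small sketch using SQLite from the Python standard library (the talk is about PostgreSQL; SQLite is swapped in only so the example runs anywhere). The table and column names follow the queries in these notes; the `like` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_like_media (user_id INTEGER, media_id INTEGER);
    CREATE TABLE media_likes (media_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);
""")


def like(user_id, media_id):
    """Record one like and bump the denormalized counter in the same transaction."""
    with conn:
        conn.execute("INSERT INTO user_like_media VALUES (?, ?)", (user_id, media_id))
        updated = conn.execute(
            "UPDATE media_likes SET count = count + 1 WHERE media_id = ?", (media_id,)
        ).rowcount
        if updated == 0:
            # First like for this media: create the counter row.
            conn.execute("INSERT INTO media_likes VALUES (?, 1)", (media_id,))


for user_id in range(1000):
    like(user_id, 12345)

# Scans every like row for the media.
slow = conn.execute(
    "SELECT count(*) FROM user_like_media WHERE media_id = 12345"
).fetchone()[0]
# Reads a single precomputed row.
fast = conn.execute(
    "SELECT count FROM media_likes WHERE media_id = 12345"
).fetchone()[0]
print(slow, fast)
```

Both queries return the same number; the second stays cheap no matter how many likes a post collects, at the cost of one extra write per like.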

Cache invalidated -> all Django servers hit the database at once

Memcache lease

Timeline: d1 and d2 are two Django servers; both talk to memcache, with the DB behind it.

d1: lease-get -> fill
d2: lease-get -> wait or use stale

d1: read from DB

d1: lease-set

d2: lease-get
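
The lease sequence can be sketched with a toy in-memory cache. This is not a real memcache client: the `LeaseCache` class, its method names, and the `("hit" | "fill" | "wait", payload)` return shape are all assumptions chosen to mirror the lease-get / lease-set steps in the notes.

```python
import itertools


class LeaseCache:
    """Toy cache where only the lease holder may refill a missing key."""

    def __init__(self):
        self._values = {}   # key -> fresh value
        self._stale = {}    # key -> last invalidated value
        self._leases = {}   # key -> outstanding lease token
        self._tokens = itertools.count(1)

    def invalidate(self, key):
        if key in self._values:
            self._stale[key] = self._values.pop(key)
        self._leases.pop(key, None)  # any in-flight fill is now outdated

    def lease_get(self, key):
        """Return ("hit", value), ("fill", token) or ("wait", stale_value)."""
        if key in self._values:
            return "hit", self._values[key]
        if key in self._leases:
            return "wait", self._stale.get(key)
        token = next(self._tokens)
        self._leases[key] = token
        return "fill", token

    def lease_set(self, key, token, value):
        """Store the value only if the caller still holds the current lease."""
        if self._leases.get(key) != token:
            return False
        del self._leases[key]
        self._values[key] = value
        self._stale.pop(key, None)
        return True


cache = LeaseCache()
_, first_token = cache.lease_get("likes:12345")
cache.lease_set("likes:12345", first_token, 41)   # warm the cache
cache.invalidate("likes:12345")                   # a new like arrives

d1 = cache.lease_get("likes:12345")               # d1 is told to fill
d2 = cache.lease_get("likes:12345")               # d2 waits or uses the stale 41
stored = cache.lease_set("likes:12345", d1[1], 42)  # d1 read 42 from the DB
d2_retry = cache.lease_get("likes:12345")         # d2 now gets the fresh value
print(d1, d2, stored, d2_retry)
```

Only one server goes to the database per invalidation; the others either wait briefly or accept a slightly stale value.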

DC1 DC2

Scaling out

capacity
reliability

scaling out - challenges, opportunities

Beyond North America
More localized social network

17:00
CPU impact: -12 to 10 %
Regression -> a real problem

CPU: monitor -> analyze -> optimize
CPU - analyze
continuous profiling

generate_profile explore --start <start-time> --duration <minutes>

Caller
Callee
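
The `generate_profile` command above belongs to tooling these notes do not describe further, so it is not reproduced here. As a rough stand-in, the caller/callee view can be sketched with `cProfile` and `pstats` from the Python standard library; `render_feed` and `resize` are made-up function names.

```python
import cProfile
import io
import pstats


def resize(n):
    # Stand-in for a hot callee.
    total = 0
    for i in range(n):
        total += i * i
    return total


def render_feed():
    # Caller that spends most of its time inside resize().
    results = []
    for _ in range(50):
        results.append(resize(2000))
    return results


profiler = cProfile.Profile()
profiler.enable()
render_feed()
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative")
stats.print_callers("resize")        # which functions call resize()
stats.print_callees("render_feed")   # which functions render_feed() calls
print(report.getvalue())
```

The callers table answers "who is responsible for this cost", and the callees table answers "where does this function spend its time".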

Optimize -
reload Instagram feed - each URL

variety of mobile devices - best user experience -
multiple URLs sent to each mobile device, one per media size
do less
300x300
150x150
400x600
200x200

C is really faster
Candidate functions:
Used extensively
Stable

Cython or C/C++

Scale up

Use as few CPU instructions as possible
Use as few servers as possible

One web server
process 1 ... N
Reduce code
Run in optimized mode (-O)
Remove dead code
cProfile - find code that is never executed and remove it
Share more -

scale up: memory

Move configuration to shared memory
metrics to measure the tradeoff
Disable garbage collection
20+% capacity increase
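
The notes do not say how garbage collection was disabled, so the following is only a minimal sketch of the standard-library call involved, `gc.disable()`. The `gc_disabled` context manager is a hypothetical helper, not something from the talk.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn off the cyclic garbage collector inside the block, then restore it."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


enabled_before = gc.isenabled()

with gc_disabled():
    inside = gc.isenabled()   # False: no automatic cycle collection here
    data = [object() for _ in range(10_000)]
    del data                  # still freed at once by reference counting

print(inside, gc.isenabled() == enabled_before)
```

Disabling the collector only stops the search for reference cycles; ordinary objects are still released by reference counting. The tradeoff is that any cycles the code creates are no longer reclaimed.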

Scale up: network latency

Synchronous processing model with long latency

Shared memory / private memory

Django -> async IO (Feed, news, friends suggestion) - not sequential
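
A minimal sketch of the "not sequential" idea, using `asyncio` from the standard library. The `fetch` coroutine and its delays are stand-ins for real network calls and are assumptions, not Instagram code.

```python
import asyncio


async def fetch(name, delay):
    # Stand-in for a network call; asyncio.sleep yields control while "waiting".
    await asyncio.sleep(delay)
    return f"{name} ready"


async def load_home():
    # All three requests are in flight at once instead of one after another.
    return await asyncio.gather(
        fetch("feed", 0.03),
        fetch("news", 0.02),
        fetch("friend suggestions", 0.01),
    )


results = asyncio.run(load_home())
print(results)
```

Run sequentially, the total wait would be the sum of the three delays; with `asyncio.gather` it is roughly the longest single delay, and results still come back in the order the calls were listed.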

Faster Python runtime
Debugging friendly - Python is better than C/C++
Async web framework
- extra service
Better memory analysis
etc.

Scale dev team

Scaling team

30% engineers joined in last 6 months
Intern - 12 weeks
hack-A-Month - 4 weeks
Bootcampers - 1 week

ramp up time ?

32:00/ 51:11
Features
Saved posts
comment filtering
Multiple media in one post
First story notification
Windows app
Instagram Live
Video View notification
Self-harm prevention

Product engineer asks those questions:

Which server?
New table or new column?
What index?
Should I cache it?
Will I lock up DB?

Heavy process
Will I bring down Instagram?

Infrastructure engineers babysit ...

What we want
- automatically handle cache
Define relations, not worry about implementation

TAO - data model and API
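
To make "define relations, not worry about implementation" concrete, here is a toy in-memory store of objects and typed associations. The operation names (`obj_add`, `assoc_add`, `assoc_count`, `assoc_range`) are loosely modeled on Facebook's published description of TAO; the `GraphStore` class, its storage layout, and the simplified signatures are assumptions and not the real API.

```python
from collections import defaultdict


class GraphStore:
    """Toy store: objects are nodes, associations are typed, time-ordered edges."""

    def __init__(self):
        self._objects = {}
        self._assocs = defaultdict(list)  # (id1, atype) -> [(time, id2)], newest first
        self._next_id = 1

    def obj_add(self, otype, **data):
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = {"otype": otype, **data}
        return oid

    def assoc_add(self, id1, atype, id2, time):
        edges = self._assocs[(id1, atype)]
        edges.append((time, id2))
        edges.sort(reverse=True)  # keep newest edges first

    def assoc_count(self, id1, atype):
        # The caller asks for a count; how it is cached is the store's problem.
        return len(self._assocs[(id1, atype)])

    def assoc_range(self, id1, atype, pos, limit):
        return [id2 for _, id2 in self._assocs[(id1, atype)][pos:pos + limit]]


store = GraphStore()
media = store.obj_add("media", caption="sunset")
alice = store.obj_add("user", name="alice")
bob = store.obj_add("user", name="bob")
store.assoc_add(media, "liked_by", alice, time=1)
store.assoc_add(media, "liked_by", bob, time=2)

print(store.assoc_count(media, "liked_by"))
print(store.assoc_range(media, "liked_by", 0, 1))
```

A product engineer only declares that a user liked a media object; questions such as which table, which index, and whether to cache are answered once inside the store rather than per feature.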

Source control

with branches
Context switching
Code sync/merge overhead
Surprise
Refactor/major upgrade
Performance tracking harder

Adopt one master approach
No branches
- continuous integration
Collaborate easily
Fast bisect and revert
continuous performance monitoring

Feature launch

36:56/ 51:11
Engineers
Dogfooder
Employees
Some demographics
World
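
The widening audience above can be sketched as a simple gate. The stage labels are adapted from the list in these notes; the `STAGES` constant, the `is_enabled` function, and the idea of passing a user's groups as a set are assumptions for illustration.

```python
STAGES = ["engineers", "dogfooders", "employees", "some demographics", "world"]


def is_enabled(user_groups, current_stage):
    """A user sees the feature if any of their groups has already been reached."""
    reached = STAGES[: STAGES.index(current_stage) + 1]
    if "world" in reached:
        return True  # fully launched: everyone gets it
    return any(group in reached for group in user_groups)


print(is_enabled({"engineers"}, "engineers"))
print(is_enabled({"employees"}, "dogfooders"))
print(is_enabled(set(), "world"))
```

Each stage exposes the feature to a larger group, so load and bugs show up on a small audience before the whole world sees the feature.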

Feature load test

How many users use the feature? - feature load test
Ship it live? Once a week? Once a day? Once per diff!
40-60 rollouts a day

Checks and balances

Code review / unit tests -> code accepted and committed -> canary -> to the wild

Alert system - Do the needful and revert

Scale up
Scale dev team
Scale out

40:00
Takeaways
Scaling is a continuous effort
Scaling is multi-dimensional
Scaling is everybody's responsibility

41:00
Questions
