Wednesday, August 7, 2019

Scaling Instagram Infrastructure

Here is the link.

https://www.youtube.com/watch?v=hnpzNAPiC0E

Instagram stack

Cassandra
PostgreSQL               Django             other services

memcache

RabbitMQ

Celery  


-----
Data centers - different data centers
storage vs computing

storage: need to be consistent across data centers
computing: driven by user traffic, as needed basis

Scale out: storage

PostgreSQL  user, media, friendship etc. 

Django -> read  -> replica   DC1
       -> write -> master    DC2
                   replica   DC3
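
A minimal sketch of the read/write split above, written in the shape of a Django database router. The class name and the aliases "replica" and "master" are assumptions for illustration; the talk does not show Instagram's actual routing code.

```python
class ReadReplicaRouter:
    """Send reads to a local replica and writes to the single master.

    The aliases "replica" and "master" are hypothetical; in a real project
    they would have to match keys in the DATABASES setting.
    """

    def db_for_read(self, model, **hints):
        # Reads are served by the replica in the local data center.
        return "replica"

    def db_for_write(self, model, **hints):
        # Writes always go to the master, even if it lives in another DC.
        return "master"


router = ReadReplicaRouter()
print(router.db_for_read(None), router.db_for_write(None))
```

Writes pay the cross-data-center round trip to the master, which is likely where the "Latency" note below comes from.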

Latency - 

Cassandra user feeds, activities etc. 

   replica  - replica - replica

Consistency level:

Write - 2
Read - 1 
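
A small worked check of what write 2 / read 1 implies, assuming the three replicas drawn above and reading the numbers as how many replicas must acknowledge each operation. The function name is hypothetical. The rule used is the standard quorum condition: a read is guaranteed to see the latest write only when R + W > N.

```python
def read_sees_latest_write(n_replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas


# Three replicas, write consistency 2, read consistency 1, as in the notes.
print(read_sees_latest_write(3, 2, 1))
```

With these settings the condition does not hold, so a read can land on the one replica that has not received the write yet. That trades strict consistency for faster reads, which seems acceptable for feeds and activity data.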

Computing 

   Django, RabbitMQ, global load balancer -> asynchronous tasks 

DC1  DC2

memcache  

- high performance key-value store in memory
millions of reads/writes per second
sensitive to network conditions

cross region operation is prohibitive 

 NO global consistency 

Let's see what problems came up


Counters

select count(*) from user_like_media where media_id = 12345;

100s of ms

memcache - database 


select count from media_likes where media_id = 12345

10s of us
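
A minimal sketch of the denormalized counter, using sqlite3 as a stand-in for PostgreSQL. The table names come from the queries above; the user_id column and the helper names like() and like_count() are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_like_media (user_id INTEGER, media_id INTEGER);
    CREATE TABLE media_likes (media_id INTEGER PRIMARY KEY, count INTEGER NOT NULL);
""")


def like(user_id, media_id):
    """Record the like and bump the denormalized counter in one transaction."""
    with conn:
        conn.execute("INSERT INTO user_like_media VALUES (?, ?)", (user_id, media_id))
        updated = conn.execute(
            "UPDATE media_likes SET count = count + 1 WHERE media_id = ?", (media_id,)
        ).rowcount
        if updated == 0:
            # First like for this media: create the counter row.
            conn.execute("INSERT INTO media_likes VALUES (?, 1)", (media_id,))


def like_count(media_id):
    """Single-row lookup instead of count(*) over every like row."""
    row = conn.execute(
        "SELECT count FROM media_likes WHERE media_id = ?", (media_id,)
    ).fetchone()
    return row[0] if row else 0


for user_id in range(1000):
    like(user_id, 12345)
print(like_count(12345))
```

The read becomes a primary-key lookup, at the cost of one extra write per like and a counter that must be kept in step with the detail table.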


Cache invalidated 

All djangos 

Memcache lease 


 time   d1   d2   memcache   db


lease-get -> fill
lease-get -> wait or use stale

read from DB

lease-set

lease-get
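
The lease sequence above can be sketched with a toy in-process cache. This is not a real memcache client; the class LeaseCache and its method names are hypothetical and only mirror the lease-get / lease-set steps in the notes.

```python
class LeaseCache:
    """Toy cache with leases: only one client at a time may refill a key."""

    def __init__(self):
        self._values = {}     # key -> last known value (may be stale)
        self._fresh = set()   # keys whose value is current
        self._leases = {}     # key -> token of the client allowed to fill
        self._next_token = 0

    def lease_get(self, key):
        """Return ("hit", value), ("fill", token) or ("wait", stale_value)."""
        if key in self._fresh:
            return "hit", self._values[key]
        if key not in self._leases:
            self._next_token += 1
            self._leases[key] = self._next_token
            return "fill", self._next_token
        # Someone else is already refilling: wait, or use the stale value.
        return "wait", self._values.get(key)

    def lease_set(self, key, value, token):
        """Store the value only if the caller still holds the lease."""
        if self._leases.get(key) != token:
            return False
        del self._leases[key]
        self._values[key] = value
        self._fresh.add(key)
        return True

    def invalidate(self, key):
        """Mark the key stale and cancel any outstanding lease."""
        self._fresh.discard(key)
        self._leases.pop(key, None)


cache = LeaseCache()
status_d1, token = cache.lease_get("likes:12345")   # d1 is told to fill
status_d2, stale = cache.lease_get("likes:12345")   # d2 is told to wait
cache.lease_set("likes:12345", 1000, token)         # d1 read the DB once
print(status_d1, status_d2, cache.lease_get("likes:12345"))
```

After an invalidation only the lease holder goes to the database, so a burst of Django processes asking for the same key does not turn into a burst of identical queries.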

  DC1        DC2

Scaling out 

capacity
reliability 

scaling out - challenges, opportunities

Beyond North America
More localized social network

17:00 
CPU impact: -12 to 10 %
Regression -> a real problem 

CPU   monitor   analyze  optimize
CPU - analyze
continuous profiling 

generate_profile explore --start <start-time> --duration <minutes>

Caller 
Callee 
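
The generate_profile command above appears to be an Instagram-internal tool, so the following is only a rough stand-in using the standard library's cProfile and pstats to get the same caller / callee views. The functions resize() and render_feed() are made-up workloads.

```python
import cProfile
import io
import pstats


def resize(width, height):
    # Stand-in for a hot callee.
    return sum(i * i for i in range(width * height))


def render_feed():
    # Stand-in for a caller that fans out to the hot callee.
    total = 0
    for width, height in [(30, 30), (15, 15), (40, 60), (20, 20)]:
        total += resize(width, height)
    return total


profiler = cProfile.Profile()
profiler.enable()
render_feed()
profiler.disable()

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out).sort_stats("cumulative")
stats.print_callers("resize")        # who calls the hot function
stats.print_callees("render_feed")   # where the caller spends its time
report = out.getvalue()
print(report)
```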

Optimize - 
reload instagram feed - each url 

variety of mobile - best user experience - 
multiple url to mobile device, size of media 
do less 
300x300
150x150
400x600
200x200

C is really faster
Candidate functions:
Used extensively
Stable


Cython or C/C++ 

Scale up 

Use as few CPU instructions as possible
Use as few servers as possible

One web server 
process 1 ... N
Reduce code 
Run in optimized mode (-O)
Remove dead code 
cProfile - find code that is never executed - remove that code
Share more - 

scale up: memory

Move configuration to shared memory
matrix to measure tradeoff
Disable garbage collection 
20+% capacity increase
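
A minimal sketch of the switch itself, using the standard gc module. The notes do not say which mechanism Instagram used, so gc.disable() here is an assumption, and the helper name is hypothetical.

```python
import gc


def disable_cyclic_gc():
    """Turn off the cyclic collector; reference counting still frees most objects."""
    gc.disable()
    return gc.isenabled()


was_enabled = gc.isenabled()
still_enabled = disable_cyclic_gc()
print(was_enabled, still_enabled)
```

With the collector off, only objects caught in reference cycles go unreclaimed, so memory growth needs to be watched.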

Scale up: network latency

Synchronous processing model with long latency

Shared memory / private memory 


Django -> async IO (Feed, news, friends suggestion) - not sequential 
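
A minimal sketch of the non-sequential idea with asyncio. The three calls stand in for the feed, news and friend suggestion backends; the function names and delays are made up for illustration.

```python
import asyncio
import time


async def fetch(name, delay):
    # Stand-in for a network call to a backend service.
    await asyncio.sleep(delay)
    return name


async def load_home():
    # Issue the three calls concurrently instead of one after another.
    return await asyncio.gather(
        fetch("feed", 0.1),
        fetch("news", 0.1),
        fetch("friend_suggestions", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(load_home())
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

Run sequentially, the three calls would take about the sum of their latencies; gathered, the request takes roughly as long as the slowest one.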

Faster python run-time 
Debugging friendly - Python is better than C/ C++
Async web framework 
- extra service
Better memory analysis
etc etc

Scale dev team

Scaling team 

30% engineers joined in last 6 months
Intern - 12 weeks
Hack-A-Month - 4 weeks
Bootcampers - 1 week

ramp up time ? 

32:00/ 51:11
Features
Saved posts
comment filtering 
Multiple media in one post
First story notification
Windows app
Instagram Live
Video View notification
Self-harm prevention

Product engineer asks those questions:

Which server?
New table or new column?
What index?
Should I cache it?
Will I lock up DB?

Heavy process 
Will I bring down Instagram?

Infrastructure engineers babysit ...

What we want
- automatically handle cache
Define relations, not worry about implementation 


TAO - Data model and API
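
A toy illustration of "define relations, not worry about implementation": the product engineer only adds and counts typed associations, and the store decides how to keep them. This in-memory class is hypothetical and is not TAO's real API; the method names are only loosely modeled on an objects-and-associations style.

```python
from collections import defaultdict


class ToyGraphStore:
    """In-memory typed associations; storage and counting are hidden from callers."""

    def __init__(self):
        self._assocs = defaultdict(set)   # (id1, assoc_type) -> {id2, ...}

    def assoc_add(self, id1, assoc_type, id2):
        self._assocs[(id1, assoc_type)].add(id2)

    def assoc_delete(self, id1, assoc_type, id2):
        self._assocs[(id1, assoc_type)].discard(id2)

    def assoc_count(self, id1, assoc_type):
        return len(self._assocs.get((id1, assoc_type), ()))


store = ToyGraphStore()
store.assoc_add(12345, "liked_by", 1)
store.assoc_add(12345, "liked_by", 2)
store.assoc_add(12345, "liked_by", 2)   # a duplicate like is ignored
print(store.assoc_count(12345, "liked_by"))
```

Questions such as "which server", "what index" and "should I cache it" then belong to whoever implements the store, not to each feature.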

Source control

with branches 
Context switching 
Code sync/merge overhead
Surprise
Refactor/major upgrade
Performance tracking harder

Adopt one master approach 
No branches
- continuous integration
Collaborate easily
Fast bisect and revert
continuous performance monitoring

Feature launch 


36:56/ 51:11
Engineers
Dogfooder
Employees
Some demographics
World
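
The launch stages above can be sketched as a simple feature gate. The stage identifiers and the function feature_enabled() are assumptions for illustration; the talk does not describe the actual gating system.

```python
# Rollout order taken from the list above, widest audience last.
STAGES = ["engineers", "dogfooders", "employees", "some_demographics", "world"]


def feature_enabled(user_groups, current_stage):
    """A user sees the feature if they belong to any group released so far."""
    released = STAGES[: STAGES.index(current_stage) + 1]
    if "world" in released:
        return True
    return any(group in released for group in user_groups)


print(feature_enabled({"engineers"}, "dogfooders"))
print(feature_enabled({"employees"}, "dogfooders"))
```

Each stage widens the audience, so load and regressions show up on a small group before the feature reaches everyone.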


Feature load test 

how many users use the feature - feature load test 
Ship it live? Once a week? Once a day? Once a diff !!
40-60 rollouts a day

Checks and balances

Code review/ unittest -> code accepted committed -> Canary -> To the wild 

Alert system - Do the needful and revert 

Scale up
Scale dev team
Scale out

40:00 
Takeaways
Scaling is continuous effort
Scaling is multi-dimensional
Scaling is everybody's responsibility

41:00 
Questions









