https://www.youtube.com/watch?
Instagram stack
Cassandra
PostgreSQL, Django, other services
memcache
RabbitMQ
Celery
-----
Data centers - Instagram runs across multiple data centers
storage vs. computing
storage: needs to be consistent across data centers
computing: driven by user traffic, scaled on an as-needed basis
Scale out: storage
PostgreSQL: user, media, friendship, etc.
Django -> read -> local replica (e.g. DC1)
       -> write -> master (e.g. DC2), replicated out to the other DCs (e.g. DC3)
Latency - reads stay local; writes pay the cross-DC replication cost
Cassandra: user feeds, activities, etc.
replica - replica - replica (3 replicas)
consistency: write at 2, read at 1
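The write-at-2 / read-at-1 setting trades read freshness for latency; a minimal sketch of the W + R > N rule (my addition, not from the talk):

```python
# Sketch: a Cassandra-style store only guarantees that reads always see
# the latest write when write + read consistency levels exceed the
# replica count (W + R > N).
def strongly_consistent(n_replicas: int, write_cl: int, read_cl: int) -> bool:
    # A read overlaps every write only when the two quorums must intersect.
    return write_cl + read_cl > n_replicas

print(strongly_consistent(3, 2, 1))  # False: reads can return stale data
print(strongly_consistent(3, 2, 2))  # True: read and write quorums overlap
```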
Computing
Django, RabbitMQ, global load balancer -> asynchronous tasks
DC1 DC2
memcache
- high performance key-value store in memory
millions of reads/writes per second
sensitive to network conditions
cross-region operations are prohibitively expensive
NO global consistency
Let's see what problems came up
Counters
select count(*) from user_like_media where media_id = 12345;
-> 100s of ms
Fix: denormalize the count and put memcache in front of the database
select count from media_likes where media_id = 12345;
-> 10s of µs
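A minimal sketch of the denormalized-counter pattern above, with plain dicts standing in for memcache and for the media_likes count table (helper names are mine):

```python
# Sketch: read a precomputed per-media count instead of COUNT(*),
# and cache it; invalidate (don't update) the cache on writes.
cache = {}                            # stands in for memcache
media_likes = {12345: 1_000_000}      # denormalized per-media like count

def get_like_count(media_id):
    key = f"media_likes:{media_id}"
    if key in cache:                  # cache hit: ~10s of microseconds
        return cache[key]
    count = media_likes[media_id]     # indexed single-row read, not COUNT(*)
    cache[key] = count
    return count

def on_like(media_id):
    media_likes[media_id] += 1
    cache.pop(f"media_likes:{media_id}", None)   # invalidate on write
```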
Cache invalidation: when the cached counter is invalidated, all Django servers hit the database at once (thundering herd)
Memcache lease
time   d1           d2           memcache / db
 1   lease-get ->               (d1 gets the fill token)
 2                lease-get ->  (wait, or use the stale value)
 3   read from db
 4   lease-set ->               (cache filled)
 5                lease-get ->  (hit)
DC1 DC2
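The lease flow above can be sketched in Python; a lock-protected dict and set stand in for memcached's lease-get/lease-set (the real protocol hands out a lease token), so on a miss only one caller fills from the database:

```python
import threading
import time

cache = {}
leases = set()
lock = threading.Lock()
db_reads = 0

def read_db(key):
    global db_reads
    db_reads += 1
    time.sleep(0.01)                  # simulate a slow query
    return f"value-for-{key}"

def get(key):
    while True:
        with lock:
            if key in cache:          # hit: no DB work
                return cache[key]
            if key not in leases:
                leases.add(key)       # "lease-get" won: we do the fill
                break
        time.sleep(0.001)             # someone else holds the lease: wait

    value = read_db(key)
    with lock:
        cache[key] = value            # "lease-set": publish the value
        leases.discard(key)
    return value

# Ten concurrent requests for the same key cause exactly one DB read.
threads = [threading.Thread(target=get, args=("media:12345",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```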
Scaling out
capacity
reliability
scaling out - challenges, opportunities
Beyond North America
More localized social network
17:00
CPU impact: -12% to +10%
Regression -> a real problem
CPU monitor analyze optimize
CPU - analyze
continuous profiling
generate_profile explore --start <start-time> --duration <minutes>
Caller
Callee
Optimize - do less
reloading the Instagram feed generates a URL for each media item
for the best user experience on a variety of mobile devices, the server returns multiple URLs per media item, one per size:
300x300
150x150
400x600
200x200
C is really faster
Candidate functions for Cython or C/C++:
- used extensively
- stable
Scale up
Use as few CPU instructions as possible
Use as few servers as possible
One web server
process 1 ... N
Reduce code
Run in optimized mode (-O)
Remove dead code
cProfile - find code that is never executed and remove it
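A sketch (my own, not Instagram's tooling) of profiling a request handler with cProfile; functions that never appear across profiles of representative traffic are dead-code removal candidates:

```python
import cProfile
import io
import pstats

# The handler is a stand-in for a real Django view.
def handler():
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Dump the stats sorted by cumulative time; live functions show up here.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("handler" in report)  # True
```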
Share more -
scale up: memory
Move configuration to shared memory
metrics to measure the tradeoff
Disable garbage collection
20+% capacity increase
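A sketch of the shared-memory trick on the Python side: disable the cyclic collector, and on Python 3.7+ freeze existing objects into a permanent generation before forking workers, so copy-on-write pages stay shared (the config dict is a stand-in):

```python
import gc

# Stand-in for configuration loaded once in the parent process.
shared_config = {"image_sizes": [(300, 300), (150, 150)]}

gc.disable()   # stop the cyclic collector from touching object headers
gc.freeze()    # Python 3.7+: exempt existing objects from future collections

print(gc.isenabled())  # False
# Workers forked after this point share the config pages with the parent.
```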
Scale up: network latency
Synchronous processing model with long latency
Shared memory / private memory
Django -> async IO (feed, news, friend suggestions): fetch concurrently, not sequentially
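The concurrent model can be sketched with asyncio (stub coroutines stand in for the real feed/news/suggestion services):

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)        # stands in for a network call
    return name

async def build_page():
    # gather runs all three concurrently: total latency is roughly the
    # slowest call, not the sum of the three.
    return await asyncio.gather(
        fetch("feed", 0.03),
        fetch("news", 0.02),
        fetch("friend_suggestions", 0.01),
    )

sections = asyncio.run(build_page())
print(sections)  # ['feed', 'news', 'friend_suggestions']
```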
Faster Python runtime
Debugging-friendly - Python is better here than C/C++
Async web framework
- extra service
Better memory analysis
etc etc
Scale dev team
Scaling team
30% engineers joined in last 6 months
Intern - 12 weeks
hack-A-Month - 4 weeks
Bootcampers - 1 week
ramp-up time?
32:00/ 51:11
Features
Saved posts
comment filtering
Multiple media in one post
First story notification
Windows app
Instagram Live
Video View notification
Self-harm prevention
Product engineers ask these questions:
Which server?
New table or new column?
What index?
Should I cache it?
Will I lock up the DB?
Is this a heavy process?
Will I bring down Instagram?
Infrastructure engineers babysit ...
What we want
- automatically handle caching
- define relations, don't worry about the implementation
TAO - data model and API
Source control
with branches
- context switching
- code sync/merge overhead
- surprises
- refactors/major upgrades are harder
- performance tracking is harder
Adopt one master approach
No branches
- continuous integration
- collaborate easily
- fast bisect and revert
- continuous performance monitoring
Feature launch
36:56/ 51:11
Engineers
Dogfooder
Employees
Some demographics
World
Feature load test
how many users will use the feature - run a feature load test before launch
Ship it live? Once a week? Once a day? Once a diff!!
40-60 rollouts a day
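A sketch of how per-user gating for such staged rollouts is commonly done, via deterministic hashing (function and feature names are hypothetical, not Instagram's actual gatekeeper):

```python
import hashlib

# Bucket a user deterministically so a feature can be opened to a growing
# percentage of users (engineers -> employees -> world) without a user
# flapping in and out between requests.
def in_rollout(feature, user_id, percent):
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10000          # stable bucket in [0, 10000)
    return bucket < percent * 100             # percent in [0, 100]

print(in_rollout("saved_posts", 42, 100))  # True: everyone is in at 100%
print(in_rollout("saved_posts", 42, 0))    # False: nobody is in at 0%
```

Because the bucket depends only on the feature name and user id, raising the percentage only ever adds users; it never kicks existing users out.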
Checks and balances
Code review / unit tests -> code accepted and committed -> canary -> to the wild
Alert system - detect problems and revert
Scale up
Scale dev team
Scale out
40:00
Takeaways
Scaling is a continuous effort
Scaling is multi-dimensional
Scaling is everybody's responsibility
41:00
Questions