Thursday, August 8, 2019

Messaging at Scale at Instagram


https://www.youtube.com/watch?v=E708csv4XgY

Chained tasks

Batch of 10,000 followers per task
Tasks yield successive tasks (sketch below)
Much finer-grained load balancing
Failure/reload penalty is low
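A minimal sketch of the chained-task idea, written with Celery for concreteness (not the talk's actual code); get_followers and deliver_to_feed are hypothetical helpers and the broker URL is made up:

from celery import Celery

app = Celery('feed', broker='amqp://guest:guest@localhost:5672//')

BATCH_SIZE = 10000

@app.task
def fanout_to_followers(post_id, offset=0):
    # Process one batch of followers, then enqueue the successor task.
    followers = get_followers(post_id, offset, BATCH_SIZE)   # hypothetical helper
    for follower_id in followers:
        deliver_to_feed(follower_id, post_id)                # hypothetical helper
    if len(followers) == BATCH_SIZE:
        # Yielding successive tasks keeps each unit small: a failure only
        # redoes one batch, and short tasks balance well across workers.
        fanout_to_followers.delay(post_id, offset + BATCH_SIZE)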

Other Async tasks

Cross-posting to other networks
search indexing
spam analysis
account deletion
API hook



Gearman framework  - load balancer

Gearman in production

Persistence was horrifically slow and complex
So we ran out of memory and crashed, with no recovery
Single core, didn't scale well
60ms mean submission time for us
Probably should have just used Redis
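For reference, a background submission through the python-gearman client looked roughly like this (a hedged sketch: the host, function name, and payload are hypothetical):

import json
import gearman

client = gearman.GearmanClient(['gearmand.example.com:4730'])

# Fire-and-forget: the job is queued for a registered 'crosspost' worker
# and the caller does not wait for a result.
client.submit_job('crosspost', json.dumps({'media_id': 1234}), background=True)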

(Gearman vs Redis)

Celery

Distributed task framework
Highly extensible, pluggable
Mature, Feature rich
Great tooling
Excellent Django support
celeryd
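A minimal sketch of the workflow the bullets above describe (task name and broker URL are hypothetical): define a task, enqueue it from web/Django code, and let worker processes consume it.

from celery import Celery

app = Celery('ig_tasks', broker='amqp://guest:guest@localhost:5672//')

@app.task
def delete_account(user_id):
    # Long-running work happens in the worker, not in the web request.
    ...

# From application code:
#     delete_account.delay(user_id)
# Worker process (celeryd in talk-era releases; newer versions use):
#     celery -A ig_tasks worker --loglevel=info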

Which broker?

Redis

We already use it
Very fast, efficient
Polling for task distribution
Messy, non-synchronous replication
Memory limits task capacity


Beanstalk (beanstalkd)

Purpose-built task queue
Very fast, efficient
Pushes to Consumers
Spills to disk
No replication
Useless for anything else


RabbitMQ

Reasonably fast, efficient
Spill-to-disk
Low-maintenance synchronous replication
Excellent celery compatibility
Supports other use cases
We don't know Erlang


Our RabbitMQ Setup

RabbitMQ 3.0
Clusters of two broker nodes, mirrored
Scale out by adding broker clusters
EC2 c1.xlarge, RAID instance storage
Way overprovisioned
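A hedged sketch of how the client side and the mirroring would be wired up; hosts, credentials, and vhost are hypothetical, and the HA policy shown is the generic RabbitMQ 3.x approach rather than anything confirmed in the talk.

# celeryconfig-style setting: point Celery at a node of the broker cluster.
broker_url = 'amqp://ig:secret@rabbit1.example.com:5672/production'

# Mirroring queues across the two-node cluster is a broker-side policy, e.g.:
#     rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'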


Alerting

We use Sensu
Monitors & alerts on queue length threshold
Uses rabbitmqctl list_queues
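A sketch of what such a check could look like as a Sensu-style script (thresholds and parsing are assumptions, not taken from the talk); Sensu uses Nagios-style exit codes: 0 OK, 1 warning, 2 critical.

#!/usr/bin/env python
import subprocess
import sys

WARN, CRIT = 10000, 50000   # hypothetical queue-length thresholds

def main():
    # `rabbitmqctl list_queues name messages` prints one "name count" row per queue.
    output = subprocess.check_output(['rabbitmqctl', 'list_queues', 'name', 'messages'])
    longest = 0
    for line in output.decode().splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            longest = max(longest, int(parts[1]))
    if longest >= CRIT:
        print('CRITICAL: queue length %d' % longest)
        sys.exit(2)
    if longest >= WARN:
        print('WARNING: queue length %d' % longest)
        sys.exit(1)
    print('OK: longest queue %d' % longest)
    sys.exit(0)

if __name__ == '__main__':
    main()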


Scaling out

Celery only supported a single broker host last year, when we started
Created the kombu-multibroker "shim"
Multiple brokers used in a round-robin fashion (see sketch below)
Breaks some Celery management tools :(
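The shim's code isn't shown in the talk, but the core idea is simple; an illustrative round-robin publisher using kombu (all names and URLs hypothetical, and the real kombu-multibroker surely did more):

from itertools import cycle
from kombu import Connection

BROKER_URLS = [
    'amqp://guest:guest@broker-a//',
    'amqp://guest:guest@broker-b//',
]
_brokers = cycle(BROKER_URLS)

def publish(queue_name, body):
    # Each publish goes to the next broker cluster in round-robin order.
    # Assumes the queue already exists on every cluster.
    with Connection(next(_brokers)) as conn:
        producer = conn.Producer()
        producer.publish(body, routing_key=queue_name, serializer='json')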

Concurrency models

multiprocessing (pre-fork)
eventlet
gevent
threads

Problem:

Network-bound tasks can take a long time, tying up worker slots


Run higher concurrency?
Inefficient :(

Lower batch (prefetch) size?
Minimum is the concurrency count, still inefficient :(

Separate slow & fast tasks :)

Our concurrency levels (routing sketch below)

fast (14)
feed (12)
default (6)
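One way to realize the slow/fast split at those concurrency levels (a hedged sketch: queue assignments and task names are hypothetical; the numbers are from the slide). Celery routes tasks to named queues, and each queue gets its own worker pool:

# celeryconfig-style settings (hypothetical task names)
task_routes = {
    'feed.fanout_to_followers': {'queue': 'feed'},
    'tasks.delete_account':     {'queue': 'fast'},
    'tasks.crosspost':          {'queue': 'default'},  # slow, network-bound
}
worker_prefetch_multiplier = 1   # smallest prefetch: one unacked task per worker slot

# One worker pool per queue, at the concurrency levels above:
#     celery -A ig_tasks worker -Q fast    -c 14
#     celery -A ig_tasks worker -Q feed    -c 12
#     celery -A ig_tasks worker -Q default -c 6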

Problem:
Tasks fail sometimes


Worker crashes still lose the task
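Common Celery mitigations for both problems, sketched with a hypothetical task (not code from the talk): retry on transient failure, and acknowledge late so a message that a crashed worker had picked up is redelivered instead of lost.

from celery import Celery

app = Celery('ig_tasks', broker='amqp://guest:guest@localhost:5672//')

@app.task(bind=True, max_retries=3, default_retry_delay=60, acks_late=True)
def crosspost(self, media_id, network):
    # acks_late: the message is acked only after the task finishes, so a
    # worker crash returns it to the queue (the task must be safe to re-run).
    try:
        post_to_network(media_id, network)   # hypothetical helper
    except ConnectionError as exc:
        raise self.retry(exc=exc)            # re-queue; gives up after max_retries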






Problem:
Slow tasks monopolize workers


NLP proof gives us choices: to retry or not to retry
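One common guard against a slow task holding a worker slot (hedged: the talk doesn't prescribe this exact mechanism): Celery soft/hard time limits, with the retry decision made when the soft limit fires.

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery('ig_tasks', broker='amqp://guest:guest@localhost:5672//')

@app.task(bind=True, soft_time_limit=30, time_limit=60, max_retries=2)
def index_for_search(self, media_id):
    try:
        update_search_index(media_id)    # hypothetical helper
    except SoftTimeLimitExceeded as exc:
        # "To retry or not to retry": only worth it if the task is idempotent.
        raise self.retry(exc=exc)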

Problem:

Early on: just drop the task

Publisher confirms

Confirm tasks

Avoid using async tasks as a "backup" mechanism only during failures. It'll probably break.
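One way to get publisher confirms from Celery, so a publish that never reached the broker fails loudly instead of silently dropping the task (hedged: confirm_publish is a kombu/py-amqp transport option; verify it against your kombu version):

# celeryconfig-style setting
broker_transport_options = {'confirm_publish': True}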



Future plans

Better grip on RabbitMQ performance
Utilize result storage
Single cluster for control queues
Eliminate kombu-multibroker
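On "utilize result storage": Celery can persist task state and results in a backend so callers can check on them later; a hypothetical Redis-backed configuration:

# celeryconfig-style setting (host is hypothetical)
result_backend = 'redis://redis-host:6379/1'

With a backend configured, AsyncResult.get() and .status become meaningful for submitted tasks.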





-- study topics --


Sensu

https://docs.sensu.io/sensu-core/1.4/reference/checks/#what-is-a-sensu-check




----------------------------------

Study one more topic

AWS Elastic Beanstalk
https://aws.amazon.com/elasticbeanstalk/

AWS Elastic Beanstalk handles capacity provisioning, load balancing, auto-scaling, and application health monitoring.

It takes the place of managing and configuring servers, databases, load balancers, firewalls, and networks yourself.
