Messaging at Scale at Instagram
https://www.youtube.com/watch?v=E708csv4XgY
Chained tasks (sketch below)
Batch of 10,000 followers per task
Tasks yield successive tasks
much finer-grained load balancing
Failure/Reload penalty low
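My own sketch of this pattern, written as Celery-style tasks (Celery comes up later in these notes); the task and helper names (fan_out_to_followers, get_follower_batch, push_to_feed) are made up, not Instagram's actual code.

    from celery import Celery

    app = Celery("feeds", broker="amqp://guest@localhost//")  # placeholder broker

    BATCH_SIZE = 10000  # "batch of 10,000 followers per task"

    @app.task
    def fan_out_to_followers(user_id, media_id, cursor=0):
        # Process one batch of followers, then yield a successive task.
        follower_ids = get_follower_batch(user_id, cursor, BATCH_SIZE)
        for follower_id in follower_ids:
            push_to_feed(follower_id, media_id)
        # Small batches keep load balancing fine-grained and the
        # failure/reload penalty low: a crash replays one batch, not millions.
        if len(follower_ids) == BATCH_SIZE:
            fan_out_to_followers.delay(user_id, media_id, cursor + BATCH_SIZE)

    def get_follower_batch(user_id, cursor, limit):
        return []  # placeholder: fetch `limit` follower IDs starting at `cursor`

    def push_to_feed(follower_id, media_id):
        pass  # placeholder: write one entry into the follower's feed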
Other Async tasks
Cross-posting to other networks
search indexing
spam analysis
account deletion
API hook
Gearman framework - load balancer
Gearman in production
Persistence horrifically slow, complex
So we ran out of memory and crashed, with no recovery
Single core, didn't scale well
60ms mean submission time for us
Probably should have just used Redis
(Gearman vs Redis)
Celery
Distributed task framework
Highly extensible, pluggable
Mature, feature-rich
Great tooling
Excellent Django support
celeryd runs the workers (basic shape sketched below)
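Minimal shape of a Celery task, as I understand it (broker URL and task name are placeholders, not Instagram's):

    from celery import Celery

    app = Celery("instagram_notes", broker="amqp://guest@localhost//")  # placeholder

    @app.task
    def search_index(media_id):
        pass  # illustrative body only

    # Callers enqueue work instead of running it inline:
    #   search_index.delay(media_id=1234)
    # and a celeryd worker process picks the task up and executes it.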
Which broker?
Redis
We already use it
Very fast, efficient
Polling for task distribution
Messy, non-synchronous replication
Memory limits task capacity
Beanstalk
Purpose-built task queue
Very fast, efficient
Pushes to Consumers
Spills to disk
No replication
Useless for anything else
RabbitMQ
Reasonably fast, efficient
Spill-to-disk
Low-maintenance synchronous replication
Excellent celery compatibility
Supports other use cases
We don't know Erlang
Our RabbitMQ Setup
RabbitMQ 3.0
Clusters of two broker nodes, mirrored (config sketch below)
Scale out by adding broker clusters
EC2 c1.xlarge, RAID instance storage
Way overprovisioned
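My guess at the Celery-facing side of one such cluster (hostnames and credentials are invented); the mirroring itself is a RabbitMQ policy, not a Celery setting.

    # Hypothetical celeryconfig.py for one two-node mirrored cluster.
    BROKER_URL = "amqp://ig:secret@rabbit1a.example.internal:5672//"

    # Mirroring is applied on the RabbitMQ side (3.0+ policies), roughly:
    #   rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
    # Scaling out means adding another independent two-node cluster and
    # spreading queues across clusters (see the kombu-multibroker notes below).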
Alerting
We use Sensu
Monitors & alerts on a queue-length threshold
Uses rabbitmqctl list_queues (example check below)
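A Sensu check is just a script that exits 0 (OK), 1 (warning), or 2 (critical). A minimal queue-length check built on rabbitmqctl list_queues might look like this; the thresholds are made up, not Instagram's.

    #!/usr/bin/env python
    # Sensu/Nagios-style check: alert when any queue length crosses a threshold.
    import subprocess
    import sys

    WARN, CRIT = 10000, 50000  # made-up thresholds

    def queue_lengths():
        out = subprocess.check_output(
            ["rabbitmqctl", "list_queues", "name", "messages"])
        for line in out.decode().splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[1].isdigit():
                yield parts[0], int(parts[1])

    def main():
        status = 0
        for name, depth in queue_lengths():
            if depth >= CRIT:
                print("CRITICAL: %s has %d messages" % (name, depth))
                status = max(status, 2)
            elif depth >= WARN:
                print("WARNING: %s has %d messages" % (name, depth))
                status = max(status, 1)
        if status == 0:
            print("OK: all queues below threshold")
        sys.exit(status)

    if __name__ == "__main__":
        main()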
Scaling out
Celery only supported a single broker host when we started last year
Created a kombu-multibroker "shim" (rough idea below)
Multiple brokers used in a round-robin fashion
Breaks some Celery management tools :(
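I haven't seen the kombu-multibroker source; the idea as described is roughly to keep a connection per broker cluster and rotate through them when publishing. A rough, hypothetical sketch (URLs are placeholders, queue/exchange declaration omitted):

    from itertools import cycle
    from kombu import Connection

    BROKER_URLS = [
        "amqp://ig:secret@rabbit1a.example.internal//",
        "amqp://ig:secret@rabbit2a.example.internal//",
    ]

    _connections = cycle([Connection(url) for url in BROKER_URLS])

    def publish(queue_name, body):
        # Each publish goes to the next broker cluster in the rotation,
        # which is why cluster-spanning tools (inspect, events) get confused.
        conn = next(_connections)
        producer = conn.Producer()
        producer.publish(body, routing_key=queue_name)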
Concurrency models
multiprocessing (pre-fork)
eventlet
gevent
threads
Problem:
Network-bound tasks sometimes take a long time
Run higher concurrency?
Inefficient :(
Lower batch (prefetch) size?
Minimum is the concurrency count, still inefficient :(
Separate slow & fast tasks :)
Our concurrency levels (routing sketch below)
fast (14)
feed (12)
default (6)
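A sketch of how that separation maps onto Celery routing; the queue and task names are examples, not Instagram's, and concurrency is set per worker when it starts.

    # Hypothetical celeryconfig.py excerpt: route slow and fast work to
    # separate queues so slow tasks can't monopolize the fast workers.
    CELERY_ROUTES = {
        "tasks.deliver_feed": {"queue": "feed"},
        "tasks.cross_post": {"queue": "default"},
        "tasks.bump_counter": {"queue": "fast"},
    }

    # Each worker is then started against one queue with its own concurrency,
    # something like: celeryd -Q fast -c 14, celeryd -Q feed -c 12,
    # celeryd -Q default -c 6 (matching the levels noted above).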
Problem
Tasks fail sometimes
Worker crashes still lose tasks
Problem:
Slow tasks monopolize workers
Idempotent tasks give us choices: to retry or not to retry (retry sketch below)
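Celery makes the retry decision explicit per task. A sketch of mine (task body, exception, and delay are illustrative); acks_late is the standard Celery option for not losing a task when a worker dies mid-execution, and it only helps if the task is safe to run twice.

    from celery import Celery

    app = Celery("tasks", broker="amqp://guest@localhost//")  # placeholder

    class TransientNetworkError(Exception):
        """Placeholder for whatever transient failure the real call raises."""

    def twitter_api_call(media_id):
        pass  # placeholder for the real network call

    # acks_late=True: the broker message is only acknowledged after the task
    # body finishes, so a worker crash mid-task leaves it to be redelivered.
    @app.task(bind=True, acks_late=True, max_retries=3, default_retry_delay=30)
    def post_to_twitter(self, media_id):
        try:
            twitter_api_call(media_id)
        except TransientNetworkError as exc:
            raise self.retry(exc=exc)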
Problem:
Early on, dropped tasks
Publisher confirms (pika sketch below)
Broker confirms receipt of each task
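Publisher confirms are a RabbitMQ feature: the broker acknowledges each published message, so the publisher knows nothing was silently dropped. A minimal sketch with a recent pika (connection details are placeholders; this is the raw AMQP view, not Celery's API):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)

    # Turn on publisher confirms: the broker now acks (or nacks) each publish.
    channel.confirm_delivery()

    try:
        channel.basic_publish(
            exchange="",
            routing_key="tasks",
            body=b"do-something",
            mandatory=True,
        )
        print("broker confirmed the message")
    except pika.exceptions.UnroutableError:
        print("message could not be routed to any queue")
    except pika.exceptions.NackError:
        print("broker refused (nacked) the message")

    connection.close()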
Avoid using async tasks as a "backup" mechanism that only runs during failures; it'll probably break.
Better grip on RabbitMQ performance
Utilize result storage
Single cluster for control queues
Eliminate kombu-multibroker
-- study topics --
Sensu
https://docs.sensu.io/sensu-core/1.4/reference/checks/#what-is-a-sensu-check
----------------------------------
Study one more topic
AWS Elastic Beanstalk
https://aws.amazon.com/elasticbeanstalk/
Handles capacity provisioning, load balancing, auto scaling, and application health monitoring
Removes the need to manage and configure servers, databases, load balancers, firewalls, and networks yourself