
Migrating Millions of Concurrent Websockets to Envoy

March 23, 2022


Slack has a global customer base, with millions of simultaneously connected users at peak times. Most of the communication between users involves sending lots of tiny messages to each other. For much of Slack’s history, we’ve used HAProxy as a load balancer for all incoming traffic. Today, we’ll talk about problems we faced with HAProxy, how we solved them with Envoy Proxy, the steps involved in the migration, and what the outcome was. Let’s dive in!

Websockets at Slack

To deliver messages instantly, we use a websocket connection, a bidirectional communications link that is responsible for you seeing "Several people are typing…" and then the thing they typed, nearly as fast as the speed of light permits. The websocket connections are ingested into a system called "wss" (WebSocket Service) and are accessible from the internet at wss-primary.slack.com and wss-backup.slack.com (they're not websites; you just get an HTTP 404 if you go there).

Websocket connections start out as regular HTTPS connections, and then the client issues a protocol switch request to upgrade the connection to a websocket. At Slack, we have different websocket services dedicated to messages, to presence (listing which contacts are online), and to other services. One of the websocket endpoints is specifically made for apps that need to interact with Slack (because apps want real-time communication too).
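
On the wire, that protocol switch is the standard WebSocket handshake from RFC 6455: the client sends a GET with an Upgrade header, and the server answers with 101 Switching Protocols. Roughly, the exchange looks like this (the path is purely illustrative; the key/accept values are the RFC's own example values):

    GET /websocket HTTP/1.1
    Host: wss-primary.slack.com
    Connection: Upgrade
    Upgrade: websocket
    Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
    Sec-WebSocket-Version: 13

    HTTP/1.1 101 Switching Protocols
    Connection: Upgrade
    Upgrade: websocket
    Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

From that point on, the same TCP connection carries websocket frames in both directions, which is what makes the load balancer's handling of long-lived connections so important.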

Motivation to migrate to Envoy Proxy

We had been using HAProxy since the beginning of Slack and knew how to operate it at scale, but there were some operational challenges that made us consider alternatives, like Envoy Proxy.

Hot Restarts

At Slack, it is a common event for backend service endpoint lists to change (due to instances being added or cycled away). HAProxy provides two ways to update its configuration to accommodate changes in endpoint lists. One is to use the HAProxy Runtime API. We used this approach with one of our sets of HAProxy instances, and our experience is described in another blog post — A Terrible, Horrible, No-Good, Very Bad Day at Slack. The other approach, which we used for the websockets load balancer (LB), is to render the backends into the HAProxy configuration file and reload HAProxy.

With every HAProxy reload, a new set of processes is created to handle the new incoming connections. We'd keep the old processes running for many hours to let long-lived websocket connections drain and avoid frequently disconnecting users. However, we can't have too many HAProxy processes running, each with its own point-in-time copy of the configuration; we wanted instances to converge on the new version of the configuration faster. We had to periodically reap old HAProxy processes, and restrict how often HAProxy could reload in case there was churn in the underlying backends.
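
For context, this is the usual shape of such a reload (the paths here are illustrative, not necessarily our setup): after re-rendering the configuration file, a new HAProxy process is started with -sf, which tells the old processes to stop accepting new connections but keep serving the ones they already hold until they drain.

    # Re-render haproxy.cfg with the new backend list, then soft-reload:
    haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
      -sf $(cat /var/run/haproxy.pid)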

Whichever approach we used, it needed some extra infrastructure in place for managing HAProxy reloads.

Envoy allows us to use dynamically configured clusters and endpoints, which means it doesn't need to be reloaded when the endpoint list changes. If the code or configuration does change, Envoy has the ability to hot restart itself without dropping any connections. Envoy watches its filesystem configuration for updates using inotify. It also copies statistics from the parent process to the child process during a hot restart, so gauges and counters don't get reset.
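
As a rough sketch of what file-based dynamic configuration looks like (the file paths and node names here are illustrative, not our production layout), the bootstrap points at resource files on disk, and Envoy picks up edits to those files without a process reload:

    # envoy.yaml bootstrap fragment: listeners and clusters come from files
    # on disk; Envoy watches them (via inotify) and applies changes in place.
    node:
      id: wss-lb
      cluster: wss
    dynamic_resources:
      lds_config:
        path: /etc/envoy/lds.yaml   # listener definitions
      cds_config:
        path: /etc/envoy/cds.yaml   # cluster and endpoint definitions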

This all adds up to a significant reduction in operational overhead with Envoy, and no additional services needed to manage configuration changes or restarts.

Load Balancing Features

Envoy provides several advanced load-balancing features, such as:

  • Built-in support for zone-aware routing
  • Panic Routing – Envoy will generally route traffic only to healthy backends, but it can be configured to send traffic to all backends, healthy or unhealthy, if the percentage of healthy hosts drops below a threshold. This was very helpful during our January 4, 2021 outage, which was caused by a widespread network problem in our infrastructure.
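
Both of these are per-cluster settings in Envoy. A minimal sketch, with illustrative values:

    # Per-cluster load-balancing settings (fragment of a cluster definition).
    common_lb_config:
      # Prefer endpoints in the caller's zone when enough healthy hosts exist.
      zone_aware_lb_config:
        min_cluster_size: 6
      # If fewer than 50% of endpoints are healthy, stop filtering by health
      # and spread traffic across all endpoints ("panic routing").
      healthy_panic_threshold:
        value: 50.0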

For these reasons, in 2019 we decided to migrate our ingress load balancing tier from HAProxy to Envoy Proxy, starting with the websockets stack. The major goals of the migration were improved operability, access to new features that Envoy provides, and more standardization. By moving from HAProxy to Envoy across all of Slack, we would eliminate the need for our team to know the quirks of two pieces of software, to maintain two different kinds of configuration, to manage two build and release pipelines, and so on. By then, we were already using Envoy Proxy as the data plane in our service mesh. We also have experienced Envoy developers in-house, so we have ready access to Envoy expertise.

Generating Envoy configuration

The first step in this migration was to review our existing websocket tier configuration and generate an equivalent Envoy configuration. Managing Envoy configuration was one of our biggest challenges during the migration. Envoy has a rich feature set, and its configurations are quite different to those of HAProxy. Envoy configuration deals with four main concepts:

  • Listeners, which receive requests on TCP sockets, SSL sockets, or Unix domain sockets
  • Clusters, representing the internal services that we send requests to, like message servers and presence servers
  • Routes, which glue listeners and clusters together
  • Filters, which operate on requests

Configuration management at Slack is primarily done via Chef. When we started with Envoy, we deployed the Envoy configuration as a Chef template file, but it became cumbersome and error-prone to manage. To solve this problem, we built Chef libraries and custom resources for generating Envoy configurations.

The envoy cookbook's library files: listener.rb, route.rb, cluster.rb, http_filter.rb, and envoy_config.rb.


Inside Chef, the configuration is a Singleton, modelling the fact that there is only one Envoy configuration per host. All Chef resources operate on that singleton, adding listeners, routes, or clusters. At the end of the Chef run, the envoy.yaml gets generated, validated, and then installed; we never write intermediate configurations, because they could be invalid.
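
Our cookbook itself isn't reproduced here, but the pattern looks roughly like this; the class and method names below are hypothetical, for illustration only:

    # Hypothetical sketch of the singleton pattern (not our actual library).
    require 'singleton'
    require 'yaml'

    class EnvoyConfig
      include Singleton

      def initialize
        @config = { 'static_resources' => { 'listeners' => [], 'clusters' => [] } }
      end

      # Chef resources call these during the run to accumulate configuration.
      def add_listener(listener)
        @config['static_resources']['listeners'] << listener
      end

      def add_cluster(cluster)
        @config['static_resources']['clusters'] << cluster
      end

      # Called once at the end of the run: render, validate, then install.
      def write!(path = '/etc/envoy/envoy.yaml')
        tmp = "#{path}.new"
        File.write(tmp, @config.to_yaml)
        # `envoy --mode validate` parses the config and exits non-zero if invalid.
        raise 'generated Envoy config failed validation' unless system('envoy', '--mode', 'validate', '-c', tmp)
        File.rename(tmp, path)
      end
    end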

As an example, we can create one HTTP listener with two routes that route traffic to two dynamic clusters.
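
The Chef resources themselves aren't shown here, but the envoy.yaml that would come out of such a run looks roughly like the following (route prefixes, ports, and file paths are illustrative, and TLS configuration is omitted for brevity):

    # Sketch of a generated envoy.yaml: one HTTP listener, two routes,
    # and two EDS-backed ("dynamic") clusters.
    static_resources:
      listeners:
        - name: wss_listener
          address:
            socket_address: { address: 0.0.0.0, port_value: 443 }
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: wss
                    # Allow HTTP/1.1 Upgrade requests to become websockets.
                    upgrade_configs:
                      - upgrade_type: websocket
                    route_config:
                      virtual_hosts:
                        - name: wss
                          domains: ["*"]
                          routes:
                            - match: { prefix: "/presence" }
                              route: { cluster: presence_server }
                            - match: { prefix: "/" }
                              route: { cluster: message_server }
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
        - name: message_server
          type: EDS
          eds_cluster_config:
            eds_config:
              path: /etc/envoy/eds/message_server.yaml
        - name: presence_server
          type: EDS
          eds_cluster_config:
            eds_config:
              path: /etc/envoy/eds/presence_server.yaml

Because the clusters are EDS-backed, the endpoint lists live in separate files that can be rewritten as instances come and go, without touching the listener or route configuration.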

It took some effort to replicate our complicated HAProxy configuration in Envoy. Most of the features we needed were already available in Envoy, so it was just a matter of adding support for them to the Chef library, and voila! We implemented a few missing Envoy features ourselves (some were contributed upstream and some are maintained in-house as extensions).


