
Anatomy of a Throttler, part 2

Design considerations for implementing a database throttler, with a comparison of singular vs. distributed throttler deployments.


This is part 2 of a 3-part series. You can catch up with part 1 here.

Up until now, we've only discussed the existence of the throttler, without making assumptions about its deployment or distribution. We have referenced it in singular form, and indeed it is possible to run a throttler as a singular service. Let's consider the arguments for singular vs. distributed throttler deployments.

Singular throttler design

A singular throttler is an all-knowing service that serves all requests. It should be able to probe and collect all metrics, which means it may need direct access to all database servers and to OS metrics (such as load average, which we mentioned in part 1). Such access could be acceptable in some environments. The throttler will likely hold on to a large group of persistent connections, each of which the throttler will use to read a subset of metrics. It's a simple, monolithic, synchronous approach.
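As a rough sketch, the probe loop of such a monolithic throttler might look something like the following (in Go). The heartbeat-based lag query and all names here are illustrative assumptions, not taken from any specific implementation:

```go
package throttler

import (
	"database/sql"
	"sync"
	"time"
)

// Throttler holds a persistent connection per monitored host and the latest
// metric value collected from it.
type Throttler struct {
	mu      sync.Mutex
	conns   map[string]*sql.DB // persistent connection per monitored host
	metrics map[string]float64 // latest replication lag, in seconds, per host
}

// probeAll polls every monitored host once. In the monolithic, synchronous
// design this runs on a single ticker, at high frequency.
func (t *Throttler) probeAll() {
	for host, db := range t.conns {
		var lagSeconds float64
		// Hypothetical heartbeat-based lag query; see part 1 for the metric itself.
		err := db.QueryRow(
			"SELECT UNIX_TIMESTAMP(NOW(6)) - UNIX_TIMESTAMP(ts) FROM heartbeat.heartbeat",
		).Scan(&lagSeconds)
		if err != nil {
			continue // treat this host as unavailable for this round
		}
		t.mu.Lock()
		t.metrics[host] = lagSeconds
		t.mu.Unlock()
	}
}

// Run probes all hosts once per second, forever.
func (t *Throttler) Run() {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		t.probeAll()
	}
}
```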

This approach immediately calls into question the issue of high availability. What do you do if the throttler host goes down? We can first assume it is possible to spin up a new throttler service. It may take a few moments until it's up and running, and until it has established all connections and has begun to collect data. In the interim, what do clients do? This is where a design decision must be made: do clients fail open, or do they fail closed? If no throttler is available, should clients consider themselves rejected, or should they proceed at full power? A common approach is to hold off up to some timeout and, from there on, proceed unthrottled, taking into consideration the possibility that the throttler may not be up for a while.
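A client-side sketch of that fail-open-after-timeout policy could look as follows. The HTTP check endpoint, helper names and timeouts are assumptions for illustration, not any particular throttler's API:

```go
package throttleclient

import (
	"context"
	"errors"
	"net/http"
	"time"
)

var errThrottlerUnreachable = errors.New("throttler unreachable")

// checkThrottler returns nil when the throttler accepts the request, an
// error when it rejects it, and errThrottlerUnreachable when it cannot be
// contacted at all.
func checkThrottler(ctx context.Context, checkURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, checkURL, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return errThrottlerUnreachable
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return errors.New("throttled")
	}
	return nil
}

// mayProceed holds off (fails closed) for up to failOpenAfter while the
// throttler is unreachable, and from there on fails open: it lets the
// client proceed unthrottled, accepting that the throttler may not be up
// for a while.
func mayProceed(checkURL string, failOpenAfter time.Duration) bool {
	deadline := time.Now().Add(failOpenAfter)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		err := checkThrottler(ctx, checkURL)
		cancel()
		switch {
		case err == nil:
			return true // explicitly accepted
		case errors.Is(err, errThrottlerUnreachable) && time.Now().After(deadline):
			return true // fail open after the timeout
		case errors.Is(err, errThrottlerUnreachable):
			time.Sleep(time.Second) // keep waiting for the throttler to come back
		default:
			return false // explicitly rejected by the throttler
		}
	}
}
```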

Another option is to run multiple, independent instances of the throttler. They could run in different availability zones, all oblivious to each other, and so we still consider them to be singular. They will all collect the same metrics from the same servers, albeit at slightly different times. This means two independent throttlers will exhibit the same overall behavior in allowing/refusing requests, but it is possible that at any specific point in time, two throttlers will disagree with each other.

More combinations exist, such as active-passive throttlers, with traffic always directed to the active one. The passive throttlers remain readily available should the active one step down. They will possibly be collecting metrics while passive.

It is clear that the singular approach is susceptible to a scaling problem. There are only so many connections it can maintain while running high-frequency probing.

Metrics access/API

Another issue arises in environments where the throttler cannot be allowed direct access to database and OS metrics due to security restrictions. A common solution is to set up an HTTP server, some API access point, which the throttler can check to get the host metrics. A daemon or an agent running on the host is responsible for collecting the metrics locally. The throttler's work is now made simpler, as it may only need to hit a single access point to collect all metrics for a given host. However, this design introduces many complexities.

First and foremost is the addition of a new component that needs to run reliably and continuously on the host. Metric collection is no longer in the hands of the throttler, and we must trust that component to grab those metrics. We then introduce an API, or otherwise some handshake, between the throttler and the metric collection component, which must be treated with care when it comes to upgrades and backwards compatibility.

Last but not least, we lose synchronicity and introduce multiple layers of collection polling intervals: the agent collects metrics at its own interval, say once per second, while the throttler polls the API at its own interval, which could also be once per second. The throttler may now collect data that is up to 2 seconds stale, as opposed to up to 1 second stale in the monolithic, synchronous approach. Note that there is also a latency effect on the staleness of metrics, but we will leave this discussion to a later stage.
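To make the layering concrete, here is a minimal sketch of such a host-local agent: it collects on its own one-second ticker and serves the latest snapshot over HTTP, which the throttler then polls on its own schedule. The /metrics path, port, and collectLoadAvg helper are made up for illustration:

```go
package agent

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

// snapshot is the latest set of host metrics, together with when it was
// collected, so that the throttler can reason about staleness.
type snapshot struct {
	LoadAvg     float64   `json:"load_avg"`
	CollectedAt time.Time `json:"collected_at"`
}

var (
	mu     sync.RWMutex
	latest snapshot
)

// collectLoadAvg is a stand-in for reading /proc/loadavg or similar.
func collectLoadAvg() float64 { return 0 }

// collectLoop refreshes the snapshot on the agent's own interval,
// independently of when the throttler happens to poll.
func collectLoop() {
	for range time.Tick(time.Second) {
		mu.Lock()
		latest = snapshot{LoadAvg: collectLoadAvg(), CollectedAt: time.Now()}
		mu.Unlock()
	}
}

// Serve exposes the latest snapshot over HTTP for the throttler to poll.
func Serve() error {
	go collectLoop()
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		json.NewEncoder(w).Encode(latest)
	})
	return http.ListenAndServe(":9100", nil)
}
```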

In adding this agent component, even if it's the simplest script to scrape and publish metrics, we've effectively turned our singular throttler into a distributed, multi-component system.

Distributed throttler design

We now consider the case for a distributed throttler design, and again find that there are multiple approaches to distribution.

In the context of your environment, does it make sense for a throttler to probe hosts/services outside its availability zone? It is possible to run a throttler per AZ that only collects metrics within that AZ. Each throttler considers itself to be singular and is oblivious to other throttlers, but clients need to know which throttler to address.

A variation of the above could introduce some collaboration between the throttlers: a us-east-1 throttler can advertise its own collected metrics to a us-west-1 throttler. This lets the us-west-1 throttler grab all us-east-1 metrics in one go, without needing direct access to the underlying hosts or services. This is a glorified and scaled implementation of the agent architecture above, where all components are throttlers.
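A sketch of that advertisement, seen from the consuming throttler's side, might look like this. The /aggregated/metrics path and the JSON shape are assumptions rather than any real throttler's API:

```go
package crossaz

import (
	"encoding/json"
	"net/http"
)

// fetchPeerMetrics grabs all metrics that a peer throttler (say, the
// us-east-1 one) has already collected, keyed by host name, in a single call.
func fetchPeerMetrics(peerBaseURL string) (map[string]float64, error) {
	resp, err := http.Get(peerBaseURL + "/aggregated/metrics")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	peerMetrics := map[string]float64{}
	err = json.NewDecoder(resp.Body).Decode(&peerMetrics)
	return peerMetrics, err
}

// mergePeerMetrics folds the peer's metrics into the local view, namespaced
// by the peer's availability zone, without ever touching the peer's hosts.
func mergePeerMetrics(local map[string]float64, peerAZ string, peer map[string]float64) {
	for host, value := range peer {
		local[peerAZ+"/"+host] = value
	}
}
```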

Or, we can do functional partitioning on our architectural elements. Some services are disjoint, and we can run different throttlers for different elements, grouped by functionality association. In this design throttlers are again independent of each other.

Going even more granular, there may be a throttler associated with each host, or with each probed service.

A granular throttler design case study

The Vitess tablet throttler combines multiple design approaches to achieve different throttling scopes. The throttler runs on each and every vttablet, mapping one throttler to each MySQL database server. Each such throttler first and foremost collects metrics from its own vttablet host and from its associated MySQL server. Then, the throttlers (or vttablet servers) of a given shard, i.e. a replication topology (primary and replicas), collaborate to form the "shard" throttler. The throttler service on the primary server takes responsibility for collecting the metrics from all of the shard's throttlers and aggregating them as the "shard" metrics.

Thus, a massive write to the primary is normally throttled by replication lag, a metric collected from all serving replicas. Clients consult the primary throttler, which accepts or rejects their requests based on the highest lag among all replicas. In contrast, some operations only require a massive read, which can take place on a specific replica. These reads can pollute the replica's page cache and overload its disk I/O, causing it to lag. This will, however, have no effect on the replication performance of other replicas. The client can therefore make do with checking the throttler on that specific replica, ignoring metrics from other servers. This introduces the concept of a metric's scope, which can be an entire shard or a specific host in this scenario.
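A simplified sketch of how such a scope could be applied to the replication lag metric follows; the types and names are illustrative, not Vitess code:

```go
package scope

// Scope determines which hosts' metrics participate in a check.
type Scope int

const (
	ScopeShard Scope = iota // aggregate over the whole replication topology
	ScopeSelf               // only the specific host being checked
)

// checkLag reports whether a request may proceed under the given scope.
// replicaLag maps a replica host name to its last collected lag, in seconds.
func checkLag(s Scope, selfHost string, replicaLag map[string]float64, threshold float64) bool {
	switch s {
	case ScopeShard:
		// A shard-scoped check is rejected if any serving replica exceeds the threshold.
		for _, lag := range replicaLag {
			if lag > threshold {
				return false
			}
		}
		return true
	default: // ScopeSelf
		// A self-scoped check ignores metrics from all other servers.
		return replicaLag[selfHost] <= threshold
	}
}
```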

Different shards represent distinct architectural elements, and there is no cross-shard throttler communication. This limits the hosts/services monitored by any single throttler to a sustainable number.

It's important to remember that any cross-throttler communication introduces layered collection polling intervals and reduces granularity, as discussed above.

Reducing the throttler's impact

Can the throttler itself generate load on the system? Let's avoid the infinite recursion trap of throttle-the-throttler and consider the usage pattern.

The throttler can introduce load on the system by over-probing for metric data, as well as by over-communicating between throttler nodes. Clients can introduce load by overwhelming the throttler with requests.

To begin with the low-hanging fruit, busy loops should be avoided at all times. A rejected client should sleep for a pre-determined amount of time before rechecking the throttler. Conversely, depending on the metric, a client might get a free pass for a period of time after a successful check. Consider the case of replication lag: if a client checks for lag and the lag is 0.5sec while the threshold is 5sec, then checks over the next 4.5sec (up to metric collection granularity) are guaranteed to succeed and can be skipped.
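A tiny sketch of that free-pass calculation (illustrative, not any particular client library):

```go
package freepass

import "time"

// freePassWindow returns for how long a client may skip further throttler
// checks after a successful one, for the replication lag metric.
func freePassWindow(lag, threshold time.Duration) time.Duration {
	if lag >= threshold {
		return 0
	}
	return threshold - lag
}
```

With the numbers above, freePassWindow(500*time.Millisecond, 5*time.Second) comes out at 4.5 seconds, during which the client can proceed without rechecking.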

The metrics should be collected at a granularity appropriate to the situation. We mentioned the use case of massive background jobs. A job can take a few hours to run, but there may also be a few hours' break between two jobs. During that break, there isn't strictly a need for the throttler to collect data at a high rate, and likewise, no strict need for throttlers to communicate with each other at a high rate. The throttler can choose to slow down based on the lack of requests. It could either stop collecting metrics altogether and go into hibernation, or it might just slow down its normal pace. It would take a client checking the throttler to re-ignite the high-frequency collection of metrics.
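A minimal sketch of such hibernation logic might look as follows; the intervals and names are arbitrary choices for illustration:

```go
package hibernate

import (
	"sync/atomic"
	"time"
)

const (
	activeInterval    = 250 * time.Millisecond // high-frequency collection while clients are active
	hibernateInterval = time.Minute            // slow pace while no one is asking
	hibernateAfter    = 10 * time.Minute       // no client checks for this long => slow down
)

// lastCheck holds the time (unix nanoseconds) of the most recent client check.
var lastCheck atomic.Int64

// OnClientCheck is called for every incoming client check; it is what
// re-ignites high-frequency collection on the next round.
func OnClientCheck() {
	lastCheck.Store(time.Now().UnixNano())
}

func collectionInterval() time.Duration {
	idle := time.Since(time.Unix(0, lastCheck.Load()))
	if idle > hibernateAfter {
		return hibernateInterval
	}
	return activeInterval
}

// CollectLoop re-evaluates its pace after every round, so a single client
// check is enough to bring collection back up to speed.
func CollectLoop(collect func()) {
	for {
		collect()
		time.Sleep(collectionInterval())
	}
}
```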

This does come at a cost, because the very first check, and likely also the next few checks, will run on stale data and potentially reject requests that would otherwise be accepted. However, we defer to the expected client retry mechanism, which will induce another check, by which time the throttler is again fully engaged and has up-to-date metrics.

Another caveat is that with a distributed throttler design, throttlers that depend on each other should be able to inform each other upon being checked. All throttlers that communicate with each other should re-ignite upon the first request to any of them.

Metric hibernation case study

Metric collection is not the only process that can hibernate. Metric generation itself can hibernate, too. This may sound ludicrous at first: aren't metrics just there? Let us discuss replication heartbeats.

In part 1, we elaborated on the role of replication lag as the single most used throttling metric. While there are alternative techniques, it has been established in the MySQL world that the most reliable way to evaluate replication lag is by injecting timestamps into a dedicated table on the primary server, then reading the replicated value on a replica and comparing it with that replica's system time. This technique works well when the replicas are working well, when they are lagging, when replication is stopped, and even when replication gets broken or misconfigured.

There must be some process to routinely inject heartbeats on the primary database server (pt-heartbeat is a popular tool for doing so). You'd need to ensure heartbeats are only ever written on the primary, and handle failover and promotion situations.

The injection interval dictates the lag metric granularity, and you need to balance the desire for the most granular replication lag metric with the overhead of generating those writes.
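A rough sketch of heartbeat injection and lag measurement in this spirit follows. The heartbeat schema, table name and isPrimary guard are assumptions, in the vein of pt-heartbeat rather than its actual implementation:

```go
package heartbeat

import (
	"database/sql"
	"time"
)

// InjectLoop writes a heartbeat row once per interval, but only while this
// server is the primary: heartbeats must never be written on a replica,
// and the isPrimary guard is how we'd handle failover/promotion here.
func InjectLoop(db *sql.DB, interval time.Duration, isPrimary func() bool) {
	for range time.Tick(interval) {
		if !isPrimary() {
			continue
		}
		// A single-row table keyed by a constant id keeps the table tiny;
		// every write still produces a binary log event.
		if _, err := db.Exec(`REPLACE INTO heartbeat.heartbeat (id, ts) VALUES (1, NOW(6))`); err != nil {
			continue // in a real setup this error should be reported on
		}
	}
}

// LagOnReplica reads the replicated heartbeat and compares it with the
// replica's own clock, yielding the lag estimate in seconds.
func LagOnReplica(db *sql.DB) (float64, error) {
	var lagSeconds float64
	err := db.QueryRow(
		`SELECT TIMESTAMPDIFF(MICROSECOND, ts, NOW(6)) / 1e6 FROM heartbeat.heartbeat WHERE id = 1`,
	).Scan(&lagSeconds)
	return lagSeconds, err
}
```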

But then there is another impact to heartbeat generation using this technique: the heartbeat events are persisted in the binary logs, which are then re-written on the replicas. For some users, the introduction of heartbeats causes a significant increase in binlog generation. With more binlog events having to be persisted, more binary log files are generated per given period of time. These consume more disk space. It is not uncommon to see MySQL deployments where the total size of binary logs is larger than the actual data set. Furthermore, you may wish to retain the binary logs for an extended period of time, for recovery/auditing purposes, and you may wish to back them up. You'll need larger disks, more backup storage space, and this translates to expenses.

It thus also makes sense to avoid generating those heartbeats, or to generate them much less frequently, when they are not absolutely needed. When no massive background job takes place, we enjoy little to no overhead, and when there is a massive operation, the heartbeat events are but a small overhead compared with the amounts of data written to the binary logs.

Like throttler hibernation, these heartbeats must be re-ignited at the right time. Like throttler hibernation, the first few checks will read outdated heartbeats and are likely to be rejected. It may take a few seconds to reach fully active operation, where the throttler has re-engaged, heartbeats are generated again, and replication has caught up with at least the very first re-generated heartbeats; clients must be prepared for some retries.

To be continued

In the next and final part of this series we will discuss clients, prioritization, starvation scenarios, and more.