
Anatomy of a Throttler, part 3

By Shlomi Noach

This is the last installment of a three-part blog series. In case you missed it, you can catch up by reading part one and part two. In this conclusion, we discuss the throttler's clients and their identities, cooperation, prioritization, and constraint issues.

Identifying clients

Our focus continues to be on asynchronous, batch, and massive operations, such as ETLs, data imports, and schema changes. The components that invoke those operations are the throttler's clients. These components need to break the operation down into reasonably small subtasks and periodically check the throttler for permission to proceed. This is a cooperative model where the client asks for permission; we discuss an alternative design later on. But first, we'd like to propose that clients should identify themselves to the throttler.
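As a minimal sketch of that cooperative loop, consider a client that processes its work in small chunks and asks the throttler before each one. The names here (`run_batches`, `check_throttle`, the `"daily-etl"` identity) are hypothetical, not any particular throttler's API:

```python
import time

def run_batches(chunks, check_throttle, backoff=0.05):
    """Process chunks one at a time, asking the throttler for permission
    before each chunk; back off and retry when denied.

    `check_throttle` is a hypothetical callable that returns True when the
    client may proceed. Note the client passes its own identity."""
    done = 0
    for chunk in chunks:
        while not check_throttle("daily-etl"):  # identify ourselves to the throttler
            time.sleep(backoff)                 # denied: back off, then ask again
        # Permission granted: perform one small unit of work.
        done += len(chunk)
    return done
```

The key property is that each chunk is small, so a denial only ever delays a small amount of pending work.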

If only for the very high level purpose of analyzing or investigating an incident or to just be able to draw some metrics, you generally want to know which operations were being throttled at what time. You want to be able to tell that the daily aggregation ETL job was mostly throttled between 07:00 and 07:25. Or that around 12:00 the throttler was handling requests from multiple clients, including a data import, a schema migration over the customers table, and an hourly cleanup job.

But client identification also serves operational purposes. Is it possible to prioritize one specific job over others? Or perhaps tune one down, or put it altogether on-hold for a while? How about prioritizing a category of clients? Such questions may only be answered if we can clearly tell any two clients apart.

Does it even make sense to prioritize client requests? Let's begin with what looks like an extreme scenario, one which will highlight the risks of prioritization.

Exemption and starvation

In a cooperative model, clients ask the throttler for permission. A rogue client might neglect to connect to the throttler and just go ahead and send some massive workload. Or perhaps the throttler has a mechanism with which to exempt requests from a specific client. The end result is the same: all clients play nicely by the rules, but one gets a free pass to operate without limitation.

Going back to replication lag, let's assume the client's workload is such that it exhausts resources and causes replication lag to spike to many minutes, well beyond the throttler's threshold. Nothing pushes back on this client, and it continues to hammer the database for hours. During that time, requests from all other clients are continuously rejected. This is a starvation scenario.

Exemption is risky because it not only blocks the operation of other players but can also degrade system performance, going against the very reason for the throttler's existence. In some sense it breaks the rules; and yet, it has its place, as discussed later on.

Prioritization

A safer way is to play within the rules. Instead of exempting, we can consider prioritization, or rather de-prioritization. This can be done with a dice roll: a client asks the throttler for permission. The throttler can choose to roll a die, and if the result is, say, 1 or 2, flat-out reject the request, irrespective of the system metrics. In effect, we designate a ratio of requests to be rejected.

A rejected client will back off, sleep for a while, then try again. The database is therefore less busy, at the expense of pushing back potential client work. But if we can selectively choose to have a high rejection ratio to one client, while having a low (or zero) rejection ratio to a second client, then we've effectively prioritized the second over the first: the first client will spend more time backing off, even if the database metrics are healthy. During such time, the second client will have more opportunity to do its own work.

It's important to highlight that both clients still play by the rules: neither is given permission to act if the database has unhealthy metrics. It's just that one sometimes doesn't even get the chance to check those metrics.

In another model, one could configure the throttler to reject a ratio of requests for all clients, and then set a lower, or zero, rejection ratio for a particular client. Thus, a safe way to prioritize one client over others is to de-prioritize all other clients.
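Both schemes reduce to a per-client rejection ratio plus a default. A sketch, with hypothetical client names and ratios chosen purely for illustration:

```python
import random

# Hypothetical per-client rejection ratios. A client absent from this
# table falls back to the default ratio.
REJECT_RATIO = {
    "bulk-import": 0.8,  # heavily de-prioritized
    "online-ddl": 0.0,   # never pre-rejected
}
DEFAULT_RATIO = 0.2

def check_access(client, metrics_healthy, rng=random.random):
    """Dice roll first: a ratio of this client's requests is rejected
    outright, before metrics are even consulted."""
    if rng() < REJECT_RATIO.get(client, DEFAULT_RATIO):
        return False
    # Everyone still plays by the rules: no client proceeds
    # when the metrics are unhealthy.
    return metrics_healthy()
```

Note that a de-prioritized client is never granted anything extra; it only loses some chances to ask, which is what makes this safer than exemption.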

Throttling on different metrics

Does it make sense for different clients to throttle based on different metrics? For example, one client might throttle based on replication lag, while a second client throttles based on both replication lag and load average.

Looking closely, this is an exemption scenario. While the second client throttles based on load average, the first client is effectively exempted from checking load average. If that first client's workload is such that it does indeed push load average beyond its threshold, then the second client becomes starved. It never gets a chance to operate.

And yet, this is nuanced. Not all jobs are created equal. Some copy data, others purge data. Some work on busy tables with high write contention; others deal with old data that is no longer in memory. These different jobs have a different impact on the system. In practice, the engineer or administrator will be familiar with the kind of impact a specific job has, and can explicitly assign a specific metric to that job.

Does that mean we necessarily need to apply the same metric to all other clients? Logically, yes; in practice, no. In our example above, the first client, exempted from checking load average, might not have much of an impact on that metric in the first place. If load average were high even without the first client, throttling that client may have no effect at all, so we may as well let it complete its job. These are practical considerations we should weigh as we operate our system.
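One way to express such assignments is a per-client list of metrics, with unlisted clients checked against every metric. The client names, metric names, and thresholds below are all illustrative:

```python
# Hypothetical per-client metric assignments: each client is throttled
# only on its own set of metrics.
CLIENT_METRICS = {
    "purge-job": ["replication_lag"],                 # exempt from load average
    "copy-job":  ["replication_lag", "load_average"],
}
THRESHOLDS = {"replication_lag": 5.0, "load_average": 16.0}

def metrics_ok(client, current):
    """`current` maps metric name -> measured value. Unknown clients are
    checked against all metrics, the conservative default."""
    for metric in CLIENT_METRICS.get(client, THRESHOLDS):
        if current[metric] > THRESHOLDS[metric]:
            return False
    return True
```

The conservative default for unknown clients matters: an operator must opt a job *out* of a metric explicitly, which keeps accidental exemptions from creeping in.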

Where exemption makes sense

Nothing lasts forever. If a client is starved for 10 minutes out of a total runtime of 12 hours, this may not be a big deal.

If a task absolutely has to run at all costs (e.g. fixing an incident) and that pushes resources beyond what we want to see in normal times, so be it.

If the client is an essential part of the system itself, goes through the throttling mechanism due to data flow design, and does not handle massive data changes, then we may, and should, exempt it altogether.

Categorization and breakdown

It is further beneficial if a client can identify itself on different levels. For example, in Vitess, a client may be identified as d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vcopier:vreplication:online-ddl. This is a vreplication job with ID d666bbfc_169e_11ef_b0b3_0a43f95f28a3, specifically running the vcopier flow, on behalf of an online-ddl schema migration.

With this identity scheme, it is possible to categorically prioritize (or de-prioritize) all online-ddl jobs, or just this very specific job, or alternatively altogether exempt all vcopier flows.
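A colon-separated identity like the one above lends itself to simple component matching. A sketch of rule lookup under that assumption (the rule ordering and ratios are hypothetical; a real implementation would define its own specificity rules):

```python
def ratio_for(client_id, rules, default=0.0):
    """Return the rejection ratio of the first rule whose component
    appears in the client's colon-separated identity.

    `rules` is a list of (component, ratio) pairs, ordered most
    specific first, so a job-specific rule wins over a categorical one."""
    parts = client_id.split(":")
    for component, ratio in rules:
        if component in parts:
            return ratio
    return default

IDENT = "d666bbfc_169e_11ef_b0b3_0a43f95f28a3:vcopier:vreplication:online-ddl"
```

With rules for the specific job ID, for `online-ddl`, and for `vcopier`, the same lookup serves both the "this very specific job" case and the categorical cases.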

Observability-wise, this makes it easier to analyze throttler access patterns by categories of requests.

Nothing lasts forever

Jobs and operations eventually complete. But it's also a good idea to put a time limit on any rules you may have set. If you've exempted a category of clients, it's best if that exemption expires at some stage. It can be useful to de-prioritize all jobs for a couple of hours during rush hour if some unexpected workload shows up, or for the duration of an ongoing investigation.
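One simple way to sketch this: attach a time-to-live to every rule, and treat an expired rule as if it were never set. The class and method names here are illustrative:

```python
import time

class ExpiringRules:
    """Throttler rules that automatically lapse: every rule carries a
    time-to-live, so no exemption or de-prioritization lasts forever."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._rules = {}  # client -> (ratio, expires_at)

    def set(self, client, ratio, ttl_seconds):
        self._rules[client] = (ratio, self._clock() + ttl_seconds)

    def get(self, client, default=0.0):
        rule = self._rules.get(client)
        if rule is None or self._clock() >= rule[1]:
            return default  # missing or expired rules fall back to the default
        return rule[0]
```

Expiry on read (rather than a background sweeper) keeps the sketch small; the operational point is only that a forgotten rule cannot outlive its intended window.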

Cooperation vs. enforcement

We've discussed the potential for rogue (or malfunctioning) clients to skip throttler checks. This is a possible scenario in the cooperative design. An alternative, enforcement-based design puts the throttler between the client and the system: the throttler runs as a proxy, or integrates with an existing proxy, so that it can throttle client requests directly. One example is the Vitess transaction throttler, which can actively delay database query execution when system performance degrades. Clients cannot go around the throttler, and may not even be aware of its existence. The trade-off is that identifying clients becomes more complicated: the throttler must rely on domain-specific attributes exposed by the client, connection, or query to tell two clients apart and apply any prioritization.

Conclusion

While we've mostly discussed throttling in database systems, the principles laid out should apply to throttlers in any system or service. Dynamic control of the throttler is absolutely critical, and the ability to prioritize or push back some requests or jobs is just as important in production systems.