Navigation

Blog|Vitess

Consensus algorithms at scale: Part 6 - Completing requests

By Sugu Sougoumarane |

If you’re still catching up, you can find links to each article in the series at the bottom of this article.

The completion of requests is a relatively straightforward topic. However, there are a few concepts to cover before we jump into the more complex part of propagating requests during a leadership change.

Recap of parts 1-5

  • We have redefined the problem of consensus with the primary goal of solving durability in a distributed system, and are approaching the problem top-down.
  • We have shown a way to make durability an abstract requirement instead of the more rigid approach of using majority quorums.
  • We have defined a high level set of rules that can satisfy the properties of a consensus system while honoring arbitrary (but meaningful) durability requirements.
  • We have shown that conceptualizing leadership change as revocation and establishment leads to more implementation options that existing systems don’t utilize.
  • We have also shown that there exist two fundamentally different approaches to handling race conditions, and covered their trade-offs.

You can find links to all previous blog posts of the series at the bottom of this article.

Consensus system requirements

The primary requirement of a consensus system is that it must not forget a request it has acknowledged as accepted. To fulfill this requirement, the leader must transmit the request to enough nodes such that the durability requirements are satisfied. On the other hand, there must exist a mechanism to cancel the request if the operation fails before the durability requirements are met.

Two-phase protocol

The above requirements can be met by introducing a two-phase protocol. The leader first transmits the payload of the request as tentative to all the nodes. A tentative request is one that can later be completed or canceled.

A follower that is responsible for a leader’s durability should acknowledge receipt of tentative requests. Once the leader receives the necessary acknowledgements from its followers, the request has become durable and cannot be canceled. The leader can then issue messages to complete the tentative request.

We can view a request as having three stages:

  1. Incomplete: The request is in-flight and has not met the durability requirements. Such a request is marked as tentative among the followers. It can later be completed or canceled.
  2. Durable: The request is in-flight, but has met the durability requirements. This is an implicit, but important stage. We can trust that a durable request will never be canceled.
  3. Complete: The request is complete. Followers can mark the request as complete and perform any post-completion materializations as needed.

An optimization: Once a request becomes durable, the leader is free to transmit that request as complete to followers that have not yet received the message as a single step.

A leader that fails to make a request durable can keep retrying, but it must not attempt to cancel the request. This is because an elector could be performing a leadership change and may attempt to propagate the failed request. We will analyze this process in the next post.

Completion and cancellation

Completion and cancellation are mutually exclusive: A request that was completed will never be canceled, and a request that was canceled will never be completed.

When a follower receives a message to complete a request, it can perform the necessary steps to materialize the effects of the request. For example, if the request was meant to change the value of a variable, this change can now be applied.

If a cancellation message is received, the follower can delete the request, as if it never took place.

Timing the response

A leader could respond to the client with a success message as soon as it has become durable. However, it has the option of delaying the acknowledgement until it has also sent the completion message to the followers. Waiting until completion costs two round-trips and is therefore slower than an early response. On the other hand, it improves the performance of quorum reads. This trade-off may be necessary for systems that choose to implement reads using the quorum method. This is a bigger topic that would need a separate post.

For systems that use lock-based failovers, reads can be sent to the current leader instead of performing quorum reads. This allows for the leader to respond as soon as it has received the necessary acknowledgements.

Completion of requests in MySQL

The MySQL semi-sync protocol does not support this two-phase method of completing requests. When a replica receives a request, it immediately applies it. This behavior introduces some corner cases that require mitigation. I have covered some of those in an older post: Distributed durability in MySQL.

There is another MySQL behavior that is problematic: A primary that is restarted after a crash completes all in-flight requests without verifying that they received the necessary acks. This could lead to “split-brain” scenarios.

If time permits, I will make a separate post to cover more details, and on how to handle these scenarios.

Conclusion

In the next post, we will look at request propagation, which will tie all of this together.

Read the full Consensus Algorithms series