
Faster interpreters in Go: Catching up with C++

By Vicent Martí

The SQL evaluation engine that ships with Vitess, the open-source database that powers PlanetScale, was originally implemented as an AST evaluator that operated directly on the SQL AST generated by our parser. Over the past year, we've gradually replaced it with a virtual machine which, despite being written natively in Go, performs similarly to the original C++ evaluation code in MySQL. Most remarkably, the new virtual machine has repeatedly proven easier to maintain than the original Go interpreter, even though it's up to 20x faster. Let's review the implementation choices we've made to get these surprising results.

What's a SQL evaluation engine?

Vitess has been designed for unlimited horizontal scaling. To accomplish this, all queries to a Vitess cluster must go through a vtgate. Since vtgates are essentially stateless, you can deploy as many instances as you want, which lets you grow the capacity of your cluster linearly. The job these gates perform is the most complex part of the whole distributed system: they parse the SQL of the incoming queries and create a shard-aware query plan, which is then executed on one or many of the shards of the cluster. Finally, the results of these per-shard executions are aggregated and returned to the user.

One of the reasons why Vitess works so well in practice (in both performance and ease of adoption) is that every shard in a cluster is backed by a real MySQL instance. Even the most complex SQL queries can be decomposed into simpler statements that are evaluated in the underlying MySQL databases. Hence, the results of these queries always match what you'd expect from querying MySQL directly.

However, SQL queries in the real world can get really wild. We need to support pretty much every kind of query that a normal MySQL instance supports, but we need to evaluate them across several MySQL instances. This means that sometimes we don't get to fall back to MySQL to evaluate all our SQL expressions.

Think of a rather simple query such as this:

SELECT inventory.item_id, SUM(inventory.count), AVG(inventory.price) AS avg_price
FROM inventory
WHERE inventory.state = 'available' AND inventory.warehouse IN ?
GROUP BY inventory.item_id
HAVING avg_price > 100;

Assuming this query is executed in a sharded Vitess cluster, the inventoried items can exist in any of the shards. Hence, our query planner will prepare a plan that queries all shards in parallel, pushing down part of the aggregation to MySQL, and then we'll perform the aggregations (SUM and AVG) locally in the vtgate. The state and warehouse checks in the WHERE clause can and will be executed directly on the MySQL instance that powers each shard. But the last expression, avg_price > 100, applies to the result of the aggregation, which is only available inside Vitess. This is where the Vitess evaluation engine comes in.
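
To make the division of labor concrete, here's a heavily simplified, hypothetical sketch (illustrative Go, not the actual Vitess planner code) of the vtgate-side merge for the query above: each shard returns partial SUM and COUNT aggregates, and the gate combines them and evaluates the HAVING predicate locally.

package main

import "fmt"

// partial is a hypothetical per-shard aggregate row for one item_id:
// each shard's MySQL computes SUM(price) and COUNT(price) locally,
// and the vtgate derives the global AVG(price) from those.
type partial struct {
	itemID   int64
	sumPrice float64
	numPrice int64
}

// mergeAndFilter combines the partial aggregates scattered across the
// shards, then applies the HAVING predicate, which can only be
// evaluated here because it needs the cross-shard aggregate.
func mergeAndFilter(shards [][]partial) map[int64]float64 {
	type acc struct {
		sumPrice float64
		numPrice int64
	}
	merged := map[int64]*acc{}
	for _, rows := range shards {
		for _, r := range rows {
			a, ok := merged[r.itemID]
			if !ok {
				a = &acc{}
				merged[r.itemID] = a
			}
			a.sumPrice += r.sumPrice
			a.numPrice += r.numPrice
		}
	}
	result := map[int64]float64{}
	for id, a := range merged {
		// HAVING avg_price > 100, evaluated by the vtgate itself.
		if avg := a.sumPrice / float64(a.numPrice); avg > 100 {
			result[id] = avg
		}
	}
	return result
}

func main() {
	shard1 := []partial{{itemID: 1, sumPrice: 500, numPrice: 4}}
	shard2 := []partial{{itemID: 1, sumPrice: 120, numPrice: 1}}
	fmt.Println(mergeAndFilter([][]partial{shard1, shard2})) // map[1:124]
}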

Our evaluation engine is an interpreter that supports the majority of the scalar expressions in the SQL dialect used by MySQL. This does not include high-level constructs such as performing a JOIN or the grouping of a GROUP BY (these are performed directly by the planner, as we've seen), but rather the actual sub-expressions that you'd see as the condition of a WHERE clause or in a GROUP BY clause. Any piece of SQL that cannot be lowered by the planner to execute in MySQL is evaluated locally in Go by the engine.

Of course, these SQL sub-expressions are not arbitrarily complex. They are not even Turing complete (as they cannot loop!), so you may think that an expression like avg_price > ? would be trivial to evaluate, but as with most engineering problems, there's a wealth of nuance to doing these things in the real world.

SQL is an incredibly dynamic language full of quirks, and the SQL in MySQL doubly so. We have spent an inordinate amount of time getting every single corner case of SQL evaluation to exactly match MySQL's behavior. In fact, our test suite and fuzzer are so comprehensive that we routinely find bugs in the original MySQL evaluation engine, which we have to fix upstream (like this collation bug, this issue in the INSERT() SQL function, or this bug when searching substrings). Nonetheless, being fully accurate is not enough. For most queries, these expressions are evaluated at least once for every returned row, so to avoid introducing additional overhead, evaluation needs to be as quick as possible.

As discussed earlier, the first version of the evaluation engine in Vitess was an AST-based interpreter, operating directly on top of the SQL AST generated by our parser. This was a very straightforward design that allowed us to focus on accuracy, at the expense of performance. Let's discuss our new design for replacing this interpreter with a fully fledged virtual machine which is both faster and easier to maintain, starting with the basics.

The shapes of an interpreter

For those new to programming language implementations, there are roughly three ways to execute a dynamic language at runtime, in increasing order of complexity and performance:

  1. An AST-based interpreter, where the syntax of the language is parsed into an AST and evaluation is performed by recursively walking each node of the AST and computing the results. (this is the way the evalengine in Vitess used to work; a minimal sketch follows this list!)
  2. A bytecode VM, where the AST is compiled into binary bytecode that can be evaluated by a virtual machine — a piece of code that simulates a CPU, but with higher-level instructions. (this is what we've recently shipped!)
  3. A JIT compiler, in which the bytecode is compiled directly into the host platform's native instructions, so it can be executed directly by the CPU without being interpreted by a Virtual Machine. (we'll talk about this later!)
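
To ground the comparison, here's a minimal, hypothetical sketch of option 1 in Go: each AST node knows how to evaluate itself recursively. This mirrors the general shape of an AST walker, not the actual Vitess evalengine code.

package main

import "fmt"

// node is a hypothetical AST node: evaluation is a recursive walk.
type node interface{ eval() int64 }

type literal struct{ v int64 }

func (l literal) eval() int64 { return l.v }

type add struct{ left, right node }

// Every operation pays for two recursive calls and an interface
// dispatch: this is the per-node overhead a bytecode VM avoids.
func (a add) eval() int64 { return a.left.eval() + a.right.eval() }

func main() {
	// (1 + 2) + 3
	expr := add{add{literal{1}, literal{2}}, literal{3}}
	fmt.Println(expr.eval()) // 6
}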

The first thing to consider here is whether the upgrade from an AST interpreter to a virtual machine makes sense from a performance point of view. Here's an intuition: SQL expressions are incredibly dynamic (when it comes to typing), very high level (when it comes to each primitive operation), and have very little control flow (when it comes to evaluation -- SQL expressions don't really loop, and conditionals are rare; their flow is always linear!). This might lead us to believe that there's no performance to be squeezed out of translating the AST-based evaluation engine into bytecode. The AST is already well suited for high-level operations and type-switching!

This is only superficially true. Lots of programming languages are highly dynamic, and they manage to run in bytecode VMs much more efficiently than with an AST interpreter. A clear example of this is the now ancient transition that Ruby made from its original AST interpreter in MRI to YARV, a bytecode VM. Python also made a similar switch very early on. And you can bet that literally no JavaScript engine uses AST evaluation: even though the goal of these engines is to start running JS as soon as possible, they still compile to bytecode and run it in (very efficient) interpreters before JIT compilation kicks in.

So where’s the advantage of a virtual machine versus an AST interpreter? A lot of it boils down to instruction dispatching, which can be made very fast (more on this later!). But it is true that for SQL expressions, we’re actually going to execute very few instructions. Hence, to squeeze performance out of the VM, we’re going to have to come up with new tricks.

The initial approach I had in mind for our SQL virtual machine was based on Efficient Interpretation using Quickening by Stefan Brunthaler. The idea behind this paper is that dynamic programming languages are very hard to execute efficiently because of the lack of information about types. A simple expression such as a + 1 must be interpreted in a completely different way depending on whether a is an integer, a floating point number, or even a string. To optimize these operations in practice, the paper suggests rewriting the bytecode at runtime, from generic instructions (e.g. a sum operator that needs to figure out the types of its two operands to know how to sum them) into specific instructions which are specialized for the types they operate on (e.g. a sum operator that knows that both operands are integers and can sum them right away).
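
In bytecode terms, quickening looks roughly like this hypothetical Go sketch (not the Vitess implementation): a generic add opcode type-checks its operands once, then overwrites itself in the instruction stream with a specialized opcode, so later executions skip the check.

package main

import "fmt"

type value struct {
	i     int64
	f     float64
	isInt bool
}

const (
	OpAddGeneric = iota // must type-check both operands every time
	OpAddInt            // specialized: both operands known to be ints
)

// quickenedAdd executes the instruction at pc, rewriting the generic
// opcode in place once the operand types have been observed.
func quickenedAdd(code []byte, pc int, a, b *value) value {
	switch code[pc] {
	case OpAddGeneric:
		if a.isInt && b.isInt {
			code[pc] = OpAddInt // rewrite: future executions skip the check
			return value{i: a.i + b.i, isInt: true}
		}
		return value{f: a.f + b.f}
	case OpAddInt:
		return value{i: a.i + b.i, isInt: true}
	}
	panic("unknown opcode")
}

func main() {
	code := []byte{OpAddGeneric}
	a, b := value{i: 2, isInt: true}, value{i: 3, isInt: true}
	fmt.Println(quickenedAdd(code, 0, &a, &b).i) // 5
	fmt.Println(code[0] == OpAddInt)             // true: the bytecode rewrote itself
}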

To do that, a quickening VM needs to figure out at runtime the types of the expressions being evaluated and incrementally rewrite the bytecode into instructions that operate on them. This is very hard to do in practice! But after implementing a good chunk of specialized instructions for the different types of operators in SQL and attempting to rewrite them at runtime, I noticed an opportunity to take the idea even further by making it more efficient and, crucially, simpler.

It turns out that the semantic analysis we perform in Vitess is advanced enough that, through careful integration with the upstream MySQL server and its information schema, it can be used to statically type the AST of the SQL expressions we're executing. This took a lot of effort to implement, but resulted in a big win: since the planner knows the types of the actual inputs that will be used to evaluate each SQL expression, we can derive from those the types of all sub-expressions at compilation time, resulting in bytecode that is already specialized without requiring runtime rewriting.
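
Here's a hypothetical sketch of the effect (all names are illustrative, not the actual Vitess API): because operand types are known when compiling, the compiler can pick the specialized instruction up front, so no generic, type-switching opcode ever reaches the runtime.

package main

import "fmt"

// Hypothetical type tags standing in for the planner's inferred types.
type sqlType int

const (
	typeInt64 sqlType = iota
	typeFloat64
	typeDecimal
)

// compileAdd picks a specialized instruction at compilation time based
// on the statically known operand types.
func compileAdd(left, right sqlType) string {
	switch {
	case left == typeInt64 && right == typeInt64:
		return "add_ii" // int64 + int64, no runtime checks needed
	case left == typeFloat64 && right == typeFloat64:
		return "add_ff" // float64 + float64
	default:
		return "add_dd" // widen both operands to DECIMAL
	}
}

func main() {
	fmt.Println(compileAdd(typeInt64, typeInt64)) // add_ii
}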

Now we just need to implement a Virtual Machine to efficiently interpret the specialized bytecode!

An efficient Virtual Machine in Go

Implementing a VM usually involves a lot of complexity. As we've explained, you have to write a compiler that processes the input expression AST and generates the corresponding binary instructions (you even have to come up with an encoding!), and afterwards you have to implement the actual VM, which decodes each instruction and performs the corresponding operation. And you have to constantly keep these two in sync! Any mismatch between the compiler that emits the bytecode and the VM that executes it is often catastrophic and very hard to debug.

Historically, a bytecode VM has always been implemented the same way: a big-ass switch statement. You decode an instruction, and switch on its type to jump to the operation that needs to be performed. This is often a performance advantage over AST interpreters, because switching in practice is quite fast (particularly when implemented in C or C++, like most VMs are), and it allows execution to happen linearly, without recursion.
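
For contrast with what follows, here's a minimal, hypothetical version of that classic design in Go:

package main

import "fmt"

const (
	OpPush = iota // the operand follows in the instruction stream
	OpAdd
	OpHalt
)

// run is the canonical switch-based dispatch loop: decode an opcode,
// branch, repeat. Real VMs have hundreds of cases in this switch.
func run(code []int64) int64 {
	var stack []int64
	for ip := 0; ; ip++ {
		switch code[ip] {
		case OpPush:
			ip++
			stack = append(stack, code[ip])
		case OpAdd:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
		case OpHalt:
			return stack[len(stack)-1]
		}
	}
}

func main() {
	// 1 + 2
	fmt.Println(run([]int64{OpPush, 1, OpPush, 2, OpAdd, OpHalt})) // 3
}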

This design, however, also has its fair share of shortcomings. Mike Pall, JIT-master extraordinaire and author of LuaJIT, gives a very insightful rundown of these issues in this mailing list post from 2011. Allow me to summarize it for this blog: besides the fact that the VM's instructions need to be kept in sync with the compiler, the actual performance of the main VM loop in a language with many instructions is not great in practice, because compilers usually struggle when compiling massive functions, and these functions are massive. They spill registers all over the place on each branch of the switch, because it's hard to tell which branches are hot and which ones are cold. With all the pushing and popping, the jump into each of the switch's branches often looks more like a function call, so a lot of the performance benefits of the virtual machine dissipate.

Mike was discussing C compilers in that post, but it's safe to assume that these problems are the same for a virtual machine implemented in Go. After a lot of testing, I can assure you that they are actually much worse because the Go compiler is not great at optimization. There’s always a trade-off between optimization and fast compile times, and the Go authors have historically opted for the latter.

One key issue for Go is that the different branches of a switch statement are often reached via binary search instead of a jump table. Jump table optimization for switches was implemented surprisingly late in the compiler, and in practice it is very fiddly, with no way to enforce it. You have to carefully tweak the way the VM's instructions are encoded to ensure the optimization kicks in for the VM's main loop, and you have no way to reliably check whether your virtual machine's dispatch code has been properly optimized besides reviewing the generated assembly yourself.

Clearly, switch-based VM loops are not the state of the art for writing efficient interpreters, neither in Go nor in any other programming language. So what is the state of the art, then? Well, when it comes to Go, it turns out that there's nobody doing fast interpreters right now (at least nobody I can find). The people who are doing interesting work here, such as the wazero WASM implementation, are focusing their performance efforts on JIT compilation. So we're going to have to innovate!

Outside of Go, the most interesting approach for interpreters implemented in C or C++ is continuation-style evaluation loops, as seen in this report from 2021 that applies the technique to parsing Protocol Buffers. This involves implementing all the opcodes of the VM as freestanding functions that take the VM state as an argument, with the return value of each function being a callback to the next step of the computation. It does sound like something expensive and, uh, recursive, but the trick is that newer versions of LLVM allow us to mark functions as forcefully tail-called (via the musttail attribute; see https://en.wikipedia.org/wiki/Tail_call), so the resulting code does not recursively call into the VM loop but instead jumps between the operations, using the freestanding functions as an abstraction to control register placement and spillage. The most recent release of Python, 3.14, actually ships an interpreter based on this design, boasting up to 30% improvement when executing Python code.

Unfortunately, this is not something we can do in Go, because as we discussed earlier, the Go compiler is allergic to optimization. It can sometimes emit tail calls, but it needs to be tickled in just the right way, and this design simply does not work in practice unless the tail calls are guaranteed at compilation time. But what if we kept the same design, with freestanding functions for each instruction, and instead of tail-calling, we forcefully returned control to the evaluation loop after each one? This can be implemented very easily by not emitting our compiled program as "byte code", but instead emitting a slice of function pointers, one for each instruction. The design may be a bit counter-intuitive, but it has a lot of very interesting properties.

First, the VM becomes trivial! It's just a few lines of code that repeatedly call one function after another, with no large switch statement to worry about optimizing. Here's a simplified example, but if you check the actual implementation in Vitess, you'll see that a real virtual machine implementation is hardly more complicated than this.

func (vm *VirtualMachine) execute(p *Program) (eval, error) {
	code := p.code
	ip := 0

	for ip < len(code) {
		// Each instruction is a function pointer: call it and advance
		// the instruction pointer by whatever offset it returns.
		ip += code[ip](vm)
		if vm.err != nil {
			return nil, vm.err
		}
	}
	// Nothing left on the stack: the program produced no value.
	if vm.sp == 0 {
		return nil, nil
	}
	// The result of the expression is whatever sits on top of the stack.
	return vm.stack[vm.sp-1], nil
}

All each instruction needs to return is an offset for the instruction pointer ip. Most instructions return 1, which causes the next instruction to be executed, but by returning other positive or negative values, you can implement all control flow, including loops and conditionals.
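
For instance, here's a hypothetical instruction in the same style (not from the Vitess codebase): a conditional jump expressed purely through the returned offset. The compiler would compute jumpOffset as the length of the block to skip, e.g. the right-hand side of a short-circuiting AND.

func jumpIfZero(jumpOffset int) func(vm *VirtualMachine) int {
	return func(vm *VirtualMachine) int {
		arg := vm.stack[vm.sp-1].(*evalInt64)
		vm.sp--
		if arg.i == 0 {
			return jumpOffset // skip over the short-circuited block
		}
		return 1 // fall through to the next instruction
	}
}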

Besides the greatly simplified virtual machine, the second advantage of this approach is that the compiler also becomes trivial, because there is no bytecode! Instead, the compiler emits the individual instructions directly by pushing "callbacks" into a slice. There are no instruction opcodes to keep track of, no encoding to perform, and nothing to keep in sync with the VM. Developing the compiler means developing the VM simultaneously, which greatly improves iteration speed and prevents a whole class of bugs that often happen when developing virtual machines.

func (c *compiler) emitPushNull() {
	c.emit(func(vm *VirtualMachine) int {
		// Push a SQL NULL onto the evaluation stack...
		vm.stack[vm.sp] = nil
		vm.sp++
		return 1 // ...and move on to the next instruction.
	})
}
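
For reference, here is a minimal set of supporting definitions under which these fragments would compile; the real types in the Vitess evalengine are richer, so treat these as assumptions made for this post.

// Assumes: import "vitess.io/vitess/go/sqltypes"

// eval is a computed SQL value; concrete types like *evalInt64 below
// implement it.
type eval interface{}

// instruction is one compiled step: it mutates the VM and returns the
// offset to add to the instruction pointer.
type instruction func(vm *VirtualMachine) int

type Program struct {
	code []instruction
}

type VirtualMachine struct {
	stack []eval // pre-allocated to the program's maximum stack depth
	sp    int    // stack pointer: index one past the top value
	row   []sqltypes.Value // the input row currently being evaluated
	err   error
}

type compiler struct {
	code []instruction
}

// emit appends a single compiled instruction to the program being built.
func (c *compiler) emit(f instruction) {
	c.code = append(c.code, f)
}

// evalInt64 is a boxed BIGINT value, as used by the negation example
// later in this post.
type evalInt64 struct{ i int64 }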

As you may notice, there’s a bit of a hiccup here when it comes to modeling the instructions for a non-trivial language: if there's no instruction encoding, then we cannot have instructions with arguments.

This is a big problem in a language like C (traditionally used to implement most programming language interpreters), which is why this technique is never seen there. But it's actually not a problem for us, because Go supports closures! We can emit any instruction we want, and the Go compiler will automatically capture its arguments inside the callback. We don't have to think about how to encode our arguments in the bytecode, and in fact, our arguments can be as complex as they need to be: the resulting callback will contain a copy of them, created by the Go compiler. It's essentially a poor man's JIT, aided by the compiler, and it works amazingly well in practice, both performance-wise and ergonomically.

Check out this compiler method that generates an instruction to push a TEXT SQL value from the input row onto the stack:

func (c *compiler) emitPushColumn_text(offset int, col collations.TypedCollation) {
	c.emit(func(vm *VirtualMachine) int {
		// Both offset and col are captured by the closure at compile
		// time: the instruction reads the column from the current row
		// and pushes it as a TEXT value with the right collation.
		vm.stack[vm.sp] = newEvalText(vm.row[offset].Raw(), col)
		vm.sp++
		return 1
	})
}

Both the offset in the input rows array and the collation for the text are statically baked into the generated instruction!

Almost statically typed

With fully static typing for SQL expressions (derived from the type information in the planner), we get to design an extremely efficient virtual machine where every single instruction is specialized for the types of the operands it executes on. This is both the optimal and the simplest design for a VM, because we never have to do type switching during evaluation. But we're dealing with SQL here (or, more accurately, the SQL dialect of MySQL), so not everything is rainbows and unicorns. Very often it's quite the opposite.

Let’s consider this wildly complex SQL expression: -inventory.price. That is, the negation of each of the values in the inventory.price column of our query. We know (thanks to our semantic analysis, and the schema tracker) that the type of the inventory.price column is BIGINT. So what could be the type of -inventory.price? Naive readers without experience in the magical world of SQL may believe the resulting type is BIGINT, but that’s not the case in practice!

The vast majority of the time, the negation of a BIGINT does indeed yield another BIGINT value. But when the actual value of the BIGINT is -9223372036854775808 (i.e. the smallest value that can be represented by a signed 64-bit integer), negating it promotes the value to a DECIMAL, instead of silently truncating it or returning an error. You can see how this can easily throw a wrench into our statically compiled instructions for our virtual machine. Suddenly the static type checking we've computed is no longer valid, because the types of the expression no longer depend on the types of the inputs, but on their actual values. To continue evaluating the result of this negation, we'd always have to type-check again at runtime, defeating the whole point of static typing to begin with.
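
The corner case is easy to see from Go itself: in two's-complement arithmetic, int64 has no positive counterpart for its minimum value, so a naive negation silently wraps around, which is exactly the situation where MySQL promotes the result to a DECIMAL instead.

package main

import (
	"fmt"
	"math"
)

func main() {
	v := int64(math.MinInt64)
	// Two's complement has no positive counterpart for MinInt64, so
	// negation wraps around to the same value instead of overflowing.
	fmt.Println(-v == v) // true: -(-9223372036854775808) wraps
}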

To work around this issue, we're not introducing more type switches at runtime. Instead, we're using a classic trick that can be seen all the time in JIT-compiled code and very rarely, if ever, in virtual machines: de-optimization. There's a small list of expressions where corner cases (e.g. overflow) can result in dynamic typing at runtime. Whenever this happens, we simply bail out of executing in our virtual machine and fall back to the old AST evaluator, which has always performed type switching at runtime. This is very similar to what JIT compilers do when they detect that the runtime type of a value no longer matches the generated code they've emitted: they fall back from the native code to the virtual machine. In our case, we're one step behind, falling back from the virtual machine to the AST interpreter, but the performance implications are the same. This design allows us to keep our interpreter executing statically typed code without any type switches at runtime. Here's an example of what integer negation looks like when compiled:

func (c *compiler) emitNeg_i() {
	c.emit(func(vm *VirtualMachine) int {
		arg := vm.stack[vm.sp-1].(*evalInt64)
		if arg.i == math.MinInt64 {
			// -math.MinInt64 does not fit in an int64: bail out to the
			// AST interpreter, which types values at runtime.
			vm.err = errDeoptimize
		} else {
			arg.i = -arg.i // negate in place on top of the stack
		}
		return 1
	})
}

There is one significant drawback to this approach, however: the code for the AST interpreter can never be removed from Vitess. But this is, overall, not a bad thing. Just like most advanced language runtimes keep their virtual machine interpreter despite having a JIT compiler, having access to our classic AST interpreter gives us versatility. It can be used when we detect that an expression will be evaluated just once (e.g. when we use the evaluation engine to perform constant folding on a SQL expression); in those cases, the cost of compiling and then executing on the VM exceeds that of a single-pass evaluation on the AST. Lastly, when it comes to accuracy, being able to fuzz the AST interpreter and the VM against each other has proven an invaluable tool for detecting bugs and corner cases.
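
Here's a sketch of what that differential check can look like; all helper names are hypothetical stand-ins, not the actual Vitess fuzzer API.

// Hypothetical differential check: run the same expression through
// both engines and flag any divergence in result or error behavior.
func crossCheck(expr Expr, row []sqltypes.Value) error {
	astResult, astErr := evalAST(expr, row) // the classic AST walker
	prog, err := compile(expr)              // static compilation
	if err != nil {
		return err
	}
	vmResult, vmErr := prog.Run(row) // the callback-based VM
	if (astErr == nil) != (vmErr == nil) || !equalEvals(astResult, vmResult) {
		// equalEvals is a hypothetical SQL-aware comparison helper.
		return fmt.Errorf("engines disagree: ast=%v (%v), vm=%v (%v)",
			astResult, astErr, vmResult, vmErr)
	}
	return nil
}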

Conclusion

This technique for virtual machine implementation is not fully novel (I’ve seen it used before for a rules-based authorization engine in the wild!), but as far as I can tell it has never been used in Go. Given the constraints of the language and the compiler, the technique yields spectacular results: the new SQL interpreter in Vitess is just faster. Faster to write, faster to maintain and faster to execute. The benchmarks speak for themselves:

[Figure: evalengine performance in Vitess over time, comparing the three implementations described below]

Here we have a performance comparison of 5 different queries (ranging from very complex to very simple) between three implementations:

  1. old, which is the original AST-based dynamic implementation of the evalengine.
  2. ast, which is the result of adding static type checking to the evalengine and using it to partially optimize the AST evaluator.
  3. vm, which is the callback-based virtual machine implementation discussed in this post.

[Figure: recent evalengine results benchmarked against MySQL]

This is the current performance of our evaluation engine pitted against the native C++ implementation in MySQL. Note that measuring the time MySQL spends in evaluation is very tricky; these are not the total response times for a query, but the result of manual instrumentation in the mysqld server, to ensure a fair comparison.

Raw benchmark data
                                      │     ast      │                 vm                  │                  mysql                   │
                                      │    sec/op    │   sec/op     vs base                │    sec/op     vs base                    │
CompilerExpressions/complex_arith-32    162.75n ± 1%   50.77n ± 1%  -68.81% (p=0.000 n=10)   49.40n ±  5%  -69.64% (p=0.000 n=10+184)
CompilerExpressions/comparison_i64-32    30.30n ± 2%   16.95n ± 1%  -44.08% (p=0.000 n=10)   26.93n ± 22%  -11.12% (p=0.000 n=10+11)
CompilerExpressions/comparison_u64-32    30.57n ± 3%   17.49n ± 1%  -42.78% (p=0.000 n=10)   18.80n ±  9%  -38.53% (p=0.000 n=10+16)
CompilerExpressions/comparison_dec-32    70.75n ± 1%   52.58n ± 2%  -25.68% (p=0.000 n=10)   46.59n ±  5%  -34.14% (p=0.000 n=10+14)
CompilerExpressions/comparison_f-32      53.05n ± 1%   25.65n ± 1%  -51.64% (p=0.000 n=10)   27.75n ± 23%  -47.69% (p=0.000 n=10)
geomean                                  56.30n        28.94n       -48.60%                  31.76n        -43.58%

                                      │    ast     │                   vm                    │
                                      │    B/op    │    B/op     vs base                     │
CompilerExpressions/complex_arith-32    96.00 ± 0%    0.00 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_i64-32   16.00 ± 0%    0.00 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_u64-32   16.00 ± 0%    0.00 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_dec-32   64.00 ± 0%   40.00 ± 0%   -37.50% (p=0.000 n=10)
CompilerExpressions/comparison_f-32     16.00 ± 0%    0.00 ± 0%  -100.00% (p=0.000 n=10)

                                      │    ast     │                   vm                    │
                                      │ allocs/op  │ allocs/op   vs base                     │
CompilerExpressions/complex_arith-32    9.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_i64-32   1.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_u64-32   1.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)
CompilerExpressions/comparison_dec-32   3.000 ± 0%   2.000 ± 0%   -33.33% (p=0.000 n=10)
CompilerExpressions/comparison_f-32     2.000 ± 0%   0.000 ± 0%  -100.00% (p=0.000 n=10)

The results are stark: pre-compiled SQL expressions, when run in the new VM, are up to 20x faster than the first implementation of SQL evaluation in Vitess, and in most cases we've caught up with the performance of the C++ implementation in MySQL. One further detail which is not shown in the graphs, but can be seen in the raw benchmark data, is that the new virtual machine does not allocate memory to perform evaluation — a very nice side effect of the fully specialized instructions, thanks to the static type checking.

Overall, we consider getting into the same performance ballpark as MySQL's C++ evaluation engine a huge engineering success, particularly when the resulting implementation is so easy to maintain. There will always be a performance gap between Go and C++, arising from the trade-off between code quality and compilation speed in the Go compiler, and from the semantics of the language itself, but as we show here, this gap is not insurmountable. With expertise and careful design, it is possible to reap the many benefits of developing and deploying Go services without paying the performance penalty inherent to the language. In this specific case, we got there by having the capacity to perform semantic analysis and statically type SQL expressions (something which MySQL does not do), and by choosing an efficient virtual machine design that plays to the strengths of Go instead of fighting its limitations.

Addendum: So why not JIT?

Inquiring minds may be wondering: what's next? Are we doing JIT compilation next? The answer is no. Although this design for a compiler and VM looks in theory like an exceptional starting point for implementing a full JIT compiler, in practice the trade-off between optimization and complexity doesn't make sense. JIT compilers matter for programming languages whose bytecode operations can be optimized into a very low level of abstraction (e.g. where an "add" operator only has to perform a native x64 ADD). In those cases, the overhead of dispatching instructions becomes so dominant that replacing the VM's loop with a block of JITted code makes a significant performance difference. However, for SQL expressions, and even after our specialization pass, most of the operations remain extremely high level (things like "match this JSON object with a path" or "add two fixed-width decimals together"). The overhead of instruction dispatch, as measured in our benchmarks, is less than 20% (and can possibly be optimized further in the VM's loop). 20% is not the kind of number you target by messing around with raw assembly in a JIT. So at this point, my intuition is that JIT compilation would be a needlessly complex dead-end optimization.