Generics can make your Go code slower

Go 1.18 is here, and with it, the first release of the long-awaited implementation of Generics is finally ready for production usage. Generics are a frequently requested feature that has been highly contentious throughout the Go community. On the one side, vocal detractors worry about the added complexity. They fear the inescapable evolution of Go towards either a verbose and Enterprisey Java-lite with Generic Factories or, most terrifyingly, a degenerate HaskellScript that replaces ifs with Monads. In all fairness, both these fears may be overblown. On the other side, proponents of generics believe that they are a critical feature to implement clean and reusable code at scale.

This blog post does not take sides in that debate, or advise where and when to use Generics in Go. Instead, this blog post is about the third side of the generics conundrum: It’s about systems engineers who are not excited about generics per se, but about monomorphization and its performance implications. There are dozens of us! Dozens! And we’re all due for some serious disappointment.

The generics implementation in 1.18

There are many different ways to implement parametric polymorphism (what we usually call “generics”) in a programming language. Let us briefly discuss this problem space to understand the solution that has been implemented in Go 1.18. Since this is a blog post about Systems Engineering, we’ll make this type-theory discussion light and painless. And instead of technical terms, we’ll use the word “things” often.

Say you want to create a polymorphic function, i.e. a function that operates on different things indistinctly. There are, broadly speaking, two ways to go about this.

The first way is making all the things that the function will operate on look and act the same way. This approach is called “boxing”, and it usually involves allocating the things on the heap and just passing pointers to them to our function. Since all the things have the same shape (they’re pointers!), all we need to operate on them is knowing where the methods for those things live. Hence, the pointers to the things that are passed to our function are usually accompanied by a table of function pointers, often called a “virtual method table” or vtable for short. Does this ring a bell? This is how Go interfaces are implemented, but also dyn Traits in Rust, and virtual classes in C++. These are all forms of polymorphism that are easy to use in practice but are limited by their expressiveness and by their runtime overhead.

The second way to make a function operate on many different things is called “monomorphization.” The name may sound scary, but its implementation is relatively straightforward. It boils down to creating a different copy of the function for every unique thing it must operate on. That’s it, really. If you have a function that adds two numbers, and you call it to add two float64s, the compiler creates a copy of the function and replaces the generic type placeholder with float64, then compiles that function. It is by far the easiest way to implement polymorphism (even if sometimes it becomes quite hard to use in practice) and it is also the most expensive – for the compiler.
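To make the idea concrete, here is a sketch (Sum is a hypothetical function, not from any library):

func Sum[T int64 | float64](a, b T) T {
	return a + b
}

// Under full monomorphization, Sum[int64] and Sum[float64] become two
// independent compiled functions: the former uses integer ADD
// instructions, the latter floating-point ADDSD.
var _ = Sum[int64](1, 2)
var _ = Sum(1.5, 2.5) // type argument inferred as float64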

Historically, monomorphization has been the design of choice for implementing Generics in systems languages such as C++, D, or Rust. There are many reasons for this, but it all boils down to trading longer compile times for significant performance gains in the resulting code. When you replace the type placeholders in generic code with their final types before the compiler has performed any optimization passes, you create an exciting universe of optimizations that are essentially impossible when using boxed types. At the very least, you get to de-virtualize function calls and get rid of virtual tables; in the best case scenario, you get to inline code, which in turn enables further optimizations. Inlining code is great. Monomorphization is a total win for systems programming languages: it is, essentially, the only form of polymorphism that has zero runtime overhead, and often it has negative performance overhead. It makes generic code faster.

So, as somebody who works on performance for large Go applications, I admit I was not particularly excited about Generics in Go, really. I was excited about monomorphization, and about the potential for the Go compiler to perform optimizations that it simply can’t do when it’s dealing with interfaces. Cue my disappointment: The Generics implementation in Go 1.18 does not use monomorphization… at least, not quite.

It is actually based on a partial monomorphization technique called “GCShape stenciling with Dictionaries.” The full details behind this technical choice can be found in this design document available in the upstream repository. For the sake of completeness, and to guide this post’s performance analysis, I am going to give a quick summary of it:

The core idea is that, since fully monomorphizing every function call based on its input arguments would result in a very significant amount of code being generated, we can reduce the number of unique function shapes by monomorphizing at a broader level than the type of the arguments. Hence, in this implementation of Generics, the Go compiler performs monomorphization (what they call “stenciling”) based on the GCShape of the arguments, not their type. The GCShape of a type is an abstract concept specific to Go and this implementation of Generics. As the design document states, two concrete types are in the same gcshape grouping if and only if they have the same underlying type or they are both pointer types. The first part of this definition is straightforward: if you have a method that performs, for example, arithmetic on its arguments, the Go compiler will effectively monomorphize it based on their types. The generated code for uint32, using integer arithmetic instructions, will be different from the code for float64, which will use floating point instructions. The generated code for a defined type whose underlying type is uint32 will, on the other hand, be the same as for uint32 itself.

So far, so good. However, the second part of the GCShape definition has huge performance implications. Let me emphasize this: All pointers to objects belong to the same GCShape, regardless of the object being pointed at. This means that a *time.Time pointer has the same GCShape as a *uint64, a *bytes.Buffer and a *strings.Builder. This might make you wonder: “Huh, so, what happens when we want to call a method on these objects? The location of such a method cannot possibly be part of the GCShape!”. Well, the name of the design spoils this for us: The GCShapes do not know about methods, so we need to talk about the dictionaries that accompany them.
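Here is a quick sketch of how concrete types group into gcshapes (Describe and Celsius are hypothetical names):

type Celsius uint32 // a defined type whose underlying type is uint32

func Describe[T any](v T) {}

func shapes() {
	Describe(uint32(1))         // one shape for uint32...
	Describe(Celsius(1))        // ...shared with Celsius: same underlying type
	Describe(float64(1))        // a separate shape for float64
	Describe(new(uint64))       // every pointer type, regardless of pointee,
	Describe(new(bytes.Buffer)) // collapses into the single go.shape.*uint8
}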

In the current implementation of Generics in 1.18, every runtime invocation of a generic function will transparently receive as its first argument a static dictionary with metadata about the arguments being passed to the function. The dictionary will be placed in register AX on AMD64, and on the stack on platforms where the Go compiler doesn’t support register-based calling conventions yet. The full implementation details for these dictionaries are explained in depth in the aforementioned design document, but as a summary, they include all the required type metadata to pass the arguments to further generic functions, to convert them from/to interfaces, and, most relevantly to us, to call methods on them. That’s right: after the monomorphization step, the generated function shape needs to take as a runtime input the virtual method tables for all its generic arguments. Intuitively, we can venture that while this greatly reduces the amount of unique code being generated, this kind of broad monomorphization does not lend itself to de-virtualization, to inlining, or really to any kind of performance optimization.
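As a rough mental model (a hand-written sketch with assumed names, not the runtime’s actual layout), calling a method on a generic pointer argument behaves as if the function had been rewritten like this:

// Sketch: the compiled shape takes a hidden dictionary argument, and
// method calls on the generic pointer are routed through it.
type escapeDict struct {
	writeByte func(recv unsafe.Pointer, c byte) error // stands in for an *itab entry
}

func bufEncodeShape(dict *escapeDict, buf unsafe.Pointer, val []byte) {
	_ = dict.writeByte(buf, '\'') // an indirect call: nothing to de-virtualize or inline
	// ...
}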

In fact, it might seem that for the vast majority of Go code, making it generic will imply making it slower. But before we start sinking into a deep pit of despair, let us run some benchmarks, look at some assembly and verify some behaviors.

Interface inlining

Vitess, the Open-Source distributed database that powers PlanetScale, is a large and complex real-world Go application that serves as a great test-bed for new Go language features, particularly performance-related ones. I happen to have a long list of functions and implementations in Vitess that are currently manually monomorphized (this is a fancy way of saying “copy and pasted, but with different types”). Some of these functions are duplicated because their polymorphism cannot be modeled with interfaces; some others are duplicated because they are performance-critical and having them compile without interfaces provides a measurable performance gain.

Let’s look at a good candidate from this list: the BufEncodeSQL functions in the sqltypes package. These functions have been duplicated to take either a *strings.Builder or a *bytes.Buffer because they perform many calls to the provided buffer, and these calls can be inlined by the compiler if the buffer is passed as an unboxed type, as opposed to an interface. This results in meaningful performance wins in a function that is used extensively throughout the codebase.

Making this code generic is trivial, so let’s do so and compare the generic version of the function against the simple version that takes an io.ByteWriter as an interface.

 
 
 1	func BufEncodeStringI(buf io.ByteWriter, val []byte) {
 2		buf.WriteByte('\'')
 3		for i := 0; i < len(val); i++ {
 4			ch := val[i]
 5			if encodedChar := SQLEncodeMap[ch]; encodedChar == DontEscape {
 6				buf.WriteByte(ch)
 7			} else {
 8				buf.WriteByte('\\')
 9				buf.WriteByte(encodedChar)
10			}
11		}
12		buf.WriteByte('\'')
13	}

The generic version is identical except for its signature:

func BufEncodeStringSQL[W io.ByteWriter](buf W, val []byte) {
	// ... same body as above ...
}

Interface version, assembly excerpt:

0000 TEXT "".BufEncodeStringI(SB)
0000 CMPQ SP, 16(R14)
0004 JLS 212
000a SUBQ $40, SP
000e MOVQ BP, 32(SP)
0013 LEAQ 32(SP), BP
...
0089 MOVQ "".buf+48(SP), CX
008e MOVQ 24(CX), DX
0092 MOVQ "".buf+56(SP), AX
0097 MOVL $92, BX
009c CALL DX
...

Generic version, assembly excerpt:

0000 TEXT "".BufEncodeStringSQL[go.shape.*uint8_0](SB)
0000 CMPQ SP, 16(R14)
0004 JLS 236
000a SUBQ $40, SP
000e MOVQ BP, 32(SP)
0013 LEAQ 32(SP), BP
0018 MOVQ AX, ""..dict+48(SP)
...
0027 MOVQ BX, "".buf+56(SP)
...
008f MOVQ ""..dict+48(SP), CX
0094 MOVQ 64(CX), CX
0098 MOVQ 24(CX), CX
009c MOVQ "".buf+56(SP), AX
00a1 MOVL $92, BX
00a6 CALL CX
...

There is nothing surprising in the assembly for the io.ByteWriter version: all the calls to WriteByte happen through the itab. We’ll review exactly what this means in just a moment. The generic version gets much more interesting, though. The first thing we see is that the compiler has generated a single shape instantiation for the function (BufEncodeStringSQL[go.shape.*uint8_0]). Although it is not shown in the listing above, we have to call the generic function with a *strings.Builder from reachable code; otherwise the compiler won’t generate any instantiations for the function at all:

var sb strings.Builder
BufEncodeStringSQL(&sb, []byte(nil))

Since we called the function with a *strings.Builder as its argument, we see in the generated assembly the shape for *uint8. As explained earlier, all generic calls that take a pointer as a generic argument are stenciled as *uint8, regardless of the object being pointed at. The actual properties of the object – and most importantly, its itab – are stored in the dictionary that is passed to the generic function.

This all matches what we read in the design document: the stenciling process for passing a pointer-to-struct monomorphizes the pointer into a void-like pointer. No other attributes from the pointed-at object are taken into account during monomorphization, and hence, no inlining is possible. The information on the methods for the struct that could be inlined is only available in the dictionary at runtime. This is already a bummer: we’ve seen that the design behind this stenciling doesn’t allow for de-virtualization of function calls, and consequently, it does not provide any inlining opportunities to the compiler. But wait, it gets worse!

There’s an insightful performance analysis that we can make in this chunk of code by comparing the assembly generated to call the WriteByte method in the interface code against the generic code.

Intermission: Calling an interface method in Go

Before we can compare the calls between the two versions of the code, we need a quick refresher on how interfaces are implemented in Go. We’ve briefly touched on the fact that interfaces are a form of polymorphism that involves boxing, i.e. ensuring that all objects we’re operating on have the same shape. This shape, for a Go interface, is a fat 16-byte pointer (iface) where the first half points to the metadata about the boxed value (what we call the itab), and the second half points to the value itself.

type iface struct {
	tab *itab
	data unsafe.Pointer
}

type itab struct {
	inter *interfacetype // offset 0
	_type *_type // offset 8
	hash  uint32 // offset 16
	_     [4]byte
	fun   [1]uintptr // offset 24...
}

The itab contains a lot of information about the type inside an interface. The inter, _type and hash fields contain all the required metadata to allow conversion between interfaces, reflection and to switch on the type of the interface. But what we care about here is the fun array at the end of the itab: although displayed as a [1]uintptr in the type description, this is actually a variable-length allocation. The size of the itab struct changes between specific interfaces, with enough space at the end of the struct to store a function pointer for every method in the interface. These function pointers are what we need to access every time we want to call a method on the interface; they are the Go equivalent to a C++ virtual table.

With this in mind, we can now understand the call assembly for an interface method in the non-generic implementation of our function. This is what line 8, buf.WriteByte('\\'), compiles into:

0089 MOVQ "".buf+48(SP), CX
008e MOVQ 24(CX), DX
0092 MOVQ "".buf+56(SP), AX
0097 MOVL $92, BX
009c CALL DX

To call the WriteByte method on buf, the first thing we need is a pointer to the itab for buf. Although buf was originally passed into our function in a pair of registers, the compiler spilled it onto the stack at the start of the function body so it can use the registers for other things. To call a method on buf, we first must load the *itab from the stack back into a register (CX). Now, we can dereference the itab pointer in CX to access its fields: we move the quadword at offset 24 into DX, and a quick peek at the original definition of itab above shows that, indeed, the first function pointer in the itab resides at offset 24 – this all makes sense so far.

With DX containing the address of the function we want to call, we’re just missing its arguments. A method in Go is syntactic sugar for a freestanding function that takes its receiver as its first argument, e.g. func (b *Builder) WriteByte(x byte) desugars to func "".(*Builder).WriteByte(b *Builder, x byte). Hence, the first argument to our function call must be buf.(*iface).data, the actual pointer to a strings.Builder that lives inside our interface. That pointer is available on the stack, 8 bytes after the tab pointer we’ve just loaded. Finally, the second argument to our function is the literal '\\' (ASCII 92), and we can CALL DX to execute our method.

Phew! That’s some effort to call a simple method. Although in practical performance terms, it’s not so bad. Besides the fact that a call through an interface always prevents inlining, the actual overhead of a call is a single pointer dereference to load the function address from inside the itab. In a moment we’re going to benchmark this to see how expensive that dereference is, but first, let us look at the codegen for the generic code.

Back to generics: Pointer calls

We’re back to the assembly for our generic function. As a reminder, we’re analyzing the generated instantiation shape for *uint8, because all pointer instantiation shapes use the same void-like pointer type. Let’s see what calling the WriteByte method on buf looks like here:

008f MOVQ ""..dict+48(SP), CX
0094 MOVQ 64(CX), CX
0098 MOVQ 24(CX), CX
009c MOVQ "".buf+56(SP), AX
00a1 MOVL $92, BX
00a6 CALL CX

It looks quite familiar, but there’s a stark difference. Offset 0x0094 contains what we don’t want a function call-site to contain: another pointer dereference. The technical term for this is, again, a total bummer. Here’s what’s going on: since we’re monomorphizing all pointer shapes into a single shape instantiation for *uint8, the shape doesn’t contain any information about the methods that can be called on those pointers. Where would that information live? Ideally, it would live in the itab associated to our pointer, but there’s no itab directly associated to our pointer because the shape of our function takes a single 8-byte pointer as its buf argument, as opposed to a 16-byte fat pointer with an *itab and data fields, like an interface would. If you recall, this is the whole reason why the stenciling implementation passes a dictionary to every generic function call: this dictionary contains pointers to the itabs of all the generic arguments to the function.

Alright, so that assembly, with the extra load, makes perfect sense now. The method call begins, instead of by loading the itab for our buf, by loading the dictionary that has been passed to our generic function (and which has also been spilled into the stack). With the dictionary in CX, we can dereference it, and at offset 64 we find the *itab we were looking for. Sadly, we now need yet another dereference (24(CX)), to load the function pointer from inside the itab. The rest of the method call is identical to the previous codegen.

How bad is this extra dereference in practice? Intuitively, we can assume that calling methods on an object in a generic function will always be slower than in a non-generic function that simply takes an interface as an argument, because Generics will devolve what previously were pointer calls into twice-indirect interface calls, ostensibly slower than plain interface calls.

name                      time/op      alloc/op     allocs/op
Monomorphized-16          5.06µs ± 1%  2.56kB ± 0%  2.00 ± 0%
Iface-16                  6.85µs ± 1%  2.59kB ± 0%  3.00 ± 0%
GenericWithPointer-16     7.18µs ± 2%  2.59kB ± 0%  3.00 ± 0%

This simple benchmark tests the same function body with 3 slightly different implementations. GenericWithPointer passes a *strings.Builder to our func Escape[W io.ByteWriter](W, []byte) generic function. The Iface benchmark is for a func Escape(io.ByteWriter, []byte) that takes an interface directly. Monomorphized is for a manually monomorphized func Escape(*strings.Builder, []byte).
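In signature form (hypothetical names; the three bodies are identical):

func EscapeMono(buf *strings.Builder, val []byte)      { /* ... */ } // Monomorphized
func EscapeIface(buf io.ByteWriter, val []byte)        { /* ... */ } // Iface
func EscapeGeneric[W io.ByteWriter](buf W, val []byte) { /* ... */ } // GenericWithPointer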

The results are not surprising. The function that has been specialized to take a *strings.Builder directly is the fastest, because it allowed the compiler to inline the WriteByte calls inside of it. The generic function is measurably slower than the simplest possible implementation taking an io.ByteWriter interface as an argument. We can see that the impact of the extra load from the generic dictionary is not significant, because both the itab and the Generics dictionary will be very warm in cache in this micro-benchmark (however, do keep reading for an analysis of how cache contention affects Generic code).

This is the first insight we can gather from this analysis: there’s no incentive to convert a pure function that takes an interface to use Generics in 1.18. It’ll only make it slower, as the Go compiler cannot currently generate a function shape where methods are called through a pointer. Instead, it will introduce an interface call with two layers of indirection. This is going in the exact opposite direction of what we’d like, which is de-virtualization and, where possible, inlining.

Before we wrap up this section, let’s point out a detail in the Go compiler’s escape analysis: we can see that our monomorphized function has 2 allocs/op in our benchmark. This is because we’re passing a pointer to a strings.Builder that lives on the stack, and the compiler can prove that it doesn’t escape and hence it doesn’t need to be heap allocated. The Iface benchmark shows 3 allocs/op even though we’re also passing a pointer from the stack. This is because we’re moving the pointer into an interface, and that always allocates. Surprisingly, the GenericWithPointer implementation also shows 3 allocs/op. Even though the generated instantiation for the function takes the pointer directly, the escape analysis can no longer prove it as non-escaping, so we get an extra heap allocation.
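That interface-induced allocation is easy to reproduce in isolation (a minimal sketch):

var sb strings.Builder    // could otherwise live on the stack...
var w io.ByteWriter = &sb // ...but boxing &sb makes sb escape to the heap
w.WriteByte('x')

Oh well. That is a small disappointment, but it is now time to move to bigger, better disappointments.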

Generic interface calls

These past few sections we’ve been analyzing the codegen for our generic Escape function by looking at the shape generated when calling the function with a *strings.Builder. If you recall, the generic signature for our method was func Escape[W io.ByteWriter](W, []byte), and *strings.Builder definitely fulfills that constraint, resulting in an instantiation shape for *uint8.

But what would happen if instead we were to hide our *strings.Builder behind an interface?

var buf strings.Builder
var i io.ByteWriter = &buf
BufEncodeStringSQL(i, []byte(nil))

The argument to our generic function is now an interface, instead of a pointer. And the call is clearly valid, as the interface we’re passing is identical to the constraint on our method. But what does the instantiation shape we’re generating look like? We’re not embedding the full disassembly because it gets real noisy, but just like we did before, let us analyze the call sites for the WriteByte methods in the function:

00b6 LEAQ type.io.ByteWriter(SB), AX
00bd MOVQ ""..autotmp_8+40(SP), BX
00c2 CALL runtime.assertI2I(SB)
00c7 MOVQ 24(AX), CX
00cb MOVQ "".buf+80(SP), AX
00d0 MOVL $92, BX
00d5 CALL CX

Big yikes! Compared to our previous codegen, this does look much less familiar. We agreed (and measured) that an extra dereference on each call site was not a good thing, so imagine how we should feel about a whole extra function call.

What’s going on here? We can find the runtime.assertI2I method inside the Go runtime: it’s the helper that asserts a conversion between interfaces. It takes an *interfacetype and an *itab as its two arguments, and returns the itab for the given interfacetype only if the type in the given itab also implements our target interface. Huh, what?

Say you have an interface like this:

type IBuffer interface {
	Write([]byte) (int, error)
	WriteByte(c byte) error
	Len() int
	Cap() int
}

This interface makes no mention of io.ByteWriter or io.Writer, and yet, any type that implements IBuffer also implements these two interfaces implicitly. This has a meaningful impact on the codegen of our generic function: since the generic constraint to our function is [W io.ByteWriter], we can pass as an argument any interface that implements io.ByteWriter – this includes something like IBuffer. But when we need to call the WriteByte method on our argument, where in the itab.fun array for the interface we’ve received does this method live? We don’t know! If we pass our *strings.Builder as an io.ByteWriter interface, the itab in that interface will have our method at fun[0]. If we pass it as an IBuffer, it’ll be at fun[1]. What we need is a helper that can take the itab for an IBuffer and return an itab for an io.ByteWriter, where our WriteByte function pointer is always stable at fun[0].

That’s the job of assertI2I, and that’s what every call site in the function is doing. Let’s break it down step by step.

00b6 LEAQ type.io.ByteWriter(SB), AX
00bd MOVQ ""..autotmp_8+40(SP), BX
00c2 CALL runtime.assertI2I(SB)
00c7 MOVQ 24(AX), CX
00cb MOVQ "".buf+80(SP), AX
00d0 MOVL $92, BX
00d5 CALL CX

First, it loads the interfacetype for io.ByteWriter (which is a hard-coded global, since this is the interface type defined in our constraint) into AX. Then, it loads the actual itab for the interface we passed to our function into BX. These are the two arguments that assertI2I needs, and after calling it, we’re left with the itab for io.ByteWriter in AX, and we can continue with our interface function call like we did in the previous codegen, knowing that our function pointer is now always at offset 24 inside our itab. Essentially, what this shape instantiation is doing is converting every method call from buf.WriteByte(ch) to buf.(io.ByteWriter).WriteByte(ch).
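Written out in plain Go, that per-call conversion is the same interface-to-interface assertion you could perform by hand; assertI2I is the runtime helper backing it:

var w io.Writer = new(bytes.Buffer)
bw := w.(io.ByteWriter) // interface-to-interface assertion, backed by runtime.assertI2I
bw.WriteByte('x')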

And yes, that looks expensive as hell. And yes, it also looks very redundant. Wouldn’t it be possible to acquire the io.ByteWriter itab just once at the start of the function and re-use it in all function calls? Ehh, not in the general case, but there are function shapes where this would be safe to do (like, for instance, the function we’re currently analyzing), since the value inside the buf interface never changes and we don’t need to type-switch or pass down the buf interface to any further functions down the stack. There’s definitely some room for optimization here in the Go compiler. Let’s look at the benchmark numbers to see how much impact such an optimization would have:

name                      time/op      alloc/op     allocs/op
Monomorphized-16          5.06µs ± 1%  2.56kB ± 0%  2.00 ± 0%
Iface-16                  6.85µs ± 1%  2.59kB ± 0%  3.00 ± 0%
GenericWithPointer-16     7.18µs ± 2%  2.59kB ± 0%  3.00 ± 0%
GenericWithExactIface-16  9.68µs ± 2%  2.59kB ± 0%  3.00 ± 0%

That’s not great. The overhead of the assertI2I call is noticeable, even in a function that does more than simply calling other functions. We’re almost twice as slow as the manually monomorphized function that calls WriteByte directly, and roughly 40% slower than simply using an io.ByteWriter interface without generics. This is, by any definition, a performance footgun to be aware of: the same generic function, with the same argument, will be significantly slower if you pass the argument inside an interface as opposed to directly as a pointer.

…But wait! We’re not done here! There are still more fascinating performance details to share, as you may have guessed from the careful naming of our benchmark cases. It turns out that our GenericWithExactIface benchmark is actually the best possible scenario, because the constraint in our function is [W io.ByteWriter] and we’re passing our argument as an io.ByteWriter interface. This means that the runtime.assertI2I call will return immediately with the itab we passed to it – because it matches the itab that our shape instantiation is looking for. But what if we passed our argument as the previously defined IBuffer interface? This should work just fine, because *strings.Builder implements both IBuffer and io.ByteWriter, but at runtime, every single method call inside our function will result in a global hash table lookup when assertI2I tries to acquire an io.ByteWriter itab from our IBuffer argument.
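In code, that scenario is a one-line change at the call site:

var buf strings.Builder
var i IBuffer = &buf
BufEncodeStringSQL(i, []byte(nil)) // W is now IBuffer: every WriteByte hits the itabTable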

name                      time/op      alloc/op     allocs/op
Monomorphized-16          5.06µs ± 1%  2.56kB ± 0%  2.00 ± 0%
Iface-16                  6.85µs ± 1%  2.59kB ± 0%  3.00 ± 0%
GenericWithPointer-16     7.18µs ± 2%  2.59kB ± 0%  3.00 ± 0%
GenericWithExactIface-16  9.68µs ± 2%  2.59kB ± 0%  3.00 ± 0%
GenericWithSuperIface-16  17.6µs ± 3%  2.59kB ± 0%  3.00 ± 0%

Haha, awesome. This is a very cool insight. We’ve upgraded from a performance footgun to a footcannon, and this all depends on whether the interface you’re passing to a generic function matches exactly its constraint or is a super-set of the constraint. This is possibly the most salient point of this analysis: passing interfaces to a generic function in Go is never a good idea. In the best case scenario, if your interface matches exactly the constraint, you’re going to see significant overhead from every method call on your types. In the likely case where your interface is a superset of the constraint, every single method call will have to be resolved dynamically from a hash table, and no caching has been implemented for this functionality.

Before we wrap up this section, here’s something extremely important to take into account when figuring out whether the overhead of Go Generics is suitable for your use case: the numbers shown in this benchmark are best-case values, particularly for interface calls, and not representative of the overhead a function call would have in a real-world application. These micro-benchmarks are being run in a vacuum, where the itab and dictionaries for the generic function are always warm in cache, and the global itabTable that enables assertI2I is empty and uncontended. In an actual production service there is cache contention, and the global itabTable can contain from hundreds to millions of entries, depending on how long your service has been running and the amount of unique type/interface pairs in your compiled code. This means that Generic method call overhead in your Go programs will degrade with the complexity of your codebase. This is nothing new, as the degradation actually affects all interface checks in a Go program, but these interface checks are usually not performed in tight loops like function calls can be.

Is there any way to benchmark this degradation in a synthetic environment? There is, but it’s not very scientific. You can pollute the global itabTable with entries and continuously thrash the L2 CPU cache from a separate Goroutine. This approach can arbitrarily increase the method call overhead on any generic code being benchmarked, but it’s really hard to create a contention pattern in the itabTable that accurately matches what we’d see in a live production service, so the measured overheads are hard to translate into more realistic environments.
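For reference, the cache-thrashing half of that setup can be as simple as this (a crude sketch; the buffer size is an assumption, picked to exceed a typical per-core L2):

func thrashCache(done <-chan struct{}) {
	buf := make([]byte, 2<<20) // 2 MiB, larger than most L2 caches
	for {
		select {
		case <-done:
			return
		default:
		}
		for i := 0; i < len(buf); i += 64 { // touch one byte per cache line
			buf[i]++
		}
	}
}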

Nonetheless, the behavior observed in these benchmarks is still quite interesting. This is the result of a micro-benchmark measuring the method call overhead (in nanoseconds per call) for the different possible method call codegens in Go 1.18. The method being tested has a non-inlined empty body, so this is strictly measuring the call overhead. The benchmark is run three times: in a vacuum, with continuous thrashing of the L2 cache, and with thrashing and a greatly enlarged global itabTable that contains collisions for the itab we’re looking for.

We can see that the method call overhead in a vacuum scales similarly to what we saw in our Escape benchmarks. The interesting behavior happens when we add contention: as we’d expect, the performance of non-generic method calls is not affected by contending for the L2 cache, while there is a small increase in the overhead of all generic code (even the one that doesn’t access the global itabTable – most likely because of the larger runtime dictionaries that must be accessed for all Generic method calls). The truly disastrous combination happens when we increase the size of the itabTable together with the L2 cache thrashing: it introduces a massive amount of overhead behind every method call, because the global itabTable is too large to fit in cache, and the relevant entries are no longer warm. Again, the exact amount of overhead cannot be meaningfully discerned from this micro-benchmark. It depends on the complexity and load of your Go application in production. The important take-away from this experiment is that this spooky action-at-a-distance exists in Generic Go code, so be careful about it, and measure for your use case.

Byte sequences

There’s a very common and recurring pattern in Go codebases, which can even be seen throughout the standard library, where a function that takes a []byte slice as its argument will also have an identical equivalent that takes a string instead.

We can find this pattern all over the place (e.g. (*Buffer).Write vs (*Buffer).WriteString), but the unicode/utf8 package really is a shining example of where this starts becoming an issue: roughly 50% of its API surface consists of duplicated functions that have been manually monomorphized to support both []byte and string.

Bytes           String
DecodeLastRune  DecodeLastRuneInString
DecodeRune      DecodeRuneInString
FullRune        FullRuneInString
RuneCount       RuneCountInString
Valid           ValidString

It’s worth pointing out that this duplication is, in fact, a performance optimization: the API could very well provide only []byte functions to operate on UTF8 data, forcing the users to convert their string inputs to []byte before calling into the package. This would not be particularly un-ergonomic, but it would be very expensive. Since byte slices in Go are mutable and strings are not, converting between them in either direction always[1] forces an allocation.
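For example:

b := []byte("hello, world")
s := string(b)  // copies the bytes into a new immutable string: allocates
b2 := []byte(s) // copies them back into a fresh mutable slice: allocates again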

This significant amount of code duplication sure looks like a juicy target for Generics, but since the code was duplicated in the first place to prevent extra allocations, before we attempt to unify the implementations we must ensure that the generated shape instantiations behave as we’d expect them to.

Let’s compare two different versions of the Valid function: the original one in unicode/utf8 that takes a []byte as its input, and a new, generic one which is constrained with byteseq, a very simple constraint of string | []byte, which should allow us to use the two argument types interchangeably.
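The constraint itself is a one-line union:

type byteseq interface {
	string | []byte
}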

 
 
 1	// Valid reports whether p consists entirely of valid UTF-8-encoded runes.
 2	func Valid(p []byte) bool {
 3		// Fast path. Check for and skip 8 bytes of ASCII characters per iteration.
 4		for len(p) >= 8 {
 5			// Combining two 32 bit loads allows the same code to be used
 6			// for 32 and 64 bit platforms.
 7			// The compiler can generate a 32 bit load for first32 and second32
 8			// on many platforms. See test/codegen/memcombine.go.
 9			first32 := uint32(p[0]) | uint32(p[1])<<8 | uint32(p[2])<<16 | uint32(p[3])<<24
10			second32 := uint32(p[4]) | uint32(p[5])<<8 | uint32(p[6])<<16 | uint32(p[7])<<24
11			if (first32|second32)&0x80808080 != 0 {
12				// Found a non ASCII byte (>= RuneSelf).
13				break
14			}
15			p = p[8:]
16		}
17		n := len(p)
18		for i := 0; i < n; {
19			pi := p[i]
20			if pi < RuneSelf {
21				i++
22				continue
23			}
24			x := first[pi]
25			if x == xx {
26				return false // Illegal starter byte.
27			}
28			size := int(x & 7)
29			if i+size > n {
30				return false // Short or invalid.
31			}
32			accept := acceptRanges[x>>4]
33			if c := p[i+1]; c < accept.lo || accept.hi < c {
34				return false
35			} else if size == 2 {
36			} else if c := p[i+2]; c < locb || hicb < c {
37				return false
38			} else if size == 3 {
39			} else if c := p[i+3]; c < locb || hicb < c {
40				return false
41			}
42			i += size
43		}
44		return true
45	}

The generic version is identical except for its signature:

func ValidGeneric[B byteseq](p B) bool {
	// ... same body as above ...
}

Non-generic version, assembly excerpt:

0013 JMP 48
...
002d MOVQ DX, CX
0030 CMPQ BX, $8
0034 JLT 72
0036 MOVL (AX), DX
0038 ORL 4(AX), DX
003b NOP
0040 TESTL $-2139062144, DX
0046 JEQ 21
0048 XORL CX, CX
004a JMP 79
004c MOVQ R8, CX
004f CMPQ CX, BX
0052 JGE 362
0058 JCC 411
005e MOVBLZX (CX)(AX*1), DX
0062 CMPB DL, $-128
0065 JCC 126
0067 LEAQ 1(CX), DX
006b LEAQ "".first(SB), SI
0072 LEAQ "".acceptRanges(SB), R9
0079 MOVQ DX, R8
007c JMP 76
...
019b MOVQ CX, AX
019e MOVQ BX, CX
01a1 CALL runtime.panicIndex(SB)
01a6 XCHGL AX, AX
...

Generic []byte shape, assembly excerpt:

0013 JMP 48
...
002d MOVQ DX, DI
0030 CMPQ CX, $8
0034 JLT 72
0036 MOVL (BX), DX
0038 ORL 4(BX), DX
003b NOP
0040 TESTL $-2139062144, DX
0046 JEQ 21
0048 XORL AX, AX
004a JMP 79
004c MOVQ R8, AX
004f CMPQ AX, CX
0052 JGE 362
0058 JCC 405
005e MOVBLZX (AX)(BX*1), DX
0062 CMPB DL, $-128
0065 JCC 126
0067 LEAQ 1(AX), DX
006b LEAQ "".first(SB), SI
0072 LEAQ "".acceptRanges(SB), R9
0079 MOVQ DX, R8
007c JMP 76
...
0195 CALL runtime.panicIndex(SB)
019a XCHGL AX, AX
...

Before we look at the shape for our new generic function, there are a few optimization details in the non-Generic codegen which we ought to review, so we can verify they survive the generic instantiation process. We can see two nice optimizations and another not-so-good one: First, the register-based Go calling convention introduced in 1.17 behaves nicely here with our []byte argument. Instead of being pushed to the stack, the 24 bytes of the slice header that this function receives are passed as three separate 8-byte values in 3 registers: the *byte pointer for the slice resides in AX throughout the function body and its length resides in BX, and they never spill. We can see relatively complex expressions such as len(p) >= 8 compile into CMPQ BX, $8 because of this efficient register usage. Likewise, the 32/64 bit load from p is properly optimized into a MOVL + ORL from AX.

The only irksome detail in this compiled function happens in the main for loop: the pi := p[i] load on line 19 has a bounds check that should have been made superfluous by the i < n check in the loop header just above. We can see in the generated assembly that we’re actually chaining two jumps one after the other: a JGE (which is a signed comparison instruction) and a JCC (the Go assembler’s spelling of JAE, an unsigned comparison instruction). This is an insidious problem, arising from the fact that the return value of len in Go is signed, and probably worth its own blog post.

Either way, the non-Generic codegen for this Valid function is looking quite good overall. Let’s compare it with the generic instantiation! We’re only looking at the shape for a []byte argument here; calling the generic function with a string argument will generate a different shape, as the memory layout of these two types is different (16 bytes for string, 24 for []byte), even though their usage in the two instantiated shapes would be identical, as we’re accessing the byte sequence in a read-only way.

…And the result is… It’s good! It’s very good actually. We’ve found a use case where Generics can help with code de-duplication without showing a performance regression. This is exciting! Going from top to bottom, we see that all the optimizations hold (and this is also the case for the string shape, not shown here). The register-based calling convention survives the generic instantiation, although do note that the length of our []byte argument now resides in CX instead of BX: all registers have shifted right one slot, because AX is now occupied by the runtime dictionary of the Generics implementation.

Everything else is neat as a pin: the 32/64 bit load is still two instructions, the few bounds checks that were elided in the non-Generic version are still elided here, and no additional overhead has been introduced anywhere.

A quick benchmark of the two implementations verifies our reading:

name                             time/op
Valid/Japanese/Bytes-16          2.63µs ± 2%
Valid/Japanese/GenericBytes-16   2.67µs ± 1%
Valid/Japanese/String-16         2.48µs ± 2%
Valid/Japanese/GenericString-16  2.53µs ± 0%
Valid/ASCII/Bytes-16              937ns ± 1%
Valid/ASCII/GenericBytes-16       943ns ± 1%
Valid/ASCII/String-16             930ns ± 3%
Valid/ASCII/GenericString-16      811ns ± 2%

The performance difference between the two implementations is within the margin of error, so this is indeed a best case scenario: the []byte | string constraint can be used in Go generics to reduce code duplication in functions that process byte sequences without introducing any extra overhead. There is one interesting exception here: the Generic shape for string is measurably faster (~4%) than the non-Generic implementation when running the ASCII benchmark, even though their assemblies are functionally identical. However, the shape for []byte has the same performance as the non-Generic code for all benchmarks, again with identical assemblies. This is a puzzling artifact that can be reproduced reliably, but only when benchmarking ASCII input.

Function Callbacks

Since its very first release, Go has had very good support for anonymous functions. They are a core part of the language, and they increase its expressiveness by enabling many patterns that would otherwise be very verbose, all without changing the language’s syntax. For instance, user code cannot be extended to allow the range operator to be called on a custom struct or interface. This means that to support iteration, our data structures need to implement custom Iterator structs (with significant overhead), or have an iteration API based on function callbacks, which are often faster. Here’s a small example that uses a function callback to iterate through all the valid runes (i.e. Unicode codepoints) in a UTF-8-encoded byte slice:

func ForEachRune(p []byte, each func(rune)) {
	np := len(p)
	for i := 0; i < np; {
		c0 := p[i]
		if c0 < RuneSelf {
			each(rune(c0))
			i++
			continue
		}
		x := first[c0]
		if x == xx {
			i++ // invalid.
			continue
		}
		size := int(x & 7)
		if i+size > np {
			i++ // Short or invalid.
			continue
		}
		accept := acceptRanges[x>>4]
		if c1 := p[i+1]; c1 < accept.lo || accept.hi < c1 {
			size = 1
		} else if size == 2 {
			each(rune(c0&mask2)<<6 | rune(c1&maskx))
		} else if c2 := p[i+2]; c2 < locb || hicb < c2 {
			size = 1
		} else if size == 3 {
			each(rune(c0&mask3)<<12 | rune(c1&maskx)<<6 | rune(c2&maskx))
		} else if c3 := p[i+3]; c3 < locb || hicb < c3 {
			size = 1
		} else {
			each(rune(c0&mask4)<<18 | rune(c1&maskx)<<12 | rune(c2&maskx)<<6 | rune(c3&maskx))
		}
		i += size
	}
}

Without looking at any benchmarks: how well do you think this function fares compared to a more idiomatic iteration using for _, cp := range string(p)? Right, it doesn’t quite keep up. The reason is that the range loop over a string keeps the body of its iteration inline, so the best case scenario (a pure ASCII string) can be handled without any function calls. Our custom function, on the other hand, must issue a callback for every single rune.

If we could somehow inline our each callback for our function, we’d be competitive with a range loop for ASCII strings, and probably even faster for Unicode strings! Alas, what would it take for the Go compiler to inline our callback? It’s a very hard problem to solve in the general case. Think about it. The callback we’re passing is not executed in our local function. It is executed inside of ForEachRune, as part of the iteration. In order for the callback to be inlined inside of the iterator, we’d have to instantiate a copy of ForEachRune with our specific callback. But the Go compiler won’t do that. No sensible compiler is going to generate more than one instance of a pure function. Unless…

Unless we trick the compiler into doing it! Because this sure sounds a lot like monomorphization. There’s a pattern as old as time (or at least as old as C++), which is parametrizing a function over the type of the callback it receives. If you’ve ever worked on a C++ codebase, you’ve probably noticed how functions that take callbacks are often generic, with the type of the function callback as a parameter. When the enclosing function is monomorphized, the specific callback for that function invocation is replaced into the IR, and it often becomes trivial to inline – particularly if it’s a pure function (i.e. a callback that does not capture any state). Because of this reliable optimization, the combination of lambdas and templates has become a cornerstone zero-cost abstraction in modern C++. It adds a lot of expressiveness to a language just as orthopedic as Go, enabling iteration and other functional constructs without introducing new language syntax or runtime overhead.

The question is: can we do the same in Go? Can we parametrize a function based on its callback? It turns out we can, although interestingly this is not explained in any of the Generics documentation I’ve found. We can rewrite the signature of our iterator function like this, and it actually compiles and runs:

func ForEachRune[F func(rune)](p []byte, each F) {
	// ...
}

Yes, you can use a func signature as a Generic constraint. Constraints do not necessarily need to be an interface. That is something worth keeping in mind.
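Call sites don’t change at all, since the new type parameter is inferred from the callback literal:

ForEachRune([]byte("héllo"), func(r rune) {
	fmt.Printf("%c ", r)
})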

As for the result of this optimization attempt, I’m not going to include the disassembly here, but if you’ve been following along so far, you’ve probably already guessed that this serves no purpose whatsoever. The shape of the generic function that is instantiated is not specific to our callback. It is a generic shape for a func(rune) callback, which doesn’t allow any kind of inlining. This is another example where more aggressive monomorphization would open up a very interesting optimization opportunity.

So, is that it? Nothing interesting to do with function callbacks? Well, not quite. It turns out that the Go compiler has gotten pretty good with inlining since its 1.0 release. It can do some very powerful stuff these days – when generics don’t get in the way.

Let me show you an example: imagine we’re working on a library to add functional constructs to Go. Why would we do that? I don’t know. Lots of people seem to be doing it. Maybe because it’s trendy. So let’s start with a simple case, a ‘Map’ function that calls a callback on each element of a slice and stores its result in place.

 
 
 
First, a MapInt hard-coded to int slices:

func MapInt(a []int, callback func(int) int) []int {
	for n, elem := range a {
		a[n] = callback(elem)
	}
	return a
}

var input1 = []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

func IntMapTest() {
	_ = MapInt(input1, func(i int) int {
		return i * 2
	})
}

0000 TEXT "".IntMapTest(SB)
0000 MOVQ "".input1(SB), AX
0007 MOVQ "".input1+8(SB), CX
000e XORL DX, DX
0010 JMP 32
0012 MOVQ (AX)(DX*8), BX
0016 SHLQ $1, BX
0019 MOVQ BX, (AX)(DX*8)
001d INCQ DX
0020 CMPQ DX, CX
0023 JLT 18
0025 RET
...

Second, a generic MapAny parametrized only on the element type:

func MapAny[I any](a []I, callback func(I) I) []I {
	for n, elem := range a {
		a[n] = callback(elem)
	}
	return a
}

var input2 = []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

func GenericMapTest() {
	_ = MapAny(input2, func(i int) int {
		return i * 2
	})
}

...
0014 MOVQ "".input2(SB), CX
001b MOVQ CX, "".a.ptr+24(SP)
0020 MOVQ "".input2+8(SB), BX
0027 MOVQ BX, "".a.len+16(SP)
002c XORL AX, AX
002e JMP 102
0030 MOVQ AX, "".n+8(SP)
0035 MOVQ (CX)(AX*8), BX
0039 MOVQ "".GenericMapTest.func1·f(SB), SI
0040 LEAQ "".GenericMapTest.func1·f(SB), DX
0047 MOVQ BX, AX
004a PCDATA $1, $1
004a CALL SI
004c MOVQ "".n+8(SP), CX
0051 MOVQ "".a.ptr+24(SP), BX
0056 MOVQ AX, (BX)(CX*8)
005a LEAQ 1(CX), AX
005e MOVQ BX, CX
0061 MOVQ "".a.len+16(SP), BX
0066 CMPQ AX, BX
0069 JLT 48
...
0000 TEXT "".IntMapTest.func1(SB)
0000 SHLQ $1, AX
0003 RET

And third, the same MapAny parametrized on both the element type and the callback type:

func MapAny[I any, F func(I) I](a []I, callback F) []I {
	for n, elem := range a {
		a[n] = callback(elem)
	}
	return a
}

var input2 = []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

func GenericMapTest() {
	_ = MapAny(input2, func(i int) int {
		return i * 2
	})
}

...
0000 TEXT "".GenericMapTest(SB)
0000 MOVQ "".input2(SB), AX
0007 MOVQ "".input2+8(SB), CX
000e XORL DX, DX
0010 JMP 32
0012 MOVQ (AX)(DX*8), BX
0016 SHLQ $1, BX
0019 MOVQ BX, (AX)(DX*8)
001d INCQ DX
0020 CMPQ DX, CX
0023 JLT 18
0025 RET
...

Before we jump into the Generic map (which is the interesting case), let’s look at a MapInt hardcoded to int slices, to see what the Go compiler can do with this code. It turns out it can do a lot: the assembly for MapInt is looking very good. We can see that there are no CALLs in the main IntMapTest from our example: we go straight from loading the global input1 slice to iterating on it, and the mapping operation (a simple multiplication in this case) is performed inline with a single instruction. The function has been fully flattened, and both MapInt and the anonymous callback inside of IntMapTest are gone from the codegen.

Should we be impressed about this code generation? It’s a very trivial case after all. Maybe “impressed” is not the right word, but if you’ve been paying attention to the evolution of Go performance over the last decade, you should at least be quite excited!

You see, the simple MapInt function from this example is actually a stress test for the inline heuristics in the Go compiler: it is not a leaf function (because it calls another function inside of it) and it contains a for loop with a range. These two details would have made the function impossible to optimize for every single Go release there’s been until now. Mid-stack inlining wasn’t stabilized until Go 1.10, and inlining functions that contain for loops has been an issue for more than 6 years. In fact, Go 1.18 is the very first release where a range loop can be inlined, so MapInt would look wildly different if it were compiled just a couple months ago.

This is some very exciting progress when it comes to code generation in the Go compiler, so let’s keep celebrating by looking at the Generic implementation of this very same function and… Oh. Oh no. It’s gone now. Well that’s a bummer. The body of MapAny, thanks to mid-stack inlining, has been inlined in its parent function. However, the actual callback, which is now behind a generic shape, has been generated as a free-standing function and must be called explicitly on each iteration of the loop.

Let’s not despair: what if we attempt the same pattern we’ve just discussed, parametrizing over the type of the callback? That actually does the trick! We’re back to a fully flattened function, but do note that this ain’t magic. Inlining is, after all, a heuristic, and in this particular example, we’ve tickled the heuristic the right way. Since our MapAny function is simple enough that its whole body can be inlined, all we needed was to add more specificity to the shape of our Generic function. If the callback to our function is not a callback to a generic shape, but a monomorphized instance of our specific callback, that will allow the Go compiler to flatten the whole call. Do you see where I’m going? In this example, inlining the function body is a very special kind of monomorphization. A very aggressive one, because the shape it instantiates is actually a full monomorphization: it can’t be anything else, because the enclosing function is not generic! And when you fully monomorphize code, the Go compiler is capable of performing very interesting optimizations.

To summarize: if you’re writing functional helpers that use callbacks, such as Iterators or Monads, you want to parametrize them on the type of their callbacks. If and only if the helper itself is simple enough to be fully inlined, the extra parametrization will tickle the inliner into fully flattening the call, which is exactly what you want for a functional helper. However, if your helper is not simple enough to be inlined, the parametrization will be pointless. The instantiated generic shape will be too coarse to perform any optimization.

Lastly, let me point out that even though this full monomorphization example is probably not reliable in all cases, it does hint at something very hopeful: that the Go compiler has gotten very good at inlining, and that if it gets to deal with very specific code instantiations, it is capable of generating very good assembly. There’s an ocean of optimization opportunities already implemented in the Go compiler that are just waiting for a little push from the Generics implementation to start shining.

Conclusions

This was a lot of fun! I hope you had a lot of fun too looking at these assemblies with me. Let’s finish this post with a short list of DOs and DON’Ts when it comes to performance and Generics in Go 1.18:

  • DO try to de-duplicate identical methods that take a string and a []byte using a ByteSeq constraint. The generated shape instantiation is very close to manually writing two almost-identical functions.
  • DO use generics in data structures. This is by far their best use case: Generic data structures that were previously implemented using interface{} are complex and un-ergonomic. Removing the type assertions and storing types unboxed in a type-safe way makes these data structures both easier to use and more performant.
  • DO attempt to parametrize functional helpers by their callback types. In some cases, it may allow the Go compiler to flatten them.
  • DO NOT attempt to use Generics to de-virtualize or inline method calls. It doesn’t work because there’s a single shape for all pointer types that can be passed to the generic function; the associated method information lives in a runtime dictionary.
  • DO NOT pass an interface to a generic function, under any circumstances. Because of the way shape instantiation works for interfaces, instead of de-virtualizing, you’re adding another virtualization layer that involves a global hash table lookup for every method call. When dealing with Generics in a performance-sensitive context, use only pointers instead of interfaces.
  • DO NOT rewrite interface-based APIs to use Generics. Given the current constraints of the implementation, any code that currently uses non-empty interfaces will behave more predictably, and will be simpler, if it continues using interfaces. When it comes to method calls, Generics devolve pointers into twice-indirect interfaces, and interfaces into… well, something quite horrifying, if I’m being honest.
  • DO NOT despair and/or weep profusely, as there is no technical limitation in the language design for Go Generics that prevents an (eventual) implementation that uses monomorphization more aggressively to inline or de-virtualize method calls.

Ah well. Overall, this may have been a bit of a disappointment to those who expected to use Generics as a powerful option to optimize Go code, as it is done in other systems languages. We have learned (I hope!) a lot of interesting details about the way the Go compiler deals with Generics. Unfortunately, we have also learned that the implementation shipped in 1.18, more often than not, makes Generic code slower than whatever it was replacing. But as we’ve seen in several examples, it needn’t be this way. Regardless of whether we consider Go as a “systems-oriented” language, it feels like runtime dictionaries were not the right technical implementation choice for a compiled language at all. Despite the low complexity of the Go compiler, it’s clear and measurable that its generated code has been steadily getting better on every release since 1.0, with very few regressions, up until now.

From reading the Risks section in the original proposal for full monomorphization in Go 1.18, it appears the choice of implementing Generics with dictionaries was made because monomorphizing code is slow. But this raises the question: is it, though? How could anybody possibly know that monomorphizing Go code is slow? It has never been done before! In fact, there’s never been any Generic Go code to monomorphize. It feels like a strong guiding factor behind this complex technical choice was a set of potentially misleading assumptions, which we all hold, such as “monomorphizing C++ code is slow”. This, again, raises the question: is it, though? How much of the C++ compilation overhead comes from monomorphization, as opposed to the performance nightmare that is C++ include processing, or the many optimization passes that are applied on top of the monomorphized code? Would the terrible performance characteristics of C++ template instantiation also apply to the Go compiler, with many fewer optimization passes and a clean module system that could prevent a lot of redundant code generation? And what would the actual performance impact be when compiling large Go projects such as Kubernetes or Vitess? Surely the answer will depend on how often and where Generics are used in these codebases. These are all things that we can start measuring now, but that couldn’t be measured earlier. Likewise, we can now measure the performance impact of stenciling + dictionaries in real-world code, like we’re doing in this analysis, and see that we’re paying a hefty performance tax in our programs to speed up the Go compiler.

Considering what we now know, and the limitations that this Generics implementation imposes on its adoption in performance-sensitive code, I can only hope that the choice of using runtime dictionaries to cut down on compilation times will be re-evaluated, and that more aggressive monomorphization will come in future Go releases. Introducing Generics into Go has been a titanic task, and while the design of this ambitious feature has been successful by any measure, the complexity it introduces into the language warrants an equally ambitious implementation. One that can be used in as many contexts as possible, without runtime overhead, and that enables not only parametric polymorphism but deeper optimizations from which a lot of real-world Go applications would benefit.


[1] This is actually not true. The Go compiler has a few transformations that can prevent allocations when converting from []byte to string. Most notably, given var b []byte, you can iterate through the UTF8 codepoints in b with for i, cp := range string(b), and that conversion will not force an allocation. Likewise, you can look up a value in a map where the keys are string using a byte slice: x = m[string(b)] will not allocate. The inverse, m[string(b)] = x, will allocate because the map needs to take ownership of the string key.