Anyone who has worked with large databases can testify to how slow queries can get. This is often because a necessary index is missing, or because something in the query stops the database system from using it. Choosing the right indexes, and the right order to fetch data in, can be the difference between a 10ms and a 5s query.
Choosing the indexes and join order is called query planning. The output of this process is a query plan that tells the database system how to answer a query from a user. For simple queries with a single table, it’s often trivial to find the optimal query plan. But for large queries with lots of tables and lots of indexes, the available options can quickly run into the thousands and even millions of alternatives. Most of these alternatives are really slow, so the planner's job is to find the best possible query plan among all possibilities.
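To get a feel for how quickly the number of alternatives grows, consider just the orderings of the tables in a left-deep join plan: with n tables there are n! of them, and bushy plans allow even more. A quick sketch (the numbers below are just this formula evaluated, not measurements from any real planner):

```python
from math import factorial

# Left-deep join plans correspond to permutations of the n tables,
# so there are n! of them. Bushy plans multiply this further.
for n in (3, 5, 8, 10):
    print(f"{n} tables: {factorial(n):,} left-deep join orders")
```

Already at ten tables the planner is choosing among millions of orderings, before even considering index choices for each one.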
Most people are more familiar with compilers than with query planners, so I thought I should compare the work of a query planner with the work of a compiler.
A compiler is a program that takes source code written in a programming language and translates it into machine code that can be executed by a computer's processor. A query planner does something similar. The input is code written in SQL (or some other database query language), and the output is a query plan that describes which indexes will be used, and in which order to access tables.
The typical phases of a compiler/planner are: lexing and parsing, semantic analysis, optimization, and code generation. Let’s look at each of these individually to understand the similarities and the differences between a compiler and a query planner.
In the first phase, lexing and parsing, the source code is analyzed and divided into a sequence of tokens: basic units such as keywords, operators, and identifiers. The parser then checks this token sequence for correctness according to the rules of the programming language. This phase typically involves building a syntax tree, which is a hierarchical representation of the structure of the source code. The output of this step is an abstract syntax tree (AST). There is no interesting difference between a compiler and a planner here.
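A minimal lexer for a SQL fragment can be sketched in a few lines. This is illustrative only; real SQL lexers (including the one in Vitess) handle many more token types:

```python
import re

# Three illustrative token categories: keywords, identifiers, punctuation.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<kw>SELECT|FROM|JOIN|ON)\b"
    r"|(?P<ident>[A-Za-z_]\w*)"
    r"|(?P<punct>[(),.=]))",
    re.IGNORECASE,
)

def tokenize(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        kind = m.lastgroup          # which named group matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

print(tokenize("SELECT name FROM employees"))
# [('kw', 'SELECT'), ('ident', 'name'), ('kw', 'FROM'), ('ident', 'employees')]
```

The parser's job is then to turn this flat token list into a tree.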
As an example, let’s look at the following query:
SELECT name, avg(salary) FROM employees JOIN salary_info ON id = empid
The AST would look something like this:
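Sketched as a nested data structure, it might contain something like the following. The node and field names here are invented for illustration; they are not the actual Vitess AST types:

```python
# A rough sketch of an AST for the query above, using plain dicts.
ast = {
    "Select": {
        "SelectExprs": [
            {"Column": "name"},
            {"FuncCall": {"name": "avg", "args": [{"Column": "salary"}]}},
        ],
        "From": {
            "Join": {
                "Left": {"Table": "employees"},
                "Right": {"Table": "salary_info"},
                "On": {"Comparison": {"op": "=",
                                      "left": {"Column": "id"},
                                      "right": {"Column": "empid"}}},
            }
        },
    }
}
print(ast["Select"]["From"]["Join"]["Left"])   # {'Table': 'employees'}
```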
It’s the same query, but instead of a string, it’s now this tree data structure. All the unnecessary parts have been stripped away — the planner doesn’t care whether the user wrote “SELECT” or “select”, or how much whitespace the query contained.
The semantic analysis phase of compilation is where the compiler checks for semantic errors in the input source code. Semantic errors are errors that are not detected during the lexical analysis or syntax analysis phases, but which can only be detected by analyzing the meaning of the source code.
During semantic analysis, the compiler performs a variety of checks to ensure that the source code is semantically correct. For example, the compiler may check for type mismatches, in which a value of one type is used in a context where a value of a different type is expected. The compiler may also check for undefined variables or other entities, such as functions or classes, and may perform additional checks and transformations on the syntax tree generated during syntax analysis.
A query planner does almost exactly the same thing here. Instead of searching for classes and methods, it would bind to tables and columns, but the idea is the same.
After semantic analysis, the data structures representing the query will be enriched with information such as which table a column comes from, what types the columns and expressions in the query have, etc.
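Column binding can be sketched like this. The catalog below is a stand-in for real schema metadata; the table, column, and type names are the ones from the earlier example, but the code itself is hypothetical:

```python
# Invented schema metadata for the example query.
catalog = {
    "employees":   {"id": "int", "name": "varchar"},
    "salary_info": {"empid": "int", "salary": "decimal"},
}

def bind_column(column, tables):
    """Resolve which table a bare column name belongs to, and its type."""
    matches = [(t, catalog[t][column]) for t in tables if column in catalog[t]]
    if not matches:
        raise ValueError(f"unknown column: {column}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous column: {column}")
    table, col_type = matches[0]
    return {"table": table, "column": column, "type": col_type}

print(bind_column("salary", ["employees", "salary_info"]))
# {'table': 'salary_info', 'column': 'salary', 'type': 'decimal'}
```

Note that this is also where errors like “unknown column” and “column is ambiguous” naturally fall out of the analysis.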
During the optimization phase, the compiler takes all the information gathered during parsing and semantic analysis and iteratively transforms it into a more efficient form. This is often done using an intermediate representation. Instead of staying in a shape that is close to the input language, the intermediate representation is custom made to make optimizations easier and faster to do.
In this step, the query planner uses a variety of algorithms and techniques to determine the most efficient way to execute the query, considering factors such as the available indexes, the data distribution, and the overall structure of the database. This may involve selecting the most efficient algorithms for operations such as joins and sorting, and choosing the most appropriate indexes to use. It usually also does some of the optimizations that a compiler would perform, such as constant folding. These types of optimizations are about rewriting the input into an equivalent form that is easier for the planner to optimize.
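Constant folding, mentioned above, can be sketched on a tiny expression tree. The tuple representation here is invented for illustration:

```python
# Expression nodes are tuples: (operator, left, right). Anything that
# is not a tuple is treated as a leaf (a literal or a column reference).
def fold(expr):
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr[0], fold(expr[1]), fold(expr[2])
    if (op in ("+", "*")
            and isinstance(left, (int, float))
            and isinstance(right, (int, float))):
        return left + right if op == "+" else left * right
    return (op, left, right)

# WHERE price > 2 + 3 * 4  becomes  WHERE price > 14
print(fold((">", "price", ("+", 2, ("*", 3, 4)))))
# ('>', 'price', 14)
```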
An example of this is how the Vitess planner massages predicates into a shape that can be solved using an index. Given a predicate such as:
WHERE (id = 5 AND name = 'Toto') OR (id = 5 AND name = 'Mumin')
The OR in the middle here makes it hard for the planner to use an index on id to find the correct row. The optimizer will rewrite the predicate into something that is easier to optimize but still means the same thing.
WHERE id = 5 AND (name = 'Toto' OR name = 'Mumin')
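The rewrite is just the distributive law applied in reverse: (A AND B) OR (A AND C) becomes A AND (B OR C). A sketch of this specific two-branch case, with predicates modeled as nested tuples (this is not how Vitess represents predicates internally):

```python
# Factor a common conjunct out of an OR of two ANDs:
# (A AND B) OR (A AND C)  ->  A AND (B OR C)
def factor_common_conjunct(pred):
    if pred[0] != "OR":
        return pred
    _, (op_l, a1, b), (op_r, a2, c) = pred
    if op_l == op_r == "AND" and a1 == a2:
        return ("AND", a1, ("OR", b, c))
    return pred

pred = ("OR",
        ("AND", ("=", "id", 5), ("=", "name", "Toto")),
        ("AND", ("=", "id", 5), ("=", "name", "Mumin")))
print(factor_common_conjunct(pred))
# ('AND', ('=', 'id', 5), ('OR', ('=', 'name', 'Toto'), ('=', 'name', 'Mumin')))
```

After the rewrite, the `id = 5` conjunct sits at the top level, where an index on id can be used directly.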
Let us pause here and talk about why the order of table access is so important. Say we want to join three tables: A with B, and B with C. We could start by joining A with B, and then join the output of that with C. Or we could start from the other side: join B with C and then join that result with A. The size of the intermediate result is where the big difference comes in. If AxB is very large, joining it with C will be very slow, compared to starting with a BxC that happens to be pretty small. It’s a path-finding problem.
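A toy cost model makes the difference concrete. The row counts and join selectivities below are invented; real planners estimate them from table statistics:

```python
# Invented table sizes and join selectivities (fraction of row pairs
# that satisfy the join condition).
rows = {"A": 10_000, "B": 10_000, "C": 100}
sel  = {("A", "B"): 0.01, ("B", "C"): 0.0001}

def join_size(left_rows, right_rows, selectivity):
    return left_rows * right_rows * selectivity

# Plan 1: (A x B) x C — the intermediate AxB result is huge.
ab = join_size(rows["A"], rows["B"], sel[("A", "B")])
# Plan 2: (B x C) x A — the intermediate BxC result is tiny.
bc = join_size(rows["B"], rows["C"], sel[("B", "C")])

print(f"AxB intermediate: {ab:,.0f} rows, BxC intermediate: {bc:,.0f} rows")
# AxB intermediate: 1,000,000 rows, BxC intermediate: 100 rows
```

Both plans produce the same final answer, but one of them drags a million-row intermediate result through the rest of the join.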
Here is a diagram of the tables used in the TPC-H query #8. TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. It’s a well-known dataset used to test the strength of database systems, in particular the query planner.
All the tables need to be visited, and the connections between tables have different costs. The planner can start at any (*) node. What is the path that touches all tables with the least cost? This is why the join order is so important.
In Vitess, our query plans are partly executed on the SQL proxy layer, called VTGate, and partly on the individual shards. Probably the most important optimization we perform is to push down as much work as possible to MySQL. If we can perform a join or a filter in MySQL, that is always going to be faster than fetching all the individual rows and performing the same operation on the VTGate side. So, during query planning, we are searching for the query plan that has the least number of network calls.
When planning aggregations, our strategy is to do as much aggregation as possible in MySQL, and then aggregate the aggregates. The planner rewrites the aggregation that the user asked for into smaller aggregations and sends those to MySQL. The results of these queries are then used as inputs and summarized into the final aggregation result. You can read more about grouping and aggregations in an earlier blog post.
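For AVG, “aggregating the aggregates” means each shard returns SUM and COUNT rather than its own average, since an average of averages is wrong when the shards hold different numbers of rows. A sketch, with made-up per-shard numbers:

```python
# Hypothetical partial aggregates returned by two shards for
# SELECT avg(salary) FROM ... — each shard sends SUM and COUNT.
shard_results = [
    {"sum": 300.0, "count": 2},
    {"sum": 100.0, "count": 2},
]

total_sum = sum(r["sum"] for r in shard_results)
total_count = sum(r["count"] for r in shard_results)
print("avg =", total_sum / total_count)   # avg = 100.0

# Averaging the per-shard averages would give (150 + 50) / 2 = 100 here
# only by coincidence of equal counts; with skewed shards it is wrong.
```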
In the code generation phase, the compiler generates machine code based on the input source code and the analysis performed in previous phases. This machine code can then be executed by the computer's processor.
The query planner generates a plan that specifies the exact steps that the database engine should take to execute the query. This plan may include operations such as index scans, join algorithms, and sorting algorithms, as well as other details such as the order in which the operations should be performed.
Query planners are an essential component of database management systems, and the work that query planner developers do plays a crucial role for database systems. The field of query planning is an active area of research, with new algorithms and techniques being developed all the time. A good query planner can have a direct impact on the performance and efficiency of databases, which can have real-world benefits for the organizations and users that rely on those databases.