Using redundant conditions to unlock indexes in MySQL
By Aaron Francis
When working with MySQL (or any database!), it's essential to understand how indexes work and how they can be used to improve the efficiency of queries. An index is a separate data structure that maintains a copy of part of your data, structured to allow quick data retrieval. Usually, this structure is a B+ Tree. We have an entire post on how indexes work if you want to go into greater detail.
Obfuscated indexes
Creating indexes is only part of the battle. You must also know how to write your queries so that you allow MySQL to use your indexes. One common mistake people make when writing queries is that they obfuscate their indexes. Obfuscating an index simply means that you're hiding the indexed value from MySQL.
Let's say you have a todos table with a created_at column that records a timestamp of when the record was created.
CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
)
In this table, we've added an index to the created_at column to quickly filter by that timestamp. When we query against the created_at column to find records created in the last 24 hours, we see that MySQL is using the index as we'd expect:
EXPLAIN SELECT * FROM todos WHERE created_at > NOW() - INTERVAL 24 HOUR;
-- | id | type  | possible_keys | key        | key_len | ref | rows | filtered | Extra                 |
-- |----|-------|---------------|------------|---------|-----|------|----------|-----------------------|
-- | 1  | range | created_at    | created_at | 4       |     | 1    | 100.00   | Using index condition |
However, if we wrap this column in a function, we're obfuscating the column from MySQL, and it can no longer use the index.
EXPLAIN SELECT * FROM todos WHERE YEAR(created_at) = 2023;
-- | id | type | possible_keys | key | key_len | ref | rows  | filtered | Extra       |
-- |----|------|---------------|-----|---------|-----|-------|----------|-------------|
-- | 1  | ALL  |               |     |         |     | 39746 | 100.00   | Using where |
By wrapping the created_at column in a YEAR function, we're asking MySQL to do an index lookup on YEAR(created_at), which is not an index MySQL maintains. It is only maintaining the created_at index.
In some cases, there are ways around index obfuscation. In this example, we could use a range scan instead of the YEAR function to obtain the same result.
EXPLAIN SELECT * FROM todos
WHERE created_at BETWEEN '2023-01-01 00:00:00' AND '2023-12-31 23:59:59';
-- | id | type  | possible_keys | key        | key_len | ref | rows | filtered | Extra                 |
-- |----|-------|---------------|------------|---------|-----|------|----------|-----------------------|
-- | 1  | range | created_at    | created_at | 4       |     | 1    | 100.00   | Using index condition |
By unwrapping the created_at column and changing the comparison to a range scan, we've unlocked the index and allowed MySQL to use it effectively.
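The same unwrapping idea applies beyond YEAR. The sketch below, written against the todos table above, shows two variations. The first is a half-open range that covers 2023 without spelling out the last second of the year, which also keeps working if the column ever stores fractional seconds. The second moves arithmetic off the column and onto the constant side of the comparison. Actual EXPLAIN output depends on your data and MySQL version, so none is shown here.

```sql
-- Half-open range: everything from the start of 2023 up to,
-- but not including, the start of 2024.
EXPLAIN SELECT * FROM todos
WHERE created_at >= '2023-01-01 00:00:00'
  AND created_at <  '2024-01-01 00:00:00';

-- Obfuscated: the column is buried inside an expression,
-- so MySQL has to evaluate it for every row.
EXPLAIN SELECT * FROM todos
WHERE created_at + INTERVAL 1 DAY > NOW();

-- Unwrapped: same intent, but the column stands alone and the
-- arithmetic is applied to the constant instead.
EXPLAIN SELECT * FROM todos
WHERE created_at > NOW() - INTERVAL 1 DAY;
```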
Unfortunately, it's not always possible to de-obfuscate your indexes. In some scenarios, you simply cannot avoid wrapping the column in a function. In these cases, you might see if there is a redundant condition that could potentially unlock an existing index.
Redundant conditions in MySQL
A redundant condition is a condition that seems superfluous, extra, or not needed. It is a condition that can be added and removed without changing the results that MySQL returns.
Let's take a look at a contrived example to illustrate the point. In this example, we're selecting the todos with an id of less than five.
SELECT * FROM todos WHERE id < 5
In this case, a redundant condition might be id < 10.
SELECT * FROM todos WHERE id < 5 AND id < 10 -- This does... nothing
This is a redundant condition because it does not change the results! Anything with an ID of less than five necessarily has an ID of less than ten also. You can add or remove this condition, and nothing will change. It's also silly to add because it doesn't provide us any benefit.
We're going to expand our todos table definition a little bit to add due_date and due_time columns. (Storing date and time separately is usually not advised, but it helps us prove the point.)
CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `due_date` date NOT NULL,
  `due_time` time NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `due_date` (`due_date`),
  KEY `created_at` (`created_at`)
)
Given this table, if you want to query for todos that are due in the next day, you're stuck using the ADDTIME function:
SELECT * FROM todos
WHERE ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
We do have an index on due_date, but the index cannot be used because we're performing an operation on it (adding the time). Unlike our previous example, there is no easy way to de-obfuscate this column either since the due_time is different for every row.
We can confirm that the index is not being used by running an EXPLAIN on the previous query:
| id | type | possible_keys | key | key_len | ref | rows  | filtered | Extra       |
|----|------|---------------|-----|---------|-----|-------|----------|-------------|
| 1  | ALL  |               |     |         |     | 39746 | 100.00   | Using where |
To work around this, let's add a redundant condition on due_date alone. When adding the condition, we need to make sure that it's logically impossible to change the result set, which means our redundant condition should be broader than our actual condition.
Since we're looking for todos due in the next 24 hours, we can add a condition that looks for todos due today or tomorrow. That will contain the entire subset of todos that we're looking for and a few that we're not.
EXPLAIN SELECT * FROM todos
WHERE
  -- The real condition
  ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
  AND
  -- The redundant condition
  due_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL 1 DAY
The redundant condition here returns a broader subset of todos than we need, but importantly it allows MySQL to use the index. Running an EXPLAIN on this query shows that the due_date index was used:
| id | type  | possible_keys | key      | key_len | ref | rows | filtered | Extra                              |
|----|-------|---------------|----------|---------|-----|------|----------|------------------------------------|
| 1  | range | due_date      | due_date | 3       |     | 1    | 100.00   | Using index condition; Using where |
MySQL will first use the index to eliminate most of the table, then the slower ADDTIME will be used to eliminate the few remaining false positives. The redundant condition is doing its job perfectly!
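Before relying on a redundant condition, it can be worth confirming that it really is redundant for your data. One way to sketch that check is to count the rows that satisfy the real condition but fail the redundant one. If the redundant condition is truly broader, the count is zero. This assumes due_time always holds a time of day between 00:00:00 and 23:59:59. MySQL's TIME type also accepts negative values and values beyond 24 hours, and rows like that could land in the next-24-hours window while their due_date falls outside today or tomorrow.

```sql
-- Rows the redundant condition would wrongly exclude.
-- Expected to be 0 as long as due_time is always a normal time of day.
SELECT COUNT(*) AS wrongly_excluded
FROM todos
WHERE ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
  AND NOT (due_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL 1 DAY);
```

Because the query depends on NOW(), it only checks the window at the moment it runs, so treat it as a spot check rather than a proof.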
Domain-specific redundant conditions
Until now, we've been working with redundant conditions that logically cannot change the result set. These are nice because they are easy to reason about and require no further domain knowledge. There are scenarios where you, as a human, might have more knowledge than the database does. (For now, at least.) In those situations, you might be able to add a redundant condition that is not logically incapable of changing the output, but you know, based on your knowledge, that it won't change the output.
In the case of our todos table, let's add an updated_at column that will be populated with the timestamp of the last time the record was changed.
CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
)
In this scenario, we still only have an index on created_at, but if we want to query against updated_at, we might be able to add a redundant condition based on our knowledge of the application. If, given our understanding of the application, we can be sure that created_at is always equal to or earlier than updated_at, we can use this to our advantage.
This query, which looks for records that were last modified before January 1st of 2023, will scan the entire table because there is no index on updated_at:
SELECT * FROM todos WHERE updated_at < '2023-01-01 00:00:00'
This query will return the same results but uses the created_at index to eliminate records and then filters out the false positives.
SELECT * FROM todos WHERE updated_at < '2023-01-01 00:00:00' AND created_at < '2023-01-01 00:00:00'
The only reason this redundant condition works is because we know that a record cannot be modified before it's created. Depending on your application, you might be able to find more examples of "domain-specific" redundant conditions.
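Because a domain-specific condition rests on an assumption rather than on logic, it helps to test that assumption against the real data. A minimal sketch is to count the rows that violate the rule the redundant condition depends on. Any non-zero result means the extra created_at condition could silently drop rows from the result.

```sql
-- Rows that break the assumption "created_at is never later than updated_at".
-- Rows with a NULL updated_at are not counted here; the original
-- updated_at < '...' filter never matches them either.
SELECT COUNT(*) AS violations
FROM todos
WHERE created_at > updated_at;
```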
When to use a redundant condition
The optimal indexing strategy always depends on the application, but in general, it's best to have indexes on the conditions you are frequently querying against. Redundant conditions are nice because they require no changes to the database! You can modify the query or the application generating the query, and suddenly everything gets faster. This makes them useful for queries that are only sometimes run or where indexes can't be easily added to the main conditions.
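For comparison, if the updated_at query from the previous section turned out to run frequently, the more direct fix would be an index on that column. A sketch, where the index name idx_updated_at is just an illustrative choice:

```sql
-- Index the column the query actually filters on.
ALTER TABLE todos ADD INDEX idx_updated_at (updated_at);

-- With that index available, the redundant created_at condition
-- should no longer be needed for this query.
EXPLAIN SELECT * FROM todos
WHERE updated_at < '2023-01-01 00:00:00';
```

The trade-off is that every additional index has to be maintained on writes, which is part of why a redundant condition can be the lighter option for occasional queries.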
If you'd like to learn more about indexing strategies, we have 17 videos on indexes as a part of our larger course on MySQL for Developers.
If you do end up using the redundant condition strategy, please let us know on Twitter how you did it. We'd love to add more examples to this article!