Indexing JSON in MySQL
By Aaron Francis |
MySQL gave us the JSON data type back in mid-2015 with the release of MySQL 5.7.8. Since then, it has been used as a way to escape rigid column definitions and store JSON documents of all shapes and sizes: audit logs, configuration settings, 3rd party payloads, user-defined fields, and more.
Although MySQL gives us functions for reading and writing JSON data, you’ll quickly discover something that is conspicuously missing: the ability to directly index your JSON columns.
In other databases, the best way to directly index a JSON column is usually through a type of index known as a Generalized Inverted Index, or GIN for short. Since MySQL doesn’t offer GIN indexes, we’re unable to directly index an entire stored JSON document. All is not lost though, because MySQL does give us a way to indirectly index parts of our stored JSON documents.
Depending on the version of MySQL that you're using, you have two options for indexing JSON. In MySQL 5.7 you would have to create an intermediate generated column, but starting in MySQL 8.0.13, you can create a functional index directly.
Let’s start with a example table used for logging various actions taken in an application.
CREATE TABLE `activity_log` ( `id` int(11) NOT NULL AUTO_INCREMENT, `properties` json NOT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`id`) )
Into that table we’ll insert JSON documents that have this shape:
{ "uuid": "e7af5df8-f477-4b9b-b074-ad72fe17f502", "request": { "email": "little.bobby@tables.com", "firstName": "Little", "formType": "vehicle-inquiry", "lastName": "Bobby", "message": "Hello, can you tell me what the specs are for this vehicle?", "postcode": "75016", "townCity": "Dallas" } }
In our example, we’ll be indexing the email
key inside the request
object. This will allow our (fictional) users to quickly find forms submitted by specific people.
Let’s take a look at our first option for indexing: generated columns.
Indexing JSON via a generated column
A generated column can be thought of as a calculated, computed, or derived column. It is a column whose value is the result of an expression, rather than direct data input. The expression can contain literal values, built-in functions, or references to other columns. The result of the expression must be scalar and deterministic.
Since we’re trying to index the request.email
field in the properties
column, our generated column will use the JSON unquoting extraction operator to pluck the value out.
To verify that we’ve formed our expression correctly, we’ll first run a SELECT
statement and inspect the results.
mysql> SELECT properties->>"$.request.email" FROM activity_log; +--------------------------------+ | properties->>"$.request.email" | +--------------------------------+ | little.bobby@tables.com | +--------------------------------+
The ->>
operator is a shorthand, unquoting extraction operator, making it equivalent to JSON_UNQUOTE(JSON_EXTRACT(column, path))
. We could have written the previous SELECT
statement using the longhand and gotten the same result.
mysql> SELECT JSON_UNQUOTE(JSON_EXTRACT(properties, "$.request.email")) -> FROM activity_log; +-----------------------------------------------------------+ | JSON_UNQUOTE(JSON_EXTRACT(properties, "$.request.email")) | +-----------------------------------------------------------+ | little.bobby@tables.com | +-----------------------------------------------------------+
Which method you choose is a matter of personal preference!
Now that we’ve confirmed our expression is valid and accurate, let’s use it to create a generated column.
ALTER TABLE activity_log ADD COLUMN email VARCHAR(255) GENERATED ALWAYS as (properties->>"$.request.email");
The first part of the ALTER
statement should look very familiar, we’re adding a column named email
and defining it as a VARCHAR(255)
. In the latter half of the statement we declare that the column is generated and that it should always be equal to the result of the expression properties->>"$.request.email"
.
We can confirm our column has been added by selecting it as we would any other column.
mysql> SELECT id, email FROM activity_log; +----+-------------------------+ | id | email | +----+-------------------------+ | 1 | little.bobby@tables.com | +----+-------------------------+
You’ll see that MySQL is now maintaining this column for us. If we were to update the JSON value, the generated column value would change as well.
Now that we have our generated column in place, we can add an index to it like we would any other column.
ALTER TABLE activity_log ADD INDEX email (email) USING BTREE;
That’s it! You’ve now indexed the request.email
key in your JSON properties
column. Let’s verify that MySQL would use the index to speed up queries that are filtering on email.
mysql> EXPLAIN SELECT * FROM activity_log WHERE email = 'little.bobby@tables.com'; *************************** 1. row *************************** id: 1 select_type: SIMPLE table: activity_log partitions: NULL type: ref possible_keys: email key: email key_len: 768 ref: const rows: 1 filtered: 100.00 Extra: NULL
MySQL reports that it plans to use the email
index to satisfy this query.
Generated column indexes and the optimizer
MySQL's optimizer is a powerful and mysterious entity. When we give MySQL a command, we’re telling it what we want, not how to get it. Often times MySQL will take our query and rewrite it slightly, which is a good thing! Tens of thousands of hours across dozens of years have gone into making the optimizer effective and efficient.
When it comes to indexes on generated columns, the optimizer can "see through" different access patterns to ensure the underlying index is being used.
We defined an index on email
, which is a generated column based on the expression properties->>"$.request.email"
. We’ve already proven that the index is used when we query against the email
column. What’s more interesting is that the optimizer is smart enough to help us out if we forget to query against the named email
column!
In the following query, we don’t access the generated column by name, but instead use the shorthand JSON extraction operator. (Some rows omitted from the EXPLAIN
statement for brevity.)
mysql> EXPLAIN SELECT * FROM activity_log -> WHERE properties->>"$.request.email" = 'little.bobby@tables.com'; *************************** 1. row *************************** id: 1 possible_keys: email key: email key_len: 768 [...]: [...]
Even though we didn’t explicitly address the column by name, the optimizer understands that there is an index on a generated column based on that expression and opts to use the index. Thanks optimizer!
We can confirm this is the case for the longhand as well.
mysql> EXPLAIN SELECT * from activity_log WHERE -> JSON_UNQUOTE( -> JSON_EXTRACT(properties, "$.request.email") -> ) = 'little.bobby@tables.com'; *************************** 1. row *************************** id: 1 possible_keys: email key: email key_len: 768 [...]: [...]
Again, the optimizer "reads through" our expression and uses the email index.
Not convinced? Let’s take a peek at what the optimizer is doing by running a SHOW WARNINGS
after our previous EXPLAIN
statement to see the rewritten query.
mysql> SHOW WARNINGS; *************************** 1. row *************************** Level: Note Code: 1003 Message: /* select#1 */ select `activity_log`.`id` AS `id`,`activity_log`.`properties` AS `properties`,`activity_log`.`created_at` AS `created_at`,`activity_log`.`email` AS `email` from `activity_log` where (`activity_log`.`email` = 'little.bobby@tables.com')
If you look closely, you’ll see that the optimizer has rewritten our query and changed the equality comparison to reference the indexed column. This is especially useful if you're unable to control the access pattern because the query is being issued from a 3rd party package in your codebase, or you're unable to change this part of your code for some other reason.
If the underlying expression doesn’t match very closely then the optimizer will not be able to use the index, so be sure to take care when creating your generated column. The MySQL documentation explains the optimizer's use of generated column indexes in further detail.
Functional indexes
Beginning with MySQL 8.0.13, you're able to skip the intermediate step of creating a generated column and create what is called a "functional index." The MySQL documentation calls these functional key parts.
A functional index is an index on an expression rather than a column. Sounds a lot like a generated column, doesn’t it? There’s a reason it sounds similar, and that’s because a functional index is implemented using a hidden generated column! We no longer have to create the generated column, but a generated column is still being created.
There are a few gotchas with functional indexes though, especially when it comes to using them for JSON.
It would be nice to create our JSON index like this:
ALTER TABLE activity_log ADD INDEX email ((properties->>"$.request.email")) USING BTREE;
But if you do try that, you get a nasty error:
Query 1 ERROR: Cannot create a functional index on an expression that returns a BLOB or TEXT. Please consider using CAST.
So what’s going on here? In our earlier examples, we were the ones in charge of creating the generated column and we declared it as a VARCHAR(255)
, which is easily indexable by MySQL.
However, when we use a functional index, MySQL is going to create that column for us based on the data type that it infers. JSON_UNQUOTE
returns a LONGTEXT
value, which is not able to be indexed.
Fortunately, the error message points us in the right direction: we need to cast our value to a type that is not LONGTEXT
. Casting using the CHAR
function tells MySQL to infer a VARCHAR
data type.
ALTER TABLE activity_log ADD INDEX email ((CAST(properties->>"$.request.email" as CHAR(255)))) USING BTREE;
Now that we’ve added the index, we’ll see if it works by running an EXPLAIN
.
mysql> EXPLAIN SELECT * FROM activity_log -> WHERE properties->>"$.request.email" = 'little.bobby@tables.com'; *************************** 1. row *************************** id: 1 select_type: SIMPLE table: activity_log partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using where
Unfortunately, our index isn’t being considered at all, so we’re not out of the woods yet.
Unless otherwise specified, casting a value to a string sets the collation to utf8mb4_0900_ai_ci
. The JSON extraction functions, on the other hand, return a string with a utf8mb4_bin
collation. Therein lies our problem! Because the collation is mismatched between the query's expression and the stored index, our new functional index isn’t being used.
The final step is to explicitly set the collation of the cast to utf8mb4_bin
.
ALTER TABLE activity_log ADD INDEX email (( CAST(properties->>"$.request.email" as CHAR(255)) COLLATE utf8mb4_bin )) USING BTREE;
Rerunning the previous EXPLAIN
, we can see that we’re finally in a position to use the functional index.
mysql> EXPLAIN SELECT * FROM activity_log -> WHERE properties->>"$.request.email" = 'little.bobby@tables.com'; *************************** 1. row *************************** id: 1 possible_keys: email key: email key_len: 1023 [...]: [...]
Clearly functional indexes come with a few pitfalls, some of which are explicit and easy to debug, and some that require a little bit more digging into the documentation.
Remember that functional indexes use hidden generated columns under the hood. If you prefer to take control of the generated column yourself (even in MySQL 8.0.13 and later) that’s a perfectly reasonable approach!
While direct JSON indexing may not be available in MySQL, indirect indexing of specific keys can cover a majority of use cases.
Don’t just stop with JSON, either! You can use generated columns and functional indexes across all types of common, hard to index patterns.
Go forth and index with confidence.