Profiling memory usage in MySQL

By Ben Dicken | April 11, 2024

When considering the performance of any software, there's a classic trade-off between time and space. In the process of assessing the performance of a MySQL query, we often focus on execution time (or query latency) and use it as the primary metric for query performance. This is a good metric to use, as ultimately, we want to get query results as quickly as possible.

I recently released a blog post about how to identify and profile problematic MySQL queries, with a discussion centered around measuring poor performance in terms of execution time and row reads. However, in this discussion, memory consumption was largely ignored.

Though it may not be needed as often, MySQL also has built-in mechanisms for gaining a deep understanding of both how much memory a query is using and also what that memory is being used for. Let's take a deep dive through this functionality and see how we can perform live monitoring of the memory usage of a MySQL connection.

Memory statistics

In MySQL, there are many components of the system that can be individually instrumented. The performance_schema.setup_instruments table lists each of these components, and there are quite a few:

SELECT count(*) FROM performance_schema.setup_instruments;

+----------+
| count(*) |
+----------+
| 1255     |
+----------+

Included in this table are a number of instruments that can be used for memory profiling. To see what is available, try selecting from the table and filtering by memory/.

SELECT name, documentation
  FROM performance_schema.setup_instruments
  WHERE name LIKE 'memory/%';

You should see several-hundred results. Each of these represent a different category of memory that can be individually instrumented in MySQL. Some of these categories contain a short bit of documentation describing what this memory category represents or is used for. If you'd like to see only memory types that have a non-null documentation value, you can run:

SELECT name, documentation
  FROM performance_schema.setup_instruments
  WHERE name LIKE 'memory/%'
  AND documentation IS NOT NULL;

Each of these memory categories can be sampled at several different granularities. The various levels of granularity are stored across several tables:

SELECT table_name
  FROM information_schema.tables
  WHERE table_name LIKE '%memory_summary%'
  AND table_schema = 'performance_schema';

+-----------------------------------------+
| TABLE_NAME                              |
+-----------------------------------------+
| memory_summary_by_account_by_event_name |
| memory_summary_by_host_by_event_name    |
| memory_summary_by_thread_by_event_name  |
| memory_summary_by_user_by_event_name    |
| memory_summary_global_by_event_name     |
+-----------------------------------------+

memory_summary_by_account_by_event_name Summarizes memory events based on accounts (An account is a combination of a user and host)
memory_summary_by_host_by_event_name Summarizes memory events at a host granularity
memory_summary_by_thread_by_event_name Summarizes memory events at a MySQL thread granularity
memory_summary_by_user_by_event_name Summarizes memory events at a user granularity
memory_summary_global_by_event_name A global summary of memory statistics

Notice that there is no specific tracking for memory usage at a per-query level. However, this does not mean we cannot profile the memory usage of a query! To accomplish this, we can monitor the usage of memory on whatever connection the query of interest is being executed on. Because of this, we'll focus our use on the memory_summary_by_thread_by_event_name table, as there is a convenient mapping between a MySQL connection and a thread.

Finding usage for a connection

At this point, you should set up two separate connections to your MySQL server on the command line. The first is the one that will execute the query you want to monitor memory usage for. The second will be used for monitoring purposes.

On the first connection, run these queries to get your connection ID and thread ID.

SET @cid = (SELECT CONNECTION_ID());
SET @tid = (SELECT thread_id
    FROM performance_schema.threads
    WHERE PROCESSLIST_ID=@cid);

Then grab these values. Of course, yours will likely look different than what you see here.

SELECT @cid, @tid;

+------+------+
| @cid | @tid |
+------+------+
|   49 |   89 |
+------+------+

Next up, execute some long-running query you'd like to profile the memory usage for. For this example, I'll do a large SELECT from a table that has 100 million rows in it, which should take awhile since there is no index on the alias column:

SELECT alias FROM chat.message ORDER BY alias DESC LIMIT 100000;

Now, while this is executing, switch over to your other console connection and run the following, replacing the thread ID with the one from your connection:

SELECT
    event_name,
    current_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE thread_id = YOUR_THREAD_ID
ORDER BY current_number_of_bytes_used DESC

You should see results along the lines of this, though the details will depend highly on your query and data:

+---------------------------------------+------------------------------+
| event_name                            | current_number_of_bytes_used |
+---------------------------------------+------------------------------+
| memory/sql/Filesort_buffer::sort_keys | 203488                       |
| memory/innodb/memory                  | 169800                       |
| memory/sql/THD::main_mem_root         | 46176                        |
| memory/innodb/ha_innodb               | 35936                        |
...

This indicates the amount of memory for each category being used at the exact moment this query was executed. If you run this query several times while the other SELECT alias... query is executing, you may see differences in the results, as memory usage for a query is not necessarily constant over its whole execution. Each execution of this query represents a sample at a moment of time. Thus, if we want to see how the usage changes over time, we'll need to take many samples.

The documentation for memory/sql/Filesort_buffer::sort_keys is missing from the performance_schema.setup_instruments table.

SELECT name, documentation
    FROM performance_schema.setup_instruments
    WHERE name LIKE 'memory%sort_keys';

+---------------------------------------+---------------+
| name                                  | documentation |
+---------------------------------------+---------------+
| memory/sql/Filesort_buffer::sort_keys | <null>        |
+---------------------------------------+---------------+

However, the name indicates that it is memory being used for sorting data from a file. This makes sense as a large part of the expense of this query would be sorting the data so that is can be displayed in descending order.

Collecting usage over time

As a next step, we need to be able to sample this memory usage over time. For short queries this will not be as useful, as we'll only be able to execute this query once, or a small number of times while the profiled query is executing. This will be more useful for longer-running queries, ones that take multiple seconds or minutes. These, would be the types of queries we'd want to profile anyways, as these are the ones likely to use a large portion of memory.

This could be implemented fully in SQL and invoked via a stored procedure. However, in this case, let's use a separate script in Python to provide monitoring.

#!/usr/bin/env python3

import time
import MySQLdb
import argparse

MEM_QUERY='''
SELECT event_name, current_number_of_bytes_used
  FROM performance_schema.memory_summary_by_thread_by_event_name
  WHERE thread_id = %s
  ORDER BY current_number_of_bytes_used DESC LIMIT 4
'''

parser = argparse.ArgumentParser()
parser.add_argument('--thread-id', type=int, required=True)
args = parser.parse_args()

dbc = MySQLdb.connect(host='127.0.0.1', user='root', password='password')
c = dbc.cursor()

ms = 0
while(True):
    c.execute(MEM_QUERY, (args.thread_id,))
    results = c.fetchall()
    print(f'\n## Memory usage at time {ms} ##')
    for r in results:
        print(f'{r[0][7:]} -> {round(r[1]/1024,2)}Kb')
    ms+=250
    time.sleep(0.25)

This is a simple, first stab at such a monitoring script. In summary, this code does the following:

Get the provided thread ID to monitor via command line
Set up a connection to a MySQL database
Every 250 milliseconds, execute a query to get the top 4 used memory categories and print a readout

This could be adjusted in many ways depending on your profiling needs. For example, tweaking the frequency of the ping to the server or changing how many memory categories are listed per iteration. Running this while a query is executing provides results like this:

...
## Memory usage at time 4250 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 4500 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 4750 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 5000 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb
...

This is great, but there's a few weaknesses. It would be nice to see more than the top 4 memory usage categories, but increasing that numbers increases the size of this already-large output dump. It would also be nice to have an easier way to get a picture of the memory usage at-a-glance via some visualizations. This could be done by having the script dump the results to a CSV or JSON, and then loading them up later in a visualization tool. Even better, we could plot the results we are getting live, as the data is streaming in. This provides a more up-to-date view, and allows us to observe the memory usage live as it is happening, all in one tool.

Plotting memory usage

In order make this tool even more useful and provide visualizations, a few changes are going to be made.

The user will provide connection ID on the command line, and the script will be responsible for finding the underlying thread.
The frequency at which the script requests memory data will be configurable, also via the command line.
The matplotlib library will be used to generate a visualization of the memory usage. This will consist of a stack plot with a legend showing the top memory usage categories, and will retain the past 50 samples.

It's quite a bit of code, but is included here for the sake of completeness.

#!/usr/bin/env python3

import matplotlib.pyplot as plt
import numpy as np
import MySQLdb
import argparse

MEM_QUERY='''
SELECT event_name, current_number_of_bytes_used
  FROM performance_schema.memory_summary_by_thread_by_event_name
  WHERE thread_id = %s
  ORDER BY event_name DESC'''

TID_QUERY='''
SELECT  thread_id
  FROM performance_schema.threads
  WHERE PROCESSLIST_ID=%s'''

class MemoryProfiler:

    def __init__(self):
        self.x = []
        self.y = []
        self.mem_labels = ['XXXXXXXXXXXXXXXXXXXXXXX']
        self.ms = 0
        self.color_sequence = ['#ffc59b', '#d4c9fe', '#a9dffe', '#a9ecb8',
                               '#fff1a8', '#fbbfc7', '#fd812d', '#a18bf5',
                               '#47b7f8', '#40d763', '#f2b600', '#ff7082']
        plt.rcParams['axes.xmargin'] = 0
        plt.rcParams['axes.ymargin'] = 0
        plt.rcParams["font.family"] = "inter"

    def update_xy_axis(self, results, frequency):
        self.ms += frequency
        self.x.append(self.ms)
        if (len(self.y) == 0):
            self.y = [[] for x in range(len(results))]
        for i in range(len(results)-1, -1, -1):
            usage = float(results[i][1]) / 1024
            self.y[i].append(usage)
        if (len(self.x) > 50):
            self.x.pop(0)
            for i in range(len(self.y)):
                self.y[i].pop(0)

    def update_labels(self, results):
        total_mem = sum(map(lambda e: e[1], results))
        self.mem_labels.clear()
        for i in range(len(results)-1, -1, -1):
            usage = float(results[i][1]) / 1024
            mem_type = results[i][0]
            # Remove 'memory/' from beginning of name for brevity
            mem_type = mem_type[7:]
            # Only show top memory users in legend
            if (usage < total_mem / 1024 / 50):
                mem_type = '_' + mem_type
            self.mem_labels.insert(0, mem_type)

    def draw_plot(self, plt):
        plt.clf()
        plt.stackplot(self.x, self.y, colors = self.color_sequence)
        plt.legend(labels=self.mem_labels, bbox_to_anchor=(1.04, 1), loc="upper left", borderaxespad=0)
        plt.xlabel("milliseconds since monitor began")
        plt.ylabel("Kilobytes of memory")

    def configure_plot(self, plt):
        plt.ion()
        fig = plt.figure(figsize=(12,5))
        plt.stackplot(self.x, self.y, colors=self.color_sequence)
        plt.legend(labels=self.mem_labels, bbox_to_anchor=(1.04, 1), loc="upper left", borderaxespad=0)
        plt.tight_layout(pad=4)
        return fig

    def start_visualization(self, database_connection, connection_id, frequency):
        c = database_connection.cursor();
        fig = self.configure_plot(plt)
        while(True):
            c.execute(MEM_QUERY, (connection_id,))
            results = c.fetchall()
            self.update_xy_axis(results, frequency)
            self.update_labels(results)
            self.draw_plot(plt)
            fig.canvas.draw_idle()
            fig.canvas.start_event_loop(frequency / 1000)

def get_command_line_args():
    '''
    Process arguments and return argparse object to caller.
    '''
    parser = argparse.ArgumentParser(description='Monitor MySQL query memory for a particular connection.')
    parser.add_argument('--connection-id', type=int, required=True,
                        help='The MySQL connection to monitor memory usage of')
    parser.add_argument('--frequency', type=float, default=500,
                        help='The frequency at which to ping for memory usage update in milliseconds')
    return parser.parse_args()

def get_thread_for_connection_id(database_connection, cid):
    '''
    Get a thread ID corresponding to the connection ID
    PARAMS
      database_connection - Database connection object
      cid - The connection ID to find the thread for
    '''
    c = database_connection.cursor()
    c.execute(TID_QUERY, (cid,))
    result = c.fetchone()
    return int(result[0])

def main():
    args = get_command_line_args()
    database_connection = MySQLdb.connect(host='127.0.0.1', user='root', password='password')
    connection_id = get_thread_for_connection_id(database_connection, args.connection_id)
    m = MemoryProfiler()
    m.start_visualization(database_connection, connection_id, args.frequency)
    connection.close()

if __name__ == "__main__":
    main()

With this, we can do detailed monitoring of executing MySQL queries. To use it, first get the connection ID for the connection you want to profile:

SELECT CONNECTION_ID();

Then, executing the following will begin a monitoring session:

./monitor.py --connection-id YOUR_CONNECTION_ID --frequency 250

When executing a query on the database, we can observe the increase in memory usage, and see what categories of memory are the largest contributors.

Memory profile visualization example

This visualization can also help us to clearly see what kinds of operations are memory hogs. For example, here is a snippet of a memory profile for creating a FULLTEXT index on a large table:

Memory profile visualization example

The memory usage is significant, and continues to grow into using hundreds of megabytes as it executes.

Note

For another example of how you can use MySQL to profile memory usage, see check out this DBAMA presentation and the corresponding GitHub repository.

Conclusion

Though it may not be needed as often, having the ability to get detailed memory usage information can be extremely valuable when the need for detailed query optimization arises. Doing this can reveal when and why MySQL may be cause memory pressure on the system, or if a memory upgrade for your database server may be needed. MySQL provides a number of primitives that you can build upon to develop profiling tooling for your queries and workload.