Profiling memory usage in MySQL
By Benjamin Dicken
When considering the performance of any software, there's a classic trade-off between time and space. In the process of assessing the performance of a MySQL query, we often focus on execution time (or query latency) and use it as the primary metric for query performance. This is a good metric to use, as ultimately, we want to get query results as quickly as possible.
I recently released a blog post about how to identify and profile problematic MySQL queries, with a discussion centered around measuring poor performance in terms of execution time and row reads. However, in this discussion, memory consumption was largely ignored.
Though it may not be needed as often, MySQL also has built-in mechanisms for gaining a deep understanding of both how much memory a query is using and also what that memory is being used for. Let's take a deep dive through this functionality and see how we can perform live monitoring of the memory usage of a MySQL connection.
Memory statistics
In MySQL, there are many components of the system that can be individually instrumented. The performance_schema.setup_instruments
table lists each of these components, and there are quite a few:
SELECT count(*) FROM performance_schema.setup_instruments;
+----------+
| count(*) |
+----------+
|     1255 |
+----------+
Included in this table are a number of instruments that can be used for memory profiling. To see what is available, try selecting from the table and filtering on names beginning with memory/:
SELECT name, documentation
FROM performance_schema.setup_instruments
WHERE name LIKE 'memory/%';
You should see several hundred results. Each of these represents a different category of memory that can be individually instrumented in MySQL. Some of these categories include a short documentation string describing what that memory category represents or is used for. If you'd like to see only the memory types that have a non-null documentation value, you can run:
SELECT name, documentation
FROM performance_schema.setup_instruments
WHERE name LIKE 'memory/%'
  AND documentation IS NOT NULL;
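Note that an instrument only collects data while it is enabled. On MySQL 8.0, the memory instruments are enabled by default, but if you find some disabled on your server, you can turn them on at runtime (with the caveat that this only counts allocations made after the change):

UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES'
WHERE NAME LIKE 'memory/%';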
Each of these memory categories can be sampled at several different granularities. The various levels of granularity are stored across several tables:
SELECT table_name
FROM information_schema.tables
WHERE table_name LIKE '%memory_summary%'
  AND table_schema = 'performance_schema';
+------------------------------------------+
| TABLE_NAME                               |
+------------------------------------------+
| memory_summary_by_account_by_event_name  |
| memory_summary_by_host_by_event_name     |
| memory_summary_by_thread_by_event_name   |
| memory_summary_by_user_by_event_name     |
| memory_summary_global_by_event_name      |
+------------------------------------------+
- memory_summary_by_account_by_event_name: Summarizes memory events per account (an account is a combination of a user and a host)
- memory_summary_by_host_by_event_name: Summarizes memory events at a host granularity
- memory_summary_by_thread_by_event_name: Summarizes memory events at a MySQL thread granularity
- memory_summary_by_user_by_event_name: Summarizes memory events at a user granularity
- memory_summary_global_by_event_name: A global summary of memory statistics (see the example query below)
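Before drilling down to a single connection, it can be useful to get a server-wide view. For example, a query along these lines lists the top five memory consumers across the entire server:

SELECT event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY current_number_of_bytes_used DESC
LIMIT 5;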
Notice that there is no specific tracking of memory usage at the per-query level. However, this does not mean we cannot profile the memory usage of a query! To accomplish this, we can monitor memory usage on whatever connection the query of interest is being executed on. Because of this, we'll focus on the memory_summary_by_thread_by_event_name table, as there is a convenient mapping between a MySQL connection and a thread.
Finding usage for a connection
At this point, you should set up two separate connections to your MySQL server on the command line. The first is the one that will execute the query you want to monitor memory usage for. The second will be used for monitoring purposes.
On the first connection, run these queries to get your connection ID and thread ID.
SET @cid = (SELECT CONNECTION_ID());
SET @tid = (SELECT thread_id FROM performance_schema.threads WHERE PROCESSLIST_ID=@cid);
Then grab these values. Of course, yours will likely look different than what you see here.
SELECT @cid, @tid;
+------+------+
| @cid | @tid |
+------+------+
|   49 |   89 |
+------+------+
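As an aside, you can also resolve the thread ID in a single step, since CONNECTION_ID() can be used directly in the lookup:

SELECT thread_id
FROM performance_schema.threads
WHERE PROCESSLIST_ID = CONNECTION_ID();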
Next up, execute some long-running query you'd like to profile the memory usage for. For this example, I'll do a large SELECT from a table that has 100 million rows in it, which should take a while since there is no index on the alias column:
SELECT alias FROM chat.message ORDER BY alias DESC LIMIT 100000;
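Incidentally, if you want to confirm that this query will be served by a filesort rather than an index, EXPLAIN will tell you; the Extra column of the output should include "Using filesort" (exact output varies by MySQL version):

EXPLAIN SELECT alias FROM chat.message ORDER BY alias DESC LIMIT 100000;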
Now, while this is executing, switch over to your other console connection and run the following, replacing the thread ID with the one from your connection:
SELECT event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE thread_id = YOUR_THREAD_ID
ORDER BY current_number_of_bytes_used DESC;
You should see results along the lines of this, though the details will depend highly on your query and data:
+---------------------------------------+------------------------------+
| event_name                            | current_number_of_bytes_used |
+---------------------------------------+------------------------------+
| memory/sql/Filesort_buffer::sort_keys |                       203488 |
| memory/innodb/memory                  |                       169800 |
| memory/sql/THD::main_mem_root         |                        46176 |
| memory/innodb/ha_innodb               |                        35936 |
...
This shows the amount of memory in use for each category at the exact moment the monitoring query was executed. If you run this query several times while the other SELECT alias... query is executing, you may see differences in the results, as memory usage for a query is not necessarily constant over its whole execution. Each execution of this monitoring query represents a sample at a moment in time. Thus, if we want to see how the usage changes over time, we'll need to take many samples.
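Also worth knowing: these summary tables track a high-water mark alongside the current value, so even if your samples miss the peak, you can read the maximum usage observed for the thread afterward:

SELECT event_name, high_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE thread_id = YOUR_THREAD_ID
ORDER BY high_number_of_bytes_used DESC
LIMIT 5;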
The documentation for memory/sql/Filesort_buffer::sort_keys is missing from the performance_schema.setup_instruments table:
SELECT name, documentation
FROM performance_schema.setup_instruments
WHERE name LIKE 'memory%sort_keys';
+---------------------------------------+---------------+
| name                                  | documentation |
+---------------------------------------+---------------+
| memory/sql/Filesort_buffer::sort_keys | <null>        |
+---------------------------------------+---------------+
However, the name indicates that this is memory used to hold sort keys during a filesort operation. This makes sense, as a large part of the expense of this query is sorting the data so that it can be returned in descending order.
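The memory devoted to an in-progress filesort is influenced in part by the sort_buffer_size system variable, so when this category dominates a profile, it's a natural knob to inspect. A quick check (the SET value below is only an example, not a recommendation):

SHOW VARIABLES LIKE 'sort_buffer_size';
-- Example only: raise the sort buffer for the current session
SET SESSION sort_buffer_size = 4 * 1024 * 1024;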
Collecting usage over time
As a next step, we need to be able to sample this memory usage over time. For short queries this is less useful, as we'll only be able to execute the monitoring query once, or a small number of times, while the profiled query runs. It's more useful for longer-running queries, ones that take multiple seconds or minutes. These are the types of queries we'd want to profile anyway, as they are the ones most likely to use a large amount of memory.
This could be implemented fully in SQL and invoked via a stored procedure, along the lines of the sketch below. However, in this case, let's use a separate script in Python to provide the monitoring, as it will be easier to extend.
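For reference, a stored-procedure version of the sampler might look roughly like this. This is a hypothetical sketch: the procedure name, the 250 ms cadence, and the LIMIT are arbitrary choices, and each loop iteration simply emits another result set to the caller.

DELIMITER //
CREATE PROCEDURE monitor_memory(IN tid BIGINT, IN num_samples INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < num_samples DO
    -- Emit one sample: the top 4 memory categories for the thread
    SELECT event_name, current_number_of_bytes_used
    FROM performance_schema.memory_summary_by_thread_by_event_name
    WHERE thread_id = tid
    ORDER BY current_number_of_bytes_used DESC
    LIMIT 4;
    DO SLEEP(0.25);  -- pause 250 milliseconds between samples
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

With that alternative noted, here is the Python version: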
#!/usr/bin/env python3

import time
import MySQLdb
import argparse

# Query the top 4 memory categories for a given thread, largest first
MEM_QUERY = '''
SELECT event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE thread_id = %s
ORDER BY current_number_of_bytes_used DESC
LIMIT 4
'''

parser = argparse.ArgumentParser()
parser.add_argument('--thread-id', type=int, required=True)
args = parser.parse_args()

dbc = MySQLdb.connect(host='127.0.0.1', user='root', password='password')
c = dbc.cursor()

ms = 0
while True:
    c.execute(MEM_QUERY, (args.thread_id,))
    results = c.fetchall()
    print(f'\n## Memory usage at time {ms} ##')
    for r in results:
        # Strip the leading 'memory/' prefix and report usage in kilobytes
        print(f'{r[0][7:]} -> {round(r[1]/1024, 2)}Kb')
    ms += 250
    time.sleep(0.25)
This is a simple, first stab at such a monitoring script. In summary, this code does the following:
- Gets the thread ID to monitor from the command line
- Sets up a connection to the MySQL database
- Every 250 milliseconds, executes a query to fetch the top four memory categories in use and prints a readout
This could be adjusted in many ways depending on your profiling needs: for example, tweaking how frequently it polls the server or changing how many memory categories are listed per iteration. Running this while a query is executing provides results like this:
...

## Memory usage at time 4250 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 4500 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 4750 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

## Memory usage at time 5000 ##
innodb/row0sel -> 25.22Kb
sql/String::value -> 16.07Kb
sql/user_var_entry -> 0.41Kb
innodb/memory -> 0.23Kb

...
This is great, but there are a few weaknesses. It would be nice to see more than the top four memory usage categories, but increasing that number also increases the size of this already-large output dump. It would also be nice to have an easier way to get an at-a-glance picture of the memory usage via some visualization. This could be done by having the script dump the results to a CSV or JSON file and loading them into a visualization tool later. Even better, we could plot the results live, as the data streams in. This provides a more up-to-date view and allows us to observe the memory usage as it happens, all in one tool.
Plotting memory usage
To make this tool more useful and provide visualizations, a few changes are in order:
- The user will provide the connection ID on the command line, and the script will be responsible for finding the underlying thread.
- The frequency at which the script requests memory data will be configurable, also via the command line.
- The matplotlib library will be used to generate a visualization of the memory usage. This will consist of a stack plot with a legend showing the top memory usage categories, retaining the past 50 samples.
It's quite a bit of code, but is included here for the sake of completeness.
#!/usr/bin/env python3

import matplotlib.pyplot as plt
import MySQLdb
import argparse

# Order by name so each category keeps a stable position across
# samples, which the stack plot relies on
MEM_QUERY = '''
SELECT event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE thread_id = %s
ORDER BY event_name DESC'''

TID_QUERY = '''
SELECT thread_id
FROM performance_schema.threads
WHERE PROCESSLIST_ID = %s'''

class MemoryProfiler:
    def __init__(self):
        self.x = []
        self.y = []
        # Placeholder label so the initial legend reserves space
        self.mem_labels = ['XXXXXXXXXXXXXXXXXXXXXXX']
        self.ms = 0
        self.color_sequence = ['#ffc59b', '#d4c9fe', '#a9dffe', '#a9ecb8',
                               '#fff1a8', '#fbbfc7', '#fd812d', '#a18bf5',
                               '#47b7f8', '#40d763', '#f2b600', '#ff7082']
        plt.rcParams['axes.xmargin'] = 0
        plt.rcParams['axes.ymargin'] = 0
        # Assumes the Inter font is installed; remove if unavailable
        plt.rcParams['font.family'] = 'inter'

    def update_xy_axis(self, results, frequency):
        self.ms += frequency
        self.x.append(self.ms)
        if len(self.y) == 0:
            self.y = [[] for _ in range(len(results))]
        for i in range(len(results)-1, -1, -1):
            usage = float(results[i][1]) / 1024
            self.y[i].append(usage)
        # Retain only the 50 most recent samples
        if len(self.x) > 50:
            self.x.pop(0)
            for i in range(len(self.y)):
                self.y[i].pop(0)

    def update_labels(self, results):
        total_mem = sum(map(lambda e: e[1], results))
        self.mem_labels.clear()
        for i in range(len(results)-1, -1, -1):
            usage = float(results[i][1]) / 1024
            mem_type = results[i][0]
            # Remove 'memory/' from beginning of name for brevity
            mem_type = mem_type[7:]
            # Only show top memory users in the legend; labels starting
            # with an underscore are hidden by matplotlib
            if usage < total_mem / 1024 / 50:
                mem_type = '_' + mem_type
            self.mem_labels.insert(0, mem_type)

    def draw_plot(self, plt):
        plt.clf()
        plt.stackplot(self.x, self.y, colors=self.color_sequence)
        plt.legend(labels=self.mem_labels, bbox_to_anchor=(1.04, 1),
                   loc='upper left', borderaxespad=0)
        plt.xlabel('milliseconds since monitor began')
        plt.ylabel('Kilobytes of memory')

    def configure_plot(self, plt):
        plt.ion()
        fig = plt.figure(figsize=(12, 5))
        plt.stackplot(self.x, self.y, colors=self.color_sequence)
        plt.legend(labels=self.mem_labels, bbox_to_anchor=(1.04, 1),
                   loc='upper left', borderaxespad=0)
        plt.tight_layout(pad=4)
        return fig

    def start_visualization(self, database_connection, thread_id, frequency):
        c = database_connection.cursor()
        fig = self.configure_plot(plt)
        while True:
            c.execute(MEM_QUERY, (thread_id,))
            results = c.fetchall()
            self.update_xy_axis(results, frequency)
            self.update_labels(results)
            self.draw_plot(plt)
            fig.canvas.draw_idle()
            fig.canvas.start_event_loop(frequency / 1000)

def get_command_line_args():
    '''
    Process arguments and return argparse object to caller.
    '''
    parser = argparse.ArgumentParser(description='Monitor MySQL query memory for a particular connection.')
    parser.add_argument('--connection-id', type=int, required=True,
                        help='The MySQL connection to monitor memory usage of')
    parser.add_argument('--frequency', type=float, default=500,
                        help='The frequency at which to ping for memory usage updates in milliseconds')
    return parser.parse_args()

def get_thread_for_connection_id(database_connection, cid):
    '''
    Get the thread ID corresponding to a connection ID

    PARAMS
      database_connection - Database connection object
      cid - The connection ID to find the thread for
    '''
    c = database_connection.cursor()
    c.execute(TID_QUERY, (cid,))
    result = c.fetchone()
    return int(result[0])

def main():
    args = get_command_line_args()
    database_connection = MySQLdb.connect(host='127.0.0.1', user='root', password='password')
    thread_id = get_thread_for_connection_id(database_connection, args.connection_id)
    m = MemoryProfiler()
    m.start_visualization(database_connection, thread_id, args.frequency)
    database_connection.close()

if __name__ == '__main__':
    main()
With this, we can do detailed monitoring of executing MySQL queries. To use it, first get the connection ID for the connection you want to profile:
SELECT CONNECTION_ID();
Then, executing the following will begin a monitoring session:
./monitor.py --connection-id YOUR_CONNECTION_ID --frequency 250
When executing a query on the database, we can observe the increase in memory usage, and see what categories of memory are the largest contributors.
This visualization can also help us to clearly see what kinds of operations are memory hogs. For example, here is a snippet of a memory profile for creating a FULLTEXT
index on a large table:
The memory usage is significant, and continues to grow into using hundreds of megabytes as it executes.
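For reference, the statement being profiled there was of this general shape (the column name below is illustrative, not taken from the run above):

-- 'body' is a hypothetical text column on the large table
ALTER TABLE chat.message ADD FULLTEXT INDEX fts_message_body (body);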
Note
For another example of how you can use MySQL to profile memory usage, check out this DBAMA presentation and the corresponding GitHub repository.
Conclusion
Though it may not be needed as often, having the ability to get detailed memory usage information can be extremely valuable when the need for detailed query optimization arises. Doing this can reveal when and why MySQL may be causing memory pressure on the system, or whether a memory upgrade for your database server may be needed. MySQL provides a number of primitives that you can build upon to develop profiling tooling for your queries and workload.