Character sets and collations in MySQL
By Aaron Francis |
Character sets and collations are fundamentally important concepts to understand when dealing with string columns in MySQL. A slight misunderstanding of either can lead to poor performance or unexpected errors when inserting data.
A character set defines the characters allowed to go in a column. A collation is a set of rules for comparing those characters. Each character set can have multiple collations, but a collation may only belong to one character set.
Character sets in MySQL
MySQL supports a wide range of character sets, which you can view by selecting from the information_schema
database.
SELECT * FROM information_schema.character_sets ORDER BY character_set_name
This will list out all of the character sets, along with their default collations. Every character set has one default collation.
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN | |--------------------|----------------------|---------------------------------|--------| | armscii8 | armscii8_general_ci | ARMSCII-8 Armenian | 1 | | ascii | ascii_general_ci | US ASCII | 1 | | big5 | big5_chinese_ci | Big5 Traditional Chinese | 2 | | binary | binary | Binary pseudo charset | 1 | | cp1250 | cp1250_general_ci | Windows Central European | 1 | | cp1251 | cp1251_general_ci | Windows Cyrillic | 1 | | cp1256 | cp1256_general_ci | Windows Arabic | 1 | | cp1257 | cp1257_general_ci | Windows Baltic | 1 | | cp850 | cp850_general_ci | DOS West European | 1 | | cp852 | cp852_general_ci | DOS Central European | 1 | | cp866 | cp866_general_ci | DOS Russian | 1 | | cp932 | cp932_japanese_ci | SJIS for Windows Japanese | 2 | | dec8 | dec8_swedish_ci | DEC West European | 1 | | eucjpms | eucjpms_japanese_ci | UJIS for Windows Japanese | 3 | | euckr | euckr_korean_ci | EUC-KR Korean | 2 | | gb18030 | gb18030_chinese_ci | China National Standard GB18030 | 4 | | gb2312 | gb2312_chinese_ci | GB2312 Simplified Chinese | 2 | | gbk | gbk_chinese_ci | GBK Simplified Chinese | 2 | | geostd8 | geostd8_general_ci | GEOSTD8 Georgian | 1 | | greek | greek_general_ci | ISO 8859-7 Greek | 1 | | hebrew | hebrew_general_ci | ISO 8859-8 Hebrew | 1 | | hp8 | hp8_english_ci | HP West European | 1 | | keybcs2 | keybcs2_general_ci | DOS Kamenicky Czech-Slovak | 1 | | koi8r | koi8r_general_ci | KOI8-R Relcom Russian | 1 | | koi8u | koi8u_general_ci | KOI8-U Ukrainian | 1 | | latin1 | latin1_swedish_ci | cp1252 West European | 1 | | latin2 | latin2_general_ci | ISO 8859-2 Central European | 1 | | latin5 | latin5_turkish_ci | ISO 8859-9 Turkish | 1 | | latin7 | latin7_general_ci | ISO 8859-13 Baltic | 1 | | macce | macce_general_ci | Mac Central European | 1 | | macroman | macroman_general_ci | Mac West European | 1 | | sjis | sjis_japanese_ci | Shift-JIS Japanese | 2 | | swe7 | swe7_swedish_ci | 7bit Swedish | 1 | | tis620 | tis620_thai_ci | TIS620 Thai | 1 | | ucs2 | ucs2_general_ci | UCS-2 Unicode | 2 | | ujis | ujis_japanese_ci | EUC-JP Japanese | 3 | | utf16 | utf16_general_ci | UTF-16 Unicode | 4 | | utf16le | utf16le_general_ci | UTF-16LE Unicode | 4 | | utf32 | utf32_general_ci | UTF-32 Unicode | 4 | | utf8 | utf8_general_ci | UTF-8 Unicode | 3 | | utf8mb4 | utf8mb4_0900_ai_ci | UTF-8 Unicode | 4 |
At the bottom of this table, you'll notice two character sets described as UTF-8 Unicode
. The utf8
charset has a MAXLEN
of 3 while the utf8mb4
has a MAXLEN
of 4. What's being described here is the maximum allowed length, in bytes, per character.
According to the UTF-8 spec, each character is allowed four bytes, meaning MySQL's utf8
charset was never actually UTF-8 since it only supported three bytes per character. In MySQL 8, utf8mb4
is the default character set and the one you will use most often. utf8
is left for backwards compatibility and should no longer be used.
How do you define a character set?
There are a few ways to define the character set of a column. If you don't specify a character set at the table or column level, the server default of utf8mb4
will be applied (unless you've explicitly declared a different server or database default).
We can prove this by creating a table with no character set information and then reading it back:
CREATE TABLE no_charset ( my_column VARCHAR(255) ); SHOW CREATE TABLE no_charset;
The resulting CREATE TABLE
statement shows that the default charset and collation have been applied.
CREATE TABLE `no_charset` ( `my_column` varchar(255) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Defining at the table level
Instead of allowing the database or server default to apply, you can explicitly set the character set at the table level by using the CHARSET=[charset]
notation. Here, we'll create a table where all character columns have the latin1
charset:
CREATE TABLE `no_charset` ( `my_column` VARCHAR(255) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1
Defining at the column level
Finally, you can set the character set at the column level. This is the most specific and overrides any table-level settings.
CREATE TABLE `mixed_collations` ( `explicitly_set` VARCHAR(255) CHARACTER SET latin1, `implicitly_set` VARCHAR(255) );
Reading this table back with the SHOW CREATE TABLE
statement makes it clear that the table is utf8mb4
, but the explicitly_set
column is latin1
:
CREATE TABLE `mixed_collations` ( `explicitly_set` VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL, `implicitly_set` VARCHAR(255) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
This is the most specific way to declare a character set.
A column-level specification will override a table-level specification, a table-level specification overrides the database default, and a database-level charset overrides the server default.
Collations in MySQL
While character sets define the legal characters that can be stored in a column, collations are rules that determine how string comparisons are made. If you are sorting or comparing strings, MySQL uses the collation to decide the order and whether the strings are the same.
You can show all the collations by querying the information_schema
table again. There are a lot of collations, so we'll restrict the results to only collations that apply to the utf8mb4
charset.
SELECT * FROM information_schema.collations WHERE character_set_name = 'utf8mb4' ORDER BY collation_name
This query will display all the collations, related character set names, whether they are default, and a few other pieces of information. Notice that for each character set, there is one default collation. For example, utf8mb4_0900_ai_ci
is the default collation for the utf8mb4
character set.
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE | |----------------------------|--------------------|-----|------------|-------------|---------|---------------| | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD | | utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD | | utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD | | utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD | | utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE | | utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE | | utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD | | utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | | Yes | 0 | NO PAD | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | PAD SPACE | | utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | PAD SPACE | | utf8mb4_da_0900_ai_ci | utf8mb4 | 267 | | Yes | 0 | NO PAD | | utf8mb4_da_0900_as_cs | utf8mb4 | 290 | | Yes | 0 | NO PAD | | utf8mb4_de_pb_0900_ai_ci | utf8mb4 | 256 | | Yes | 0 | NO PAD | | utf8mb4_de_pb_0900_as_cs | utf8mb4 | 279 | | Yes | 0 | NO PAD | | [omitted for brevity] | ... | ... | | ... | ... | ... | | utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | PAD SPACE | | utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 0 | NO PAD | | utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | | Yes | 0 | NO PAD | | utf8mb4_zh_0900_as_cs | utf8mb4 | 308 | | Yes | 0 | NO PAD |
Collations follow a naming scheme whereby the character set forms a prefix, and the suffix combines attributes of the collation.
Here is a breakdown of a few of the suffixes you might see:
| Suffix | Meaning | |--------|--------------------| | _ai | Accent-insensitive | | _as | Accent-sensitive | | _ci | Case-insensitive | | _cs | Case-sensitive | | _ks | Kana-sensitive | | _bin | Binary |
Let's take the default utf8mb4
collation of utf8mb4_0900_ai_ci
and expand it slightly.
The utf8mb4
part declares it belongs to the utf8mb4
charset. The 0900
references the UCA 9.0.0 weight keys. _ai
means the collation is accent-insensitive, while _ci
declares it case-insensitive.
This allows us to confidently answer the question, "are string comparisons case-sensitive?" The answer, of course, is: it depends! It depends on the collation.
Let's prove this by explicitly casting strings using the COLLATE
keyword.
SELECT "MySQL" COLLATE utf8mb4_0900_ai_ci = "mysql" COLLATE utf8mb4_0900_ai_ci;
Running this statement gives us a value of 1
, meaning MySQL treats the two strings as equal. If we were to rerun it with a case-sensitive collation, we'd expect (and obtain!) a different result:
SELECT "MySQL" COLLATE utf8mb4_0900_as_cs = "mysql" COLLATE utf8mb4_0900_as_cs;
This query returns a value of 0
, meaning MySQL sees these strings as unique because they are cased differently.
The same logic holds for accent sensitivity. With an accent-insensitive collation, résumé
and resume
would be deemed identical because the accents would be ignored.
How do you define a collation?
Like character sets, collations can be set at both the table and column levels. If a collation is not explicitly defined, MySQL uses the default collation of the character set.
To define a collation at the table level, you can use the COLLATE
clause in the CREATE TABLE
statement. For example, you can create a table where all character columns use the utf8mb4_bin
collation:
CREATE TABLE table_with_collation ( my_column VARCHAR(255) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
If you want to define the collation at the column level, you can do so in the column definition. The following example creates a table with two columns: explicitly_set
uses the utf8mb4_general_ci
collation, and implicitly_set
uses the default collation from the utf8mb4
charset, which is utf8mb4_0900_ai_ci
.
CREATE TABLE table_with_collation ( `explicitly_set` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci, `implicitly_set` varchar(255) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
You can also change the collation of a column in an existing table using the ALTER TABLE
statement:
ALTER TABLE table_with_collation CHANGE `explicitly_set` `explicitly_set` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
Summary
Understanding character sets and collations is fundamental when dealing with string data in MySQL. A character set defines the legal characters that can be stored in a column, while a collation determines how string comparisons are made.
- A character set can be defined at the column level, the table level, or it can be inherited from the database or server default. The most specific level (column > table > database > server) is used.
- A collation can be defined at the column level, the table level, or it can be inherited from the character set default. Again, the most specific level is used.
- The character set and collation of a column affect how data is stored and how it is compared and sorted. Be mindful of these settings to ensure the correct behavior and optimal performance when designing your database.
If you are unsure which character set or collation to use, the MySQL default utf8mb4
character set and its default utf8mb4_0900_ai_ci
collation are usually good choices. They support all Unicode characters and provide case-insensitive and accent-insensitive comparisons.