Character sets are like the force: they surround us and penetrate us, binding all our digital world together. A character set is how we convert the 1’s and 0’s that the computer understands into human-readable characters like ABC. In one of the first character sets, ASCII, the number 97 is translated to “a” and 63 is the question mark (?).
“Are there other languages besides English?”
“Don’t think so, Bob.”
The trouble with ASCII is that it was created back in the ‘60s by a bunch of Americans and they were not thinking about French or German, they were thinking about English. Guess what? ASCII works great for American English, but there’s no way it can encode even Spanish, let alone something like Chinese, so it only met the needs of around 5% of the world (that’s good enough, right?).
Encoding All Languages with Unicode
Well, a couple of Americans got together in the 80s, fixed the mistake of their fellow citizens after realizing that there are other people besides anglophones. They created a character set that could represent any language, naming it Unicode, in a fit of world domination. While the existence of other languages surprised developers, and continues to blindside many of them even today, eventually many programming languages and libraries adopted Unicode.
One of the barriers to adoption was that Unicode could take 2-4x as much space to represent a character. Oh, and there was already bunch of text in ASCII and programs that used ASCII. So, a smart man, Ken Thompson of Unix fame, created a version of Unicode called UTF-8 that is a superset of ASCII, so any valid ASCII text is valid UTF-8 text (though the opposite is not true), and it only uses one byte for ASCII characters. This made the path to adoption much easier.
(Sidenote: If you’re interested in learning more about character sets, look no further than Joel Spolsky’s blog.)
One of the hardest parts of character sets is that they’re often assumed, not stated explicitly. This makes it a joy to troubleshoot an issue with character sets. When you pull up a Web page, there is a ton of code that must understand what character set to use: the database, the application, the client-side code, the language runtime, the standard library, and the browser.
In one case, I dealt with an issue in PHP where some functions treated strings as UTF-8 and others as Latin1 (another common charset that can represent most European languages, and it’s the default charset in MySQL). The PHP maintainers then fixed this in the next point release, so you had to be careful which version of the documentation you were reading. I spent many days going back and forth with developers and testers as we tried to make sure the entire pipeline of our application was UTF-8. The bottom line was that we should’ve just built our software in UTF-8 from the beginning, instead of trying to fiddle with other character sets.
Enabling UTF-8 in MySQL
The great news about UTF-8 is that it has near universal adoption: it’s on Windows, Mac, and Linux, and most programing languages, text editors, and databases understand it. MySQL is no exception. I’d recommend using UTF-8 as the default with MySQL for two simple reasons: it enables true internationalization and it’s pretty much the only charset outside of ASCII that is universal.
For those green field development projects, you can enable UTF-8 pretty easily. Just add this to your my.cnf (usually found in /etc/mysq/my.cnf) under the heading [mysqld]:
character-set-server = utf8 collation-server = utf8_general_ci
(The collation is specifies how to sort words. It’s pretty safe to use utf8_general_ci, but google it if you’re unsure.)
If you have existing databases and tables, you can just convert them with these commands:
Alter Database <your_database> Character Set utf8 Collate utf8_general_ci; Alter Table <your_table> Convert To Character Set utf8 Collate utf8_general_ci;
Two things to note: if you’ve already stuffed, say, MacRoman into your Latin1 database, you can’t just convert it over and your data will be magically fixed. Second, you need to test this before you do this in production.
The other half of this is to make sure that you set make your application use UTF-8 when it talks to MySQL. In PHP, it’s as simple as putting “charset=utf8” in your connection string:
$connection_str = 'mysql:host=<your_server>;dbname=<your_database>;charset=utf8';
If this all seems complex and nasty, and you want nothing more than to pretend that charsets don’t exist, know that eventually you will have to deal with them. This little article and these two settings will pay off in the end. So, put on your grown-up pants and use UTF-8.
The next simple setting we’ll talk about is using InnoDB (or, the only sane general storage engine) in MySQL.
Want to hear more about what I’ve learned about MySQL? Well, I’m writing a book about it! You can subscribe below and I’ll send you updates. I’ll also send out stuff on MySQL like tips and tutorials.