Introduction to Character Set and Character Encoding For WordPress Users

Naresh Devineni
Jun 21, 2018
<!DOCTYPE html>	
<html <?php language_attributes(); ?>>
<head>
<meta charset="<?php bloginfo( 'charset' ); ?>">
</head>

In the code snippet above, WordPress defines the character set of the document using bloginfo( 'charset' ). If you view the page source of a default installation, the charset attribute comes out with the value UTF-8.

If you are like me, you would want to understand every piece of code that you write inside your WordPress theme.

So, why do you need to define a charset on HTML documents anyway?

How computers store and manage files

Something like "charset" exists because of a mismatch between us (the humans) and the computer. We are good at reading words made of characters like A, B, C, D, a, b, &, and ). That is how we communicate and pass information around.

One common way of passing information around is by using computers. We create files such as Word documents, PowerPoint presentations, and text documents (our code files are text documents too!) and save them to the computer's disk, right? We keep modifying them all the time, and we also e-mail them to our peers.

The conflict is that computers do not store files the way we see them. They store every file in binary form, as 0's and 1's, because modern computers only store and process 0's and 1's (binary numbers).

For example, take a WordPress theme file, say header.php. If you view it in its raw binary form, all you see is 0's and 1's. The Linux terminal has a useful command for this, xxd -b header.php, which you run from the theme folder that contains header.php. Here is the binary view of the header.php file.
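If you don't have xxd at hand, here is a rough Python sketch of the same idea. It is only an illustration: the helper name to_bits is made up for this example, and the sample bytes are an assumption about how a header.php might begin, not the contents of a real file.

```python
def to_bits(data: bytes) -> str:
    """Return each byte of `data` as an 8-digit binary string, space separated."""
    return " ".join(format(byte, "08b") for byte in data)


# Assumed opening bytes of a header.php file (illustration only).
sample = b"<!DOCTYPE html>"

# Show the first four bytes: "<", "!", "D", "O".
print(to_bits(sample[:4]))  # 00111100 00100001 01000100 01001111
```

To look at a real file, you could pass open("header.php", "rb").read() to to_bits instead of the sample bytes.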

Now if you view the same file in a code editor or a basic text editor, you would see something like this.

So, what is the trick here? Well, when you open a file in software like a code editor or PowerPoint, the computer uses a "Character Set" together with a "Character Encoding" to transform the binary information into plain text that humans can understand, and it does the reverse when you save the file to disk.

So what is a Character Set and a Character encoding anyway?

Character Set

A Character Set is nothing but a collection of characters and symbols like A, a, b, d, 1, 2, &, ), etc. For example, the ASCII character set contains 128 characters: the letters, digits, and punctuation used in English, plus a few control characters.

But the web is a collection of websites in many languages like Hindi, Chinese, Spanish, etc. So the "Unicode" character set became the popular choice for web pages, as it includes characters and symbols for most of the living languages in our world. Unicode lets people write web pages in their native language.

In Unicode, each character is associated with a unique number, called its code point.
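Here is a small Python sketch of that pairing between characters and numbers. Python's built-in ord() returns a character's code point; the helper code_point_label is a made-up name for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# A few characters from different scripts: Latin, a symbol, Devanagari, Chinese.
for ch in ["A", "a", "&", "\u0905", "\u4e2d"]:
    print(ch, ord(ch), code_point_label(ch))

# chr() goes the other way, from a code point back to the character.
print(chr(72))  # H
```

Notice that the code point is just a number. It says nothing yet about how that number is stored as bytes.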

Character Encoding

Finally, a Character Encoding tells the computer how to transform raw binary 0's and 1's into real characters like a, C, G, H, etc. It does this by defining how each character's number (its code point) is written as bytes. Simply put, it is all about encoding and decoding binary information to/from characters.

UTF-8 is the most popular character encoding used along with the Unicode character set. The number 8 in UTF-8 tells us that it works in units of 8 bits, and every 8 bits counts as 1 byte. It does not mean that every character fits in 8 bits: a character takes between 1 and 4 bytes. For example, if you observe the below image, the letter H can be represented in binary as 01001000 (8 bits = 1 byte). Characters in languages like Chinese take more than 1 byte, usually 3 bytes per character.

In the below image, you'll see the binary information and the associated code point for each letter of the word "Hello".

It is important to remember that computer files like Word documents and code files are always saved in a particular character encoding. You usually don't have to worry about this, because most software picks one by default, and these days that is usually UTF-8. You can also configure your software to use a specific character encoding!

For example, my favorite code editor, Visual Studio Code, saves every code file using UTF-8 character encoding by default.
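To make the byte counts concrete, here is a minimal Python sketch that encodes a few characters as UTF-8 and counts the bytes. The helper name utf8_length is made up for this example.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# H, e with acute accent, a Chinese character, and the tetragram U+1D306.
for ch in ["H", "\u00e9", "\u4e2d", "\U0001D306"]:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{byte:02x}" for byte in encoded)
    print(ch, utf8_length(ch), hex_bytes)

# Decoding reverses the step: bytes back to text.
print(b"\xe4\xb8\xad".decode("utf-8"))
```

The four characters need 1, 2, 3, and 4 bytes respectively, which is why UTF-8 is called a variable-length encoding.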
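The same information can be printed with a short Python sketch. The function name describe is made up for this example; it lists each character with its code point and its UTF-8 bits.

```python
def describe(text: str):
    """Yield (character, code point, UTF-8 bits) for each character in `text`."""
    for ch in text:
        bits = " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))
        yield ch, ord(ch), bits


for ch, code_point, bits in describe("Hello"):
    print(f"{ch}  {code_point:>3}  {bits}")
# H   72  01001000
# e  101  01100101
# l  108  01101100
# l  108  01101100
# o  111  01101111
```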
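If you are curious what "saving in a particular encoding" looks like in code, here is a minimal Python sketch. The helper saved_size is a made-up name; it writes the same text to a temporary file in a given encoding and reports the file size, so you can see that the encoding changes the bytes on disk.

```python
import os
import tempfile


def saved_size(text: str, encoding: str) -> int:
    """Write `text` to a temporary file in `encoding` and return its size in bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(text)
        return os.path.getsize(path)
    finally:
        os.remove(path)  # clean up the temporary file


text = "caf\u00e9"  # "cafe" with an accented e: four characters
print(saved_size(text, "utf-8"))    # 5, the accented e takes two bytes
print(saved_size(text, "latin-1"))  # 4, the accented e takes one byte
```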

It is important to note that Character Set and Character Encoding are two totally different concepts. But during the early days of computing, there was no clear distinction between them. Hence, the two terms are often used synonymously. For example, this confusion can be seen in the HTML charset meta tag.

The value of the charset is "utf-8", but it's actually about character encoding, not charset, isn't it?

Even WordPress and MySQL follow the path of HTML by using the term "charset" along with "utf8". So, from here on, if you see the term "charset" with a value of utf8, don't get confused.

To clear up some of the confusion, the HTML4 specification explicitly says this.

The "charset" attributes ( %Charset in the DTD) refer to a character encoding as described in the section on character encodings. Values must be strings (e.g., "euc-jp") from the IANA registry (see [CHARSETS] for a complete list).

"So, how are the concepts mentioned above related to HTML web pages and WordPress anyway?"

Why you should specify Character Encoding for HTML documents

How do we access web pages? Using the browser, right? When the browser receives the response bytes from the server, it uses the specified character encoding to transform those bytes into an internal representation, and after some processing, it finally renders content like text and multimedia on the screen.

Specifying the character encoding ensures that you can use characters from all the well recognized human languages in your HTML document, and the browser will render them reliably.
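To see why this matters, here is a small Python sketch of what goes wrong when bytes are decoded with the wrong encoding. The render function is a hypothetical stand-in for the browser's decoding step, not how any real browser is implemented.

```python
def render(response_bytes: bytes, declared_encoding: str) -> str:
    """Decode bytes the way a browser would if it trusted `declared_encoding`."""
    return response_bytes.decode(declared_encoding)


# The server sends "cafe" with an accented e, encoded as UTF-8.
page_bytes = "caf\u00e9".encode("utf-8")

print(render(page_bytes, "utf-8"))   # café
print(render(page_bytes, "cp1252"))  # cafÃ©
```

The bytes are identical in both calls. Only the declared encoding differs, and that alone is enough to turn one accented letter into two junk characters.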

There are a few ways you can specify the character encoding of an HTML document. Using the charset meta tag is just one of them. You can also set the document's character encoding on the server side, typically through the charset parameter of the Content-Type HTTP response header. If you have set up server-side character encoding, it takes precedence over the charset meta tag in your HTML document.
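That precedence can be sketched in a few lines of Python. This is a deliberate simplification: the function name and the fallback value are assumptions made for this example, and real browsers consider more signals (such as a byte order mark) than the two shown here.

```python
from typing import Optional


def effective_encoding(http_charset: Optional[str],
                       meta_charset: Optional[str],
                       fallback: str = "windows-1252") -> str:
    """Pick an encoding: server-side setting first, then the meta tag, then a fallback.

    The fallback default is an assumption for illustration only.
    """
    if http_charset:
        return http_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    return fallback


# The server-side value wins over the meta tag.
print(effective_encoding("ISO-8859-1", "UTF-8"))  # iso-8859-1
# With no server-side value, the meta tag is used.
print(effective_encoding(None, "UTF-8"))          # utf-8
```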

How WordPress deals with Character encoding

For code editors, we can specify the character encoding for every file we create. How about WordPress? How do we save our posts and pages in a specific character encoding? The answer is simple. By default, WordPress uses the charset of the database's tables when it saves the posts or pages you have written to the database. We created a database using phpMyAdmin and linked it to our WordPress installation, remember?

If you observe the above image, we did not even bother to choose the collation field; we left it to phpMyAdmin to choose the default collation. A collation is nothing but a set of rules that specify how the characters in a Character Set are compared and sorted (whatever that is :P). It is internal to the database. For more information, please read this article from the MySQL website.

The catch here is, WordPress does not care about the Character Set or the collation of the database. It does not create the database for us; we created it ourselves. But during installation, WordPress does create tables within the chosen database, and it only cares about the character set and collation of those tables. Database creation is totally independent of table creation.

Now, this brings up the final three questions. How does WordPress determine which character set to use while creating tables? Does WordPress give us the control to choose the character set of our liking? Could we change the Character Set in the middle of the project?

How does WordPress determine the character set of the tables?

If you installed WordPress using the famous five-minute installation, this happens during the install. Once you have entered the database connection details and hit the submit button, and before you hit the "Run the installation" button, WordPress generates the wp-config.php file with the database charset set to utf8mb4.

/** Database Charset to use in creating database tables. */
define('DB_CHARSET', 'utf8mb4');

/** The Database Collate type. Don't change this if in doubt. */
define('DB_COLLATE', '');

WordPress determines the DB_CHARSET value based on the MySQL version, the database collation set inside the wp-config.php file, and the WordPress version. If your MySQL version is older than 5.5.3 (the release that introduced utf8mb4), or if the version of WordPress you are installing is older than 4.2, WordPress would have chosen the utf8 charset for its tables instead.
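Roughly, the decision described above can be sketched like this in Python. This is a simplified stand-in and not WordPress's actual code: the function name and the tuple-based versions are made up for illustration, the thresholds are assumptions (MySQL 5.5.3 is the release that added utf8mb4, and WordPress 4.2 is the release that started using it), and the collation check is left out.

```python
def pick_table_charset(mysql_version: tuple, wordpress_version: tuple) -> str:
    """Hypothetical sketch: choose utf8mb4 only when both versions are new enough."""
    # Assumed thresholds, see the notes above the code.
    mysql_ok = mysql_version >= (5, 5, 3)
    wordpress_ok = wordpress_version >= (4, 2)
    return "utf8mb4" if (mysql_ok and wordpress_ok) else "utf8"


print(pick_table_charset((5, 7, 22), (4, 9, 6)))  # utf8mb4
print(pick_table_charset((5, 1, 73), (4, 9, 6)))  # utf8
print(pick_table_charset((5, 7, 22), (4, 1)))     # utf8
```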

The difference between utf8 and utf8mb4 is simple. MySQL's utf8 stores at most 3 bytes per character, so it cannot hold characters that need 4 bytes in UTF-8, such as most emoji. utf8mb4 stores up to 4 bytes per character and covers all of Unicode.

After choosing the charset, WordPress checks the value of the constant DB_COLLATE inside wp-config.php. If you leave it empty, WordPress picks a suitable collation from the chosen charset's family on its own; otherwise, it uses the value you specified.

When it comes to UTF-8, collations are divided into two families: utf8 and utf8mb4. You can take a look at these families inside phpMyAdmin.

If you want to change the charset or collation of the database tables the easy way, change them in wp-config.php before you hit the "Run the installation" button during WordPress installation. You could also change the charset in the middle of a project by following this fantastic WordPress Codex article on converting database charsets. In most cases, sticking with the default values is a good choice.
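Here is a minimal Python sketch to check whether a piece of text would fit within the 3-byte limit. The function name fits_in_mysql_utf8 is made up for this example, and it only measures UTF-8 byte lengths; it does not talk to MySQL.

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_in_mysql_utf8("Hello, \u4e2d\u6587"))   # True, Chinese characters take 3 bytes
print(fits_in_mysql_utf8("Icon: \U0001D306"))      # False, U+1D306 takes 4 bytes
```

A check like this could be run on post content before moving a site to a server that only offers the 3-byte utf8 charset.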

Here is the real reason why I am explaining all these concepts. I am not trying to scare you off at the beginning of your theme development journey. I am doing this because most popular web hosting services out there still run older, outdated MySQL versions, while local servers like MAMP and WAMP upgrade to newer versions of PHP and MySQL on a regular basis. And that becomes a big problem when we migrate our WordPress websites from our local machines to remote servers.

For example, when I tried to import my local "Dosth" database on MediaTemple, a popular web hosting service, it could not recognize the charset utf8mb4 and threw this error!

Since I migrate around 5 WordPress websites every week, I had to switch to GoDaddy, which supports utf8mb4, to give work-in-progress links to my clients. Although we may have the patience to convert from one character set to another, we cannot give up the benefits of newer and more advanced charsets like utf8mb4, right?

All I am trying to tell you is, no matter how advanced our local servers are, if the remote servers are not updated to the newer/better standards, you have to convert/degrade your database charset and sometimes, this will result in gibberish characters all over the website.

For example, utf8mb4 allows us to use complex icons like 𝌆. Now, if you had to host your client's WordPress website on a remote server that does not support utf8mb4, all the icons and characters that rely solely on this charset would appear as gibberish! I even lost a client because of charset issues :(

Uffffff! Now that you understand the importance of Character Set and Character Encoding, you can quickly solve problems related to unreadable characters, and you can educate your clients on choosing a better remote server that uses the latest versions of MySQL and PHP.

If you are still confused, I am sorry that I failed to explain these concepts better. In that case, please read the off-site resource Unicode Basics first. It is an excellent introduction to Character Sets and Character Encoding. Once you are done with it, read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Stack Overflow co-founder Joel Spolsky.
