Character Encodings For Modern Programmers

April 25, 2014 at 6:36 pm · Filed under Technical

The easiest way to understand the current state of encodings is to look at the history of how we got to where we are today. The pre-ASCII days are not particularly relevant, but they are interesting, so I have run through the various encodings that preceded ASCII from a technical perspective here if you are interested. The story of Unicode really starts with ASCII. Note that since Windows and UNIX are the only two types of operating systems of interest to most modern programmers, I’m only going to discuss them. (Linux, iOS, Android, etc. are all UNIX-based.) First I’m going to go through all the problems of the various pre-Unicode ASCII-based encodings, and then I’ll provide some programming pointers for cross-platform programming. You can skip the pre-Unicode stuff if you’re not interested in the history.

ASCII

ASCII is important because it basically defined the subset of punctuation marks and symbols that would be used in every programming language and operating system that followed. Just to make things absolutely clear, ASCII is a 7-bit encoding. There is no such thing as 8-bit ASCII. If you ever hear anyone talking about 8-bit ASCII, what they actually mean is an 8-bit encoding that uses the printable ASCII characters (32 to 126) plus maybe 2 or 3 of the control codes (LF, CR, TAB). Now, you might not realize it, but ASCII was actually designed to be a family of 7-bit encodings with some characters designated “national variants” which would be defined differently in different countries. For example, the British variant of ASCII replaces the hash (#) symbol with the pound (£) sign, while the Norwegian version replaces square brackets ([]) with Æ and Å and braces ({}) with æ and å. While this is not a problem for telegraph, it creates obvious problems for programming. C, for instance, makes extensive use of square brackets and braces. The result is trigraphs in C, where the sequences ??( and ??) can be used instead of square brackets, and ??< and ??> can be used instead of braces. (See ISO/IEC 646 and trigraphs for more information.) Given this mess, it wasn’t long before the national variants of ASCII disappeared and were replaced by 8-bit encodings. When people talk about ASCII these days, they are almost always referring to the original US variant of ASCII which was the basis for C and most other modern programming languages.

ASCII-based 8-bit Encodings

Switching from a 7-bit encoding to an 8-bit encoding gives us an extra 128 code points, and this fixes a lot of problems. Most importantly, it means that all of the national variant characters can be moved out of the lower 128 codes. Now, we have a truly common basic set of 128 characters that OS and language creators can freely use as standard characters without having to worry about trigraphs or other kludges. Secondly, we can now support languages such as Greek and Cyrillic (Russian) which require many more characters than could fit in the limited number of national variant characters in ASCII.

UNIX, Terminals, and C1

UNIX is essentially a terminal-based system. For the youth out there who don’t know what a dumb terminal is, it’s a piece of equipment with a character-based screen (typically around 80×25 characters) and a keyboard which gets connected to the UNIX server by a serial line. If you send text to the serial port of the terminal, it is the terminal that decides which actual character to print on the screen. Similarly, when you press a key on the keyboard, it is the terminal that decides which code to send to the UNIX server. On top of this, this same communication channel needs to be used for sending control sequences, such as for moving the cursor, clearing the screen, and these sequences are all sent in-line intermixed with the character codes. This means that we can’t assign characters to every code point. We need control codes, and these control codes cannot overlap the characters in the character encoding.

Now, the block of control codes 0 to 31 (known as the C0 block) as well as code 127 in ASCII are already allocated to control codes. Their meanings might not necessarily match the ASCII definitions, but that is irrelevant. They cannot be used for encoding printable characters and many have well-defined functions. Next we have the printable ASCII range of 32 to 126. Virtually all of these have some special significance (be it as command names, directory separators, shell escape characters, programming language symbols, etc.), and cannot be changed. Codes 128 and above, however, are a free-for-all. Terminals can use them as control codes, printable characters, or basically anything and it doesn’t matter to the UNIX server. As long as all of the different terminals connected to a server agree on the meanings of the printable characters, everything will work fine. We can even connect terminals that use different control characters, as long as those blocks of control characters do not overlap the printable characters.

For reasons of consistency and interoperability, a recommendation was developed that codes 128 to 159 should be reserved for a second control block (called the C1 block), and only codes 160 to 255 be used for printable characters, and this formed the basis for most pre-Unicode UNIX encodings. However, it is not a hard and fast rule, and UNIX itself does not treat the C1 block any differently from the rest of the upper 128 codes.

This makes UNIX basically language agnostic. Let’s say we have a UNIX system and we have two files with names consisting of the character codes 68 196 and 68 228. When we connect a Greek language terminal to the system and list the files, we will see “DΔ” and “Dδ”. When we connect a French terminal to the system, we will see “DÄ” and “Dä”, and when we connect a Thai terminal we will see “Dฤ” and “Dไ”. The fact that the three different terminals show 3 different things doesn’t matter. A Thai user can still open one of the files by typing “vi Dฤ” on their keyboard, the same as the Greek user can by typing “vi DΔ”. As far as the OS is concerned, as long as the underlying codes are the same, everything works fine. We might note here that “δ” is the lower case letter for “Δ” in Greek, whereas “ฤ” and “ไ” are completely different unrelated letters in Thai. This doesn’t matter to UNIX because the OS doesn’t try to do anything fancy like case-insensitive file names. The same applies to file content. If you print a file to the screen, the codes get passed directly to the terminal, and it is the terminal that does the rendering. The OS doesn’t need to get involved.

In fact, on a UNIX system we can simultaneously use as many different encodings as we like. We could create separate subdirectories for different languages, and as long as the users accessing those subdirectories used terminals set up for the same encoding, everything will work fine. Thus UNIX itself does not need an “encoding” setting. It simply doesn’t care. In Thailand, we connect Thai language terminals to our server. In Greece we connect Greek language terminals.

However, any command or system service that is going to output human-readable error messages needs to know which language to output. Similarly, any add-on programs, particularly those that might perform some kind of collation or text processing, want to know what language they should use. But again, this does not need to be a system-wide setting, and only need apply to a particular user’s session. The system thus uses environment variables (LANG, together with LC_ALL and the other LC_* variables) which specify both the language and character encoding. I’ll refer to this setting collectively as LOCALE.

DOS/Windows

DOS (and later Windows 95/98/Me), on the other hand, is a different kettle of fish. First, there is the matter of control codes. Since the screen is addressed directly through the video adapter, we don’t need any control codes. The original DOS encoding (code page 437) assigned printable characters to all of the ASCII control codes (0 to 31 and 127) as well as to all of the codes above 128, but is still considered ASCII-based because it maintains all of the ASCII printable characters (32 to 126). Now it turns out that this is a step too far. While the line feed character might be meaningless if we are directly addressing the screen, it is certainly helpful for indicating new lines in text files. We also run into problems communicating say with an ASCII serial printer which uses the ASCII control block for printer control sequences. For compatibility with any kind of hardware device, we really can’t assign printable characters to the C0 control block, and Windows 95 thus used encodings that do not use C0. However, Windows 95 has no need for terminal control codes, and the C1 control block (128 to 159) is thus assigned to printable characters.

The other issue which is particularly relevant for Windows 95/98/Me is that it uses case-insensitive file names, and for this to work, the OS needs to know the encoding. For example, the codes 196 and 228 are the same letter in Windows-1253, upper case delta (“Δ”) and lower case delta (“δ”), but different letters (“ฤ” and “ไ”) in Windows-874. Windows thus requires a system-wide encoding setting. This is not such a big problem for single-user systems, but when we start networking computers together in a business setting, things go awry. For example, on a Thai language computer, we can have two files named “Dฤ” and “Dไ”. However, if we try to copy these over the network onto a Greek-based computer, we get a problem. As far as the Greek computer is concerned, the two files have the same file name, and one file will overwrite the other. We cannot have a single file server that stores files from all of our different international offices. It isn’t going to work. Similarly, we can’t create a web server that supports multiple different sites that use different languages. It should come as no surprise then that Microsoft was one of the big backers of Unicode and produced one of the first Unicode-based OSes.

Programming With 8-Bit Encodings

At this point, programming is actually pretty easy. In most cases, you don’t even need to write encoding-aware software. A C compiler, for example, doesn’t need encoding-awareness. All localized characters (codes 128 and higher) can only appear in comments or string/char literals. Comments are ignored and literals are simply copied into the data as-is. Things like regular expressions (and so vi/sed/awk/etc.) similarly don’t need to know the encoding. All characters are byte-based, so as long as the person using these programs inputs the correct expressions for their language, they will work fine. (For example, a Norwegian person wanting to match all upper case letters is going to have to use the regexp /[A-ZÆØÅ]/ instead of /[A-Z]/, but the actual regular expression parser does not need to be changed.) The basic assumption is that we have the same encoding end-to-end, so tools don’t have to do any conversion, they just spit out whatever input they get in.

If we do want to do language-dependent operations such as collation or upper/lowercase conversion, the set of functions needed to achieve this is minimal. Collation is not even truly an encoding issue, since sort orders can vary between languages even if they use the same encoding. In C, this is handled using a few simple functions. The locale setting (in <locale.h>) defines both the language and encoding. <ctype.h> is the only encoding library we need, with functions for determining the type of a character (isdigit(), isupper(), islower(), isalpha(), etc.) and converting between upper and lower case (toupper()/tolower()). Collation is language as well as encoding dependent, and is supported by strcoll() and strxfrm() in <string.h>. This is all we need. Here is some example code you can run to see it in action.

#include <locale.h>
#include <ctype.h>
#include <stdio.h>

// Print isupper/islower/isalpha for codes 198 and 230 under the current locale
static void show(void) {
	printf("%d %d\n", (isupper(198) ? 1 : 0), (isupper(230) ? 1 : 0));
	printf("%d %d\n", (islower(198) ? 1 : 0), (islower(230) ? 1 : 0));
	printf("%d %d\n", (isalpha(198) ? 1 : 0), (isalpha(230) ? 1 : 0));
}

int main(void) {
	// West Europe: 198 = Æ, 230 = æ (upper case and lower case letters)
	// On UNIX, try a name such as "en_US.ISO-8859-1" (if that locale is installed)
	if (setlocale(LC_CTYPE, ".1252") == NULL)
		puts("West European locale not available");
	show();

	// Thai: 198 = ฦ, 230 = ๆ (letters, but neither upper nor lower case)
	// On UNIX, try a name such as "th_TH.TIS-620" (if that locale is installed)
	if (setlocale(LC_CTYPE, ".874") == NULL)
		puts("Thai locale not available");
	show();
	return 0;
}

Multibyte Encodings

All of the encodings we’ve discussed up to now have been single byte encodings, with one byte = one character. However, for the East Asian languages of China, Japan, and Korea (often abbreviated CJK), one byte simply isn’t big enough to hold all of the possible characters. The only choice is to use multiple bytes to represent a single character. Since all of these encodings still use single bytes for ASCII, they are all variable-length encodings. For compatibility with ASCII-based systems, the characters are generally arranged into pages of 94×94 (2 bytes) or 94×94×94 (3 bytes) characters which can either be overlaid on the character codes 33 to 126 (for transfer across 7-bit communication channels, an idea that is only really used in practice for SMTP-based email) or the character codes 161 to 254 (for 8-bit encoding). The preferred encoding for these character sets on UNIX is a system called EUC (Extended Unix Code). The basic idea is that all non-ASCII letters are encoded as multi-byte sequences with all bytes in the 128 and higher code range. This means that we can suddenly start naming files in Japanese and Chinese without making any changes to the underlying OS, and in fact a lot of parsers and compilers are still going to work fine without making any specific changes to support multi-byte characters. We can still run an encoding-agnostic C compiler, since the multi-byte sequences only occur in comments and literals.

However, there are issues. We can’t just write char mychar = '字'; since this will appear to the compiler as 2 characters. We also have substring searching issues which mess with things like regular expressions. Consider the string “月月”. This consists of the 4 bytes 183 238 183 238. Let’s try searching for the character “詞”. We shouldn’t get a match, except that the encoding of “詞” is 238 183. For a lot of things, this isn’t going to be an issue, because the kinds of keywords we want to search for with reg exps in programs tend to be ASCII, but it does close the option. Finally, we have an issue that is virtually never addressed by any standard library, which is the problem of display width. Most CJK characters take up the same space as two ASCII characters when output to a fixed-width display such as a UNIX terminal, Windows command prompt, or fixed-width printer. For example, if we want to print a nicely formatted table of text data from a DB and we only allocate 20 characters worth of space for a large text column, there is nothing in the C standard library or in our database string functions that can extract the first 20 ASCII-characters wide worth of data. mblen() can tell us that a character uses 4 bytes of data, but there is no corresponding function to tell us if that character uses 2 ASCII character cells or only 1.
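The false match is easy to reproduce. Here is a minimal sketch using the byte values quoted above (183 238 for each “月”, and 238 183 for the character being searched for). The helper names naive_find() and euc_find() are my own, and euc_find() assumes a simplified EUC-style layout where every byte of 128 or above starts a two-byte character; real EUC variants also have three-byte sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise search: knows nothing about character boundaries. */
static const char *naive_find(const char *hay, const char *needle) {
    return strstr(hay, needle);
}

/* Boundary-aware search (hypothetical helper). Assumes a simplified
   EUC-style encoding: bytes below 128 are one-byte characters, and any
   other byte is the first byte of a two-byte character. */
static const char *euc_find(const char *hay, const char *needle) {
    size_t n;
    assert(hay != NULL && needle != NULL);
    n = strlen(needle);
    while (*hay != '\0') {
        if (strncmp(hay, needle, n) == 0)
            return hay;                  /* match starts on a boundary */
        /* Step over one whole character, not one byte. */
        if ((unsigned char)hay[0] >= 128 && hay[1] != '\0')
            hay += 2;
        else
            hay += 1;
    }
    return NULL;
}
```

With the haystack "\xB7\xEE\xB7\xEE" and the needle "\xEE\xB7", naive_find() reports a hit at byte offset 1, straddling the two characters, while euc_find() finds nothing.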

In the world of DOS and Windows 95, however, things are much worse. As an example, let’s look at Japanese. NEC licensed MS-DOS from Microsoft and modified it for its PC-98 machines to create a version of DOS which worked with the Japanese multibyte encoding called Shift JIS. (They needed to modify the source code to work with the specialized hardware for displaying Japanese characters, something I’ve covered before.) The reason for choosing Shift JIS is that it plays nice with terminal-style fixed-width font row/column displays. Single byte Shift JIS characters always take up a single character cell on the screen, and double-byte Shift JIS characters always take up two character cells. We can thus always represent an 80-column row using char[80] without messing about with variable-width buffers. However, in order to achieve this magic, two-byte sequences in Shift JIS limit the first byte to the range 128 to 255, but allow the second byte to extend into the ASCII range with codes from 64 to 126 + 128 to 252. (There is a gap at 127 for the ASCII DEL control code.) Probably the biggest compatibility-killer in this range is the code 92, the backslash. There are 42 different characters which have this code as the second byte. Why is this a problem? Well, try this innocuous example:

printf("十");

What happens when you try to compile this without rewriting your C compiler? You get an error. The C compiler sees something equivalent to:

printf("X\");

The second byte is interpreted as a backslash that escapes the closing quotation mark, and we have a string that is not closed. While we are going to get a compilation error in this example, for many other cases the code compiles and runs but with occasional garbling of text. If you try connecting a Shift JIS terminal up to a UNIX box, you’re going to have the same problem. If any file name or string contains this character, the shell is going to perform escape processing and mangle the string. We can’t even escape the problem character, because our escape will only apply to the first byte of the multi-byte character. The Chinese encodings used in Windows (Big5 in Taiwan and GBK in mainland China) function basically the same way as Shift JIS, using codes 64 to 126 + 128 to 254 for the second byte. Even today, descendants of these encodings are used in Japanese and Chinese versions of Windows 7 and 8 as well as in loads of industrial and embedded computing equipment.
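To make the difference concrete, here is a rough sketch of what an encoding-unaware tool sees versus what a Shift JIS-aware scanner sees. The function names are hypothetical. The lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the standard Shift JIS ones, and 0x8F 0x5C is the Shift JIS encoding of “十”.

```c
#include <assert.h>
#include <stddef.h>

/* First byte of a two-byte Shift JIS character: 0x81-0x9F or 0xE0-0xFC. */
static int sjis_is_lead(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count bytes equal to 0x5C, the way an encoding-unaware tool does. */
static size_t count_backslash_bytes(const unsigned char *s, size_t len) {
    size_t i, count = 0;
    assert(s != NULL);
    for (i = 0; i < len; i++)
        if (s[i] == 0x5C)
            count++;
    return count;
}

/* Count only genuine backslash characters by skipping the second
   byte of every two-byte character. */
static size_t count_backslash_chars(const unsigned char *s, size_t len) {
    size_t i = 0, count = 0;
    assert(s != NULL);
    while (i < len) {
        if (sjis_is_lead(s[i]) && i + 1 < len) {
            i += 2;                      /* whole two-byte character */
        } else {
            if (s[i] == 0x5C)
                count++;
            i += 1;
        }
    }
    return count;
}
```

For the two bytes of “十”, the byte-wise count is 1 and the character-aware count is 0. That phantom backslash is exactly what the compiler and the shell trip over.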

Prior to Unicode, the C and C++ standard libraries did not have any functions for handling CJK encodings. There is the mblen() function which tells us how many bytes a character takes, but we can’t do anything with a character once we extract it. We can’t use the <ctype.h> functions and we can’t even use the strstr() function in <string.h> or the string::find() function in <string> because of the possibility of false matches. Microsoft Visual C does provide an <mbstring.h> library that you can use, but you lose compatibility with UNIX. Your best option is to provide your own library.

Pre-Unicode Summary

At this point, I want to briefly summarize the encoding world before Unicode. Firstly, just about every application runs off the idea of a single encoding end-to-end. On UNIX, in particular, many protocols are encoding agnostic. They simply take encoded data and pass it directly through. FTP is a great example. It has no concept of encoding or any way of specifying encodings. Whatever raw character data it receives it sends on through untouched. It’s up to the sys admin to make sure everyone using it sticks to the same encoding. We can even have Windows boxes talking in Windows-specific encodings that can use a UNIX FTP server (with the specific exceptions of Chinese and Japanese), and in fact this is something that still occurs today.

On Windows 95/98/Me, things get messed up if we mix encodings, so we really have to stick strictly to the single encoding deal. For programmers, if you want your program to work on CJK systems, you have to use the non-standard <mbstring.h> library in MSVC or some equivalent third-party library.

Let’s have a look at a typical scenario where we have a dev on a Windows box using a UNIX server to host an app. We create some html files on a Windows box and give them international (Greek/Thai/whatever) names. When we upload the files to a UNIX box via FTP, the files retain their Windows encoded names. To access them via http, you again specify the Windows encoding of the names (using %xx notation for chars above 128) and the browser fetches the files. At no stage does it matter what language or encoding the owner of the UNIX box is using. The same applies to things like includes or local file access from script/cgi. Since the html/source files were edited on Windows, they get saved in that same Windows encoding. However, scripts run into problems if they perform any collation/upper/lowercase conversions, since the LOCALE environment setting from the UNIX box isn’t going to match the content of the files, so we have to set the LOCALE setting manually in our script. Similarly, we have to specify the file encoding in the <meta http-equiv="Content-Type" content="text/html; charset=xxx"> tag to override the default value of the http server. Finally, if we are going to use a database, we similarly have to make sure that we explicitly set the correct encoding (i.e. the Windows encoding) instead of using the default value from the UNIX box. Although I used Windows as an example, the client can be Mac/Linux/anything. We just need to make sure that the encoding is always set to the encoding of the developer client, not the encoding of the server.
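As a small illustration of the %xx notation mentioned above, the following sketch escapes every byte of 128 or above in a file name and passes everything else through. percent_encode_high() is a hypothetical helper, not part of any library, and it deliberately ignores the ASCII characters (spaces, %, and so on) that a real URL encoder would also have to escape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Escape every byte >= 128 as %XX and copy all other bytes unchanged.
   Returns the length written (excluding the NUL), or -1 if out is too small. */
static int percent_encode_high(const unsigned char *in, size_t len,
                               char *out, size_t out_size) {
    size_t i, o = 0;
    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    for (i = 0; i < len; i++) {
        size_t need = (in[i] >= 128) ? 3 : 1;
        if (o + need + 1 > out_size)     /* keep room for the NUL */
            return -1;
        if (in[i] >= 128)
            snprintf(out + o, 4, "%%%02X", (unsigned)in[i]);
        else
            out[o] = (char)in[i];
        o += need;
    }
    out[o] = '\0';
    return (int)o;
}
```

The file name with codes 68 196 from the earlier example comes out as “D%C4”, whichever language the server owner happens to use.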

At this point, things actually work pretty well. There are only 2 issues. One is the complete lack of support for CJK, and the other is the inability, for example, to mix Greek letters with Western European script.

Unicode

The very first thing we need to understand is that UTF-8 did not exist and was not envisioned when Unicode first came into use, and Unicode referred basically to what we now call UTF-16. The way Unicode was envisioned is best exemplified by the operation of Windows NT. Windows NT was the first major OS based on Unicode, and was coupled with NTFS, the first major filesystem to support Unicode. However, NT still had the concept of a default non-Unicode encoding, the same as DOS and Windows 95. This is necessary so that we can interchange floppy disks (which use the FAT filesystem), text files (which use the default encoding for whatever language of system we are on), programs, etc. with the non-Unicode Windows 95 operating systems. This non-Unicode encoding is called the ANSI code page on Windows. The Win32 API thus comes with two complete sets of APIs, the ANSI API which accepts string arguments of type char* encoded in the default non-Unicode encoding, and the Unicode (or wide char) API which accepts string arguments of type wchar_t*. Devs can now use the Unicode API and never have to worry about variable-length encodings ever again, or they can use the ANSI API for backward compatibility with old pre-Unicode libraries and Windows 95.

On UNIX, things were slightly different, and this is reflected in the C/C++ standard libraries. On UNIX, the expectation was that people would continue to use non-Unicode encodings in conjunction with the LOCALE environment variable, but programmers would be given the option of an automatic conversion to Unicode mode of file operation, that would allow them to code using Unicode while the underlying OS and file content would continue to use non-Unicode encodings. At this point, we need to understand the concept of “file orientation” that was introduced into C. When we open a file in C (using fopen()), the file does not have an orientation. If we call a non-Unicode file function such as fgets(), the file gets set to non-Unicode orientation, and we simply get passed the data from the file. However, if we call a Unicode file function such as fgetws(), the file is put into Unicode orientation. This does not mean that the file on the disk is treated as containing Unicode. Instead, the C standard library assumes that the file is encoded in the non-Unicode encoding as specified by LOCALE and performs implicit automatic conversion of the file content from that non-Unicode encoding into Unicode in the form of wchar_t*. This was supposed to solve the problems we saw above with regards to multi-byte encodings. We just search/replace all char* with wchar_t* and file calls with the corresponding “w” version, and suddenly all of our parsers, regular expression libraries, etc. will work fine with CJK encodings. It is this basic idea of how Unicode would work that permeates the C and C++ standard libraries.
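File orientation can be observed directly with the standard fwide() function from <wchar.h>. Passing 0 as the mode queries the current orientation without changing it. The wrapper below is just a sketch, and orientation_name() is a name I made up for it.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Report a stream's orientation without changing it.
   fwide(f, 0) returns > 0 for wide (Unicode) orientation,
   < 0 for byte orientation, and 0 if none has been set yet. */
static const char *orientation_name(FILE *f) {
    int o;
    assert(f != NULL);
    o = fwide(f, 0);
    if (o > 0)
        return "wide";
    if (o < 0)
        return "byte";
    return "unset";
}
```

A freshly opened stream reports “unset”. After the first fputws() or fgetws() it reports “wide”, and after the first fputs() or fgets() it reports “byte”. Once set, the orientation sticks until the stream is closed or reopened.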

On Windows NT, however, we have a problem. The C standard library still assumes that files are saved in a non-Unicode filesystem. It only offers automatic conversion and Unicode support for file content, but offers no support for Unicode filenames. There is no variant of the fopen() function which accepts a wchar_t* for the filename. On Windows NT with NTFS, we can easily name files not only with characters that do not exist in the user’s default ANSI codepage encoding, but also with characters that do not exist in any pre-Unicode encoding. The C standard library is fundamentally broken in this respect. If you use the Microsoft C compiler, there is a non-standard function _wfopen() which you can use, and this gives you the best hope of compatibility with UNIX. The C++ API is similarly broken, even as far along as C++11. fstream::open() and friends only accept char* for the filename. If you are using MS C++, there are non-standard overloads that accept wchar_t*.

Because of this underlying concept used by the C standard library, it offers zero support for reading or writing Unicode strings in file content. If you use the Unicode wfstream which accepts wchar_t* for arguments, it performs automatic conversion to and from the LOCALE encoding. You can read and write wchar_t* strings as binary data, but you lose access to things like fgets() and fprintf().

At this point, we again have two disruptive events that change things all over again. First, we might try to understand why wchar_t is 32 bits on UNIX but 16 bits on Windows. When Unicode was being developed, there were actually 2 competing standardization attempts. Unicode was working on a 16-bit universal encoding and ISO was working on a 32-bit universal encoding (called UCS). As we can see from the above, the UNIX vision of a universal encoding did not actually involve saving that encoding onto disks as content or as filenames, but merely as a useful tool for uniform handling of those pesky CJK encodings. They had thus put their eggs in the ISO basket and chose the 32-bit wchar_t. Windows needed an encoding to save to disk (and also the ISO standard was bogged down in politics and going nowhere), so they chose the 16-bit wchar_t and Unicode. Once NT was released with Unicode, it was clear that they had won, and ISO adopted Unicode as the basic plane of UCS. However, they wanted more characters than Unicode could fit, and so we ended up with a 32-bit Unicode standard with UTF-16 as a variable-length 16-bit encoding.

The second disruptive event was the development of UTF-8, yet another variable-length encoding. An interesting thing about UTF-8 is that it was strongly resisted by Japanese developers. They had spent years of tearing their hair out because the variable-length Japanese encodings had not been compatible with most of the libraries and software developed in Western countries, and they were concerned that another variable-length encoding would suffer the same problems. However, both UTF-8 and UTF-16 have a nice property that prevents many of these problems. That is, the first char (UTF-8) or wchar_t (UTF-16) of any character can never appear in the second or subsequent char/wchar_t in any other character. This solves the problem of false-matches that the CJK encodings suffered from in regular expressions and strstr()-style string searching.
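For UTF-8, this property comes down to a single bit test: every second or subsequent byte of a character has the bit pattern 10xxxxxx, and no first byte does. A minimal sketch follows, with helper names of my own choosing, assuming the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Second and subsequent bytes of a UTF-8 character look like 10xxxxxx.
   First bytes (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) never do. */
static int utf8_is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Count characters by counting the bytes that start one.
   Assumes s holds valid UTF-8. */
static size_t utf8_count_chars(const unsigned char *s, size_t len) {
    size_t i, count = 0;
    assert(s != NULL);
    for (i = 0; i < len; i++)
        if (!utf8_is_continuation(s[i]))
            count++;
    return count;
}

/* From an arbitrary byte offset, back up to the start of the character
   containing it. This is what lets a search resynchronize. */
static size_t utf8_char_start(const unsigned char *s, size_t pos) {
    assert(s != NULL);
    while (pos > 0 && utf8_is_continuation(s[pos]))
        pos--;
    return pos;
}
```

Because a search pattern always begins with a first byte, a byte-wise strstr() can only ever match at a character boundary, which is why the old false-match problem disappears.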

Once UTF-8 was released, UNIX had a clear path towards total Unicode compatibility. Simply adopt UTF-8 as your standard encoding and all your problems go away. While some versions of UNIX had backward compatibility issues to consider, Linux did not, and UTF-8 is the basic standard encoding used in Linux and its derivatives. Since UTF-8 can be encoded in a char*, it even works fine with the broken C standard libraries. UTF-8 plugs into the C standard library on what was meant to be the non-Unicode side since it is stored in char* and specified as an encoding using LOCALE. However, Linux/Windows interoperation is worse than ever.

Basic Programming With Unicode

On Linux, life is easy. Use UTF-8 for everything. In C/C++, if you want to avoid dealing with the variable-length nature of UTF-8, use the Unicode oriented file functions/classes which will perform automatic conversion between UTF-8 (char*) and UTF-32 (wchar_t*). Everything works well because UTF-8 is able to be manipulated through the functions that were intended for non-Unicode encodings. The C/C++ standard library makes it very difficult to use UTF-16 or UTF-32 directly in files, so using UTF-8 for encoding text files makes a lot of sense.

On Windows, things are harder. Although the non-Unicode Windows 95/98/Me are long dead, their legacy lives on in the form of ANSI code pages and ANSI functions. You should basically treat the ANSI code page/ANSI API as deprecated, only existing for backward compatibility. Newer APIs like the .NET Framework are all based on the Unicode APIs. The big problem in Windows is the C/C++ standard libraries. If you want to avoid using ANSI codepages, you either have to avoid all of the standard library file functions, or else you have to use MSVC for Windows compilation and use the Microsoft-specific extensions.

Windows also has a problem when it comes to UTF-8 encoded text files. There is no really reliable way to distinguish UTF-8 from other encodings. While specific applications can designate UTF-8 as their standard, a general purpose editor like Notepad is going to struggle. Windows thus makes use of the BOM code (U+FEFF) at the start of Unicode (non-ANSI) files. If you are reading text files in Windows, you should check for the existence of the BOM (which will tell you that the file is UTF-8 (byte sequence 0xef, 0xbb, 0xbf) or little-endian UTF-16 (byte sequence 0xff, 0xfe)), and if you are writing files in a non-ANSI encoding, you should prepend them with the BOM so that they will open correctly in generic text editors. However, a lot of UNIX apps will fall apart if they see the BOM, so you’re kind of damned if you do, damned if you don’t.
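The BOM check itself is only a few comparisons on the first bytes of the file. Here is a sketch; the enum and function names are hypothetical, and I have added the big-endian UTF-16 BOM (0xfe, 0xff) for completeness even though you will rarely see it on Windows. Note that the UTF-32 little-endian BOM also starts with 0xff, 0xfe, and this sketch makes no attempt to tell the two apart.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Inspect the first bytes of a file's content for a byte order mark.
   buf holds the start of the file and len is how many bytes were read. */
static bom_kind detect_bom(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;                     /* fall back to the ANSI code page */
}
```

A reader would typically fread() the first three bytes, call detect_bom(), and then skip past the BOM before handing the rest of the content to the parser.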

File I/O using the C standard libraries is not possible on Windows unless you restrict yourself to filenames in the ASCII range. Not a problem for outputting log files, but a big problem if you are letting users choose their own filenames. For console apps, you cannot read Unicode arguments using the standard int main(int argc, char* argv[]) function. The only alternative is to use wmain(int argc, wchar_t* argv[]). Although this is Microsoft only, you have to use it. Let’s take a look at an example. If you are on a Windows box with NTFS, you can easily open up Explorer, create a random text file somewhere, and change the name to “Dฤ.txt”. That second character is a Thai character, and unless you are on a Thai version of Windows, it is not accessible via the C/C++ standard libraries. The following code will not work:

#include <fstream>

int main(int argc, char* argv[]) {
	std::wfstream reader;
	reader.open(argv[1], std::ios::in);
	...
}

If we try passing “Dฤ.txt” to our program as the first argument, it will receive the string “D?.txt”. Changing the locale with setlocale() does not help, since it does not change the encoding that is passed to our program, it only changes how our program interprets the data it receives. We have to use Microsoft-only wmain and open:

#include <fstream>

// wmain is a Microsoft extension: arguments arrive as UTF-16 wchar_t strings
int wmain(int argc, wchar_t* argv[]) {
	std::wfstream reader;
	reader.open(argv[1], std::ios::in); // Microsoft-only overload
	...
}

Implicit Conversion

Let’s say you have some kind of ANSI driver on Windows, say an ODBC driver. You might be tempted to think that you can simply stuff UTF-8 strings into char*, and as long as we configure our ODBC driver to talk in UTF-8, we’ve created a path for using UTF-8 in Windows. Worse, if you actually try doing this on a Western European Windows (Windows-1252 encoding), it is going to work. However, this will fall apart on many other versions of Windows. The key here is that Windows uses Unicode internally, so your char* string will get converted to wchar_t* as it passes through Windows, then get converted back to char* when it finally gets passed to the ODBC driver. Because the Windows-1252 encoding has clearly defined char <-> wchar_t mappings for every character in the 128 to 255 code range, all of our UTF-8 characters will survive the conversion/back-conversion process unscathed.

As an example, let’s say we have the UTF-8 encoded value of “Ê”. This is the byte pair 195 138 (0xC3 0x8A). What Windows sees is two valid Windows-1252 characters, “ÃŠ”, which it converts to Unicode and then back to Windows-1252 as “ÃŠ” again. If the ODBC driver views this string as UTF-8, it sees “Ê” and everything works.
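
The round trip can be sketched in portable C++ to show why every byte survives on a Windows-1252 system. The function names are hypothetical stand-ins for the conversions Windows performs internally, not real APIs. The table assumes that the five bytes Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control code points of the same value.

```cpp
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. Bytes 0x00..0x7F and
// 0xA0..0xFF map to the code point with the same value. Assumption: the
// five unassigned bytes map to the C1 control code point of equal value.
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Models the "ANSI to Unicode" step: one UTF-16 code unit per byte.
std::u16string cp1252_to_utf16(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back((b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80]
                                               : static_cast<char16_t>(b));
    }
    return out;
}

// Models the "Unicode to ANSI" step. Code units that have no
// Windows-1252 byte become '?'.
std::string utf16_to_cp1252(const std::u16string& units) {
    std::string out;
    for (char16_t u : units) {
        char byte = '?';
        if (u < 0x80 || (u >= 0xA0 && u <= 0xFF)) {
            byte = static_cast<char>(u);
        } else {
            for (int i = 0; i < 32; ++i) {
                if (kCp1252High[i] == u) byte = static_cast<char>(0x80 + i);
            }
        }
        out.push_back(byte);
    }
    return out;
}
```

Feeding the two bytes 0xC3 0x8A through cp1252_to_utf16 gives the code units U+00C3 and U+0160 (“Ã” and “Š”), and utf16_to_cp1252 turns them back into the same two bytes. Under the stated assumption the same holds for all 256 byte values, which is exactly why the problem goes unnoticed on Western European systems.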

Now, take the same example but on Japanese Windows. In Shift JIS, 195 is the single-byte character “ﾃ”. However, 138 does not map to a character on its own. It is the first byte of a two-byte character, and here nothing follows it. The ANSI to Unicode conversion is going to drop this byte because it does not form a valid character, and the conversion back to ANSI will leave us with the single-byte string 195. The data that ends up in our ODBC driver is now broken.
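
A rough way to spot byte strings that will hit this problem is to check whether every Shift JIS lead byte has a trail byte after it. The sketch below is a structural check only, using the usual Shift JIS lead byte ranges (0x81 to 0x9F and 0xE0 to 0xFC). The function names are hypothetical, and it does not verify that each pair is actually assigned to a character, so a string that passes can still be damaged by the real conversion.

```cpp
#include <cstddef>
#include <string>

// True for bytes that begin a two-byte Shift JIS character.
bool sjis_is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// True for bytes that may follow a lead byte.
bool sjis_is_trail_byte(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Structural check only: reports whether every lead byte is followed by
// a plausible trail byte. It does not check that each pair is assigned.
bool sjis_pairs_complete(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (!sjis_is_lead_byte(static_cast<unsigned char>(bytes[i]))) {
            continue;  // single-byte character
        }
        if (i + 1 >= bytes.size() ||
            !sjis_is_trail_byte(static_cast<unsigned char>(bytes[i + 1]))) {
            return false;  // dangling or malformed lead byte
        }
        ++i;  // skip the trail byte
    }
    return true;
}
```

For the UTF-8 bytes of “Ê”, sjis_is_lead_byte(0xC3) is false and sjis_is_lead_byte(0x8A) is true, so sjis_pairs_complete("\xC3\x8A") returns false: the string ends on a lead byte with nothing after it.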

This pattern is one that I have seen a lot. The big problem is that it works fine on Western European and US (i.e. Windows-1252) code page Windows, and it works fine with every single Unicode character. However, when the dev ships their client off to a Japanese customer, Chinese customer, or Russian customer, they start getting random errors because certain byte sequences simply cannot survive the trip through the implicit conversion.

Dealing With Databases

As I mentioned above, using an ANSI ODBC driver on Windows is asking for trouble. This is something that used to be a big problem for MySQL, in particular. However, there is a work-around that you can use. If you want to pass UTF-8 or any other encoding to the database without Windows performing any automatic conversions, you can use hexadecimal literals. For example, insert into MyTable (Col1, Col2) values (0xE697A5E69CACE8AA9E, 0xE69687E5AD97); will insert a row with UTF-8 values “日本語” and “文字” for Col1 and Col2. The hex digits following the 0x need to be the bytes of the string as encoded in the encoding of the database column (not the encoding of Windows and not the encoding in the set names xxx MySQL command).
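
Building such a literal from a byte string is mechanical. Here is a minimal sketch, assuming the string is already encoded in the column’s encoding (UTF-8 in this example). to_hex_literal is a hypothetical helper name, not part of any database client library.

```cpp
#include <string>

// Hypothetical helper: formats raw bytes (already encoded in the column's
// encoding, UTF-8 here) as a 0x... hexadecimal literal for SQL.
// Note: an empty input produces just "0x"; handle that case separately
// if it can occur.
std::string to_hex_literal(const std::string& bytes) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string literal = "0x";
    for (unsigned char b : bytes) {
        literal.push_back(kDigits[b >> 4]);    // high nibble
        literal.push_back(kDigits[b & 0x0F]);  // low nibble
    }
    return literal;
}
```

Passing the nine UTF-8 bytes of “日本語” produces 0xE697A5E69CACE8AA9E, the first value in the insert statement above. Because the result contains only ASCII characters, it passes through the ANSI to Unicode conversion untouched.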

Unicode-Based Languages

One of the things you sometimes see on programming forums is people asking for Unicode-based languages. Why can’t the language just treat all strings as Unicode? After all, even on Windows these days it’s quite easy to save files in UTF-8, and if UTF-8 is part of the spec of our language, then any IDE produced should just default to UTF-8 and everything is solved, right? Well, not quite. One area where this remains an issue is writing code on Windows and uploading via FTP to run on a UNIX box, the kind of thing you might do as a web developer. Let’s say a client provides us with an include file called “café.inc”, which we save in our NTFS filesystem. In a code file somewhere we write include "café.inc". We run this for testing on our local Windows box and everything works, so we then upload it to our Linux test server. If we used a tool like FTP to upload the file, chances are it’s not going to work. FTP is not aware of encodings, and it’s not going to translate the filename into UTF-8, so the name that arrives on the server no longer matches the UTF-8 bytes in our include statement. What we need is an FTP client that can transform filenames into UTF-8. Surprisingly, a lot of FTP/SFTP/FTPS clients on Windows do not support this (including the FTP client that is included with Windows!). One client that does support it is FileZilla, but you have to go into the advanced properties and select Force UTF-8.
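
The mismatch is easy to see at the byte level. The sketch below assumes a Windows-1252 system whose FTP client sends the ANSI form of the filename, and it only handles bytes that are ASCII or in the range 0xA0 to 0xFF, where the byte value equals the Unicode code point. latin1_to_utf8 is a hypothetical helper, not something an FTP client exposes.

```cpp
#include <string>

// Hypothetical helper: re-encodes a byte string to UTF-8, assuming every
// byte is ASCII or lies in 0xA0..0xFF, where Windows-1252 and Latin-1
// agree and the byte value equals the Unicode code point.
std::string latin1_to_utf8(const std::string& bytes) {
    std::string utf8;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            utf8.push_back(static_cast<char>(b));  // ASCII is unchanged
        } else {
            // Two-byte UTF-8 sequence: 110xxxxx 10xxxxxx
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return utf8;
}
```

In Windows-1252 the “é” in “café.inc” is the single byte 0xE9, while in UTF-8 it is the two bytes 0xC3 0xA9. An upload that copies the ANSI bytes verbatim therefore creates a file whose name is one byte shorter than the name the UTF-8 source file asks for.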

Probably the best solution to overcome problems with encodings for programmers is to simply not include any non-ASCII letters in your source files or filenames. This solves every single problem. If you have text data (or localization data), stick it all in a database with a front-end such as HTML where you have total control over the encoding. The funny thing is that this takes us full-circle back to the 1970s when ASCII was the only game in town. Technology really does have a hard time escaping its roots.

At this stage, I feel like I’ve rambled on forever. This was supposed to be a short article, and it’s grown into a monster. I hope I have covered everything, but would love comments if I’ve made errors or neglected anything. As a topic for you to think about, what would you do if you called the iswdigit() function on a character code outside of the ASCII digit range (48 to 57) and got a true value? This is something I might write about another time.

10 Comments

Spud said,

April 26, 2014 @ 11:11 am

This was a fabulous article. Really, it was intoxicating. I didn’t know about shift-jis single and double wide glyphs.

Why doesn’t Windows have a UTF-8 “ansi” codepage?
And why is Python doing this surrogate char crap on Linux, and how many bugs / security bugs will that bring? [1] We’ll see how it works… perhaps it’s finally ‘unicode everywhere’ that Almost Works (TM)

[1] http://legacy.python.org/dev/peps/pep-0383/

Gatunka said,

April 26, 2014 @ 1:21 pm

Thanks for the feedback. It’s great to get comments.

| Why doesn’t Windows have a UTF-8 “ansi” codepage?

This is the thing that really would fix everything. I don’t understand why Microsoft doesn’t at least offer it as an option. I can understand that they want to maintain backward compatibility, especially for CJK developers. With all the problems Japanese developers in particular had with Shift JIS, they are still kind of hesitant to change encodings. However, just about any web developer working on a Windows box would benefit from having a UTF-8 ansi codepage. No need for a BOM. No need to worry about file conversions. Uniformity of platform between UNIX and Windows. Perhaps the biggest downside is that devs could end up writing code that doesn’t work on Japanese and Chinese Windows again? I’m not sure of their reasoning.

_ said,

April 26, 2014 @ 2:20 pm

> Why doesn’t Windows have a UTF-8 “ansi” codepage?

It does. 65001.

Gatunka said,

April 26, 2014 @ 2:46 pm

The problem is that you can’t set it up as the system ansi codepage. It’s the same as the UTF-16 code pages in Windows. The values are only defined for use with the MultiByteToWideChar function. You cannot configure the system to use UTF-8 encoded file names with the ANSI version of OpenFile() or for storage in FAT32 file systems. It’s not an “ansi” code page. It’s just a code page.

The setting is in Control Panel -> Regional and Languages -> Administrative -> Change System Locale. UTF-8 does not appear in this list. (I’m on Japanese Windows, so the English above might be wrong)

Michael said,

April 26, 2014 @ 8:25 pm

Not sure what point you want to make, but the ranting about Windows is pretty much misguided in this area. Yes, you cannot set UTF-8 as a system ansi codepage (which would be fun), but that’s about it.

The C/C++ standard and POSIX had horrible ideas/support for Unicode. For example the grand idea to define LOCALE and the LC_* as environment variables which are process wide. So you cannot have different locales in different threads of your program. So that’s just fundamentally broken in a threaded world.

Or the silly idea to not define the encoding of filenames (in fact stuff like NFS and some filesystems already limit it to a specific encoding). Basically it’s impossible to do a correct GUI file selection dialog on Unix when you have files with mixed encoding in one directory (which is allowed), not to mention crap like escape sequences and control characters which have created more than one security hole and nameless problems for shell scripts.

So claiming to say that unicode handling with the standard c-library works better for the strange system it was designed for, is kind of obvious. If it didn’t work even there, it would be totally broken. And your claim that wchar_t is always 32-bits on Unix isn’t true either, it may be ANY size, even 8-bits, so it is generally unusable on Unix and usually a different typedef is used to handle portability.

For example have a look at this listing about the terrible state of Unicode support in the C++ standard library: https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11

So in this case the Windows approach of ‘fuck the std library, it’s crap’ is actually justified and the native Win32 Unicode stuff just works (even with threads and lots of other strangeness around).

Daniël van Eeden said,

April 26, 2014 @ 8:36 pm

In at least one of the examples the includes are not rendered as expected. It looks like the browser interprets them as HTML tags.

Great article 🙂

Gatunka said,

April 26, 2014 @ 9:12 pm

Thanks, I forgot to escape the angle brackets. Updated now.

don't feed the trolls said,

April 27, 2014 @ 6:25 am

@Spud: If you are talking about this security bug: http://blog.omega-prime.co.uk/?p=107 then it is debunked here https://mail.python.org/pipermail/python-dev/2011-March/110224.html :

> “Yes, if you decode two byte strings from two different encodings, you get different unicode strings. It’s not related to surrogateescape (PEP 383).”

e.g., `'\u4f60\u597d'.encode('big5').decode('latin1')` won’t produce the same Unicode string `'\u4f60\u597d'` back. And it doesn’t contain any surrogate character, i.e., it has nothing to do with PEP 383.

On systems such as Linux that may have almost arbitrary byte sequences in filenames:

>>> import os
>>> os.mkdir(bytearray(range(1, 0x100)).replace(b'/', b''))

compare paths as bytestrings (os.fsencode() will encode a system Unicode filename back into bytes) on these systems.

There may be other security issues, some of them are discussed in the PEP 383 itself http://legacy.python.org/dev/peps/pep-0383/

Memnarch said,

May 6, 2014 @ 2:13 am

Well, Delphi treats all characters within the source files as Unicode characters. This allows funny Left/Right-Overrides in identifiers of working/compiling code or hearts, but it really does everything in Unicode.

So yeah, Delphi is (since D2009) a truly Unicode-based language.

Andreas said,

May 6, 2014 @ 5:18 pm

> Why doesn’t Windows have a UTF-8 “ansi” codepage?

Well, the answer is most likely that they simply aren’t doing their job.
Creating useful APIs nowadays just isn’t something on the agenda of the company anymore.

I think as far as improvements in technology are concerned MS is just dead – Their core business used to be selling an implementation of the (Win32-)API. Now it’s just all about milking money from consumers.

Just take a look at professional rugged mobile hardware. This market is pretty much stuck in ancient Versions of WinCE. No useful improvements from MS for years and years – just tons of consumer oriented crapware – completely unusable in an industrial environment. A move away towards Android there will be hard, but most likely inevitable…
