Character Codes and Encodings

Introduction

We're going to explain a few things about character codes and encodings here. Even though Unicode and UTF-8 increasingly dominate these days, beginning developers still have a hard time with these concepts.

Links:
https://www.unicode.org
https://en.wikipedia.org/wiki/byte_order_mark
https://en.wikipedia.org/wiki/ascii
https://en.wikipedia.org/wiki/C0_and_C1_control_codes

The Basics

First of all, some terminology:

  • a character repertoire is a set of distinct characters
  • a character code is a one-to-one mapping between a set of non-negative integers and the set of characters in a character repertoire
  • a code space is the set of integer elements in a character code
  • a code point is one of the integer elements in a character code
  • a character encoding is a one-to-one mapping between the code points in a character code and a sequence of one or more bytes/octets

Note that an octet is an unambiguous term for a group of 8 bits. The term byte can be ambiguous - in some cases it refers to a group of 8 bits and in others it refers to a storage location that is 8 bits wide.

What people most often confuse is character code and character encoding. But note that a character code (sometimes informally referred to as a "character set") is simply a definition - it has nothing to do with how characters (more precisely, code points) in the code are represented in the memory or disk storage of a computer, which is where a character encoding comes into play.

The confusion probably comes from the fact that the encodings for the earlier widely-used 128-character ASCII code and the 256-character ISO-8859-n codes are such that each code point is mapped to (encoded as) a one-byte sequence in which the numeric (binary) value of the byte is equal to the value of the code point. Such a so-called "trivial" encoding was natural and sufficient for these character codes of 256 or fewer characters, and there was no need for any other encodings. Thus "ASCII", for example, indicated both a code and an encoding.
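
Python makes this easy to verify; a small sketch showing that for a trivial encoding like ASCII, the encoded byte value equals the code point:

```python
# For a "trivial" encoding such as ASCII, each code point is encoded as a
# single byte whose numeric value equals the code point itself.
for ch in ("A", "~"):
    code_point = ord(ch)              # the character's code point
    encoded = ch.encode("ascii")      # its one-byte ASCII encoding
    assert len(encoded) == 1
    assert encoded[0] == code_point   # byte value == code point
    print(f"{ch!r}: code point {code_point} -> byte {encoded[0]}")
```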

While we're on the topic of earlier character codes:

  • code pages are non-standard IBM/Microsoft character codes and their trivial encodings
  • cp-437 aka US code page maps the PC-8 or IBM extended ASCII character repertoire used with MS-DOS on the original IBM PC
  • cp-850 aka Multilingual (Latin I) code page replaces certain box-drawing and other characters in cp-437 with additional Western European characters; it is the default on Windows PCs sold in Europe
  • cp-1252 aka Windows-1252 or Windows Latin-1 contains the euro symbol (€), but cp-437 and cp-850 do not

The Unicode Character Code

Unicode is a character code that defines a codespace of 1,114,112 code points in the range 0h to 10ffffh. This can be thought of as [00]0000h to [10]ffffh, because the codespace is divided into 17 planes numbered 00h to 10h of 65536 code points each. Currently only a few of the planes are in use in some way. Unicode is identical to the ISO-10646 Universal Character Set aka UCS (sometimes referred to as the Universal Coded Character Set).

Plane 00h aka the Basic Multilingual Plane aka the BMP contains Unicode code points 0-65535 i.e. 000000h-00ffffh. This plane contains characters for almost all modern languages as well as many special characters, and was designed to unify all previous character codes. Most of the code points in the BMP are allocated to East Asian language characters.

In fact, code points d800h-dfffh in the BMP are not mapped to characters, but are reserved for use in a particular encoding of Unicode known as UTF-16 (more later), leaving 1,114,112 - 2048 = 1,112,064 code points.

The higher planes are jokingly called the astral planes. Plane 01 contains code points 010000h to 01ffffh, plane 02 code points 020000h to 02ffffh, etc.

Characters in each plane are allocated to named blocks of related characters which are always a multiple of 16 in size. The first three blocks in the BMP each have 128 characters and are named Basic Latin, Latin-1 Supplement, and Latin Extended-A. These are followed by Latin Extended-B with 208 characters and IPA Extensions with 96 characters.

The Basic Latin block is identical to the ASCII character code. The Basic Latin and Latin-1 Supplement blocks together are identical to the ISO-8859-1 aka Latin-1 character code.
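
A quick sketch in Python confirming that the latin-1 encoding is trivial for code points 0-255:

```python
# Basic Latin + Latin-1 Supplement (Unicode code points 0-255) coincide
# with ISO-8859-1 (latin-1), so its encoding is again "trivial": one byte
# whose value equals the code point.
s = "é"                                # u+00e9, in the Latin-1 Supplement
assert ord(s) == 0xE9
assert s.encode("latin-1") == b"\xe9"  # byte value == code point
# Every value 0-255 round-trips through latin-1:
data = bytes(range(256))
assert data.decode("latin-1").encode("latin-1") == data
```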

Unicode also includes control characters. The C0 control range consists of code points 0h-1fh and 7fh in the Basic Latin block, and the C1 control range consists of code points 80h-9fh in the Latin-1 Supplement block.

Unicode characters are often referred to using the notation u+«code point in hex» with at least four hex digits e.g. u+20ac for the euro symbol. Unicode characters also have an official name, e.g. the name of A is LATIN CAPITAL LETTER A. Your system's "charmap" program is a good place to see some of this information.
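
Python's standard unicodedata module exposes these official names; a small sketch:

```python
import unicodedata

# Look up the official Unicode name of a character...
assert unicodedata.name("€") == "EURO SIGN"
assert unicodedata.name("A") == "LATIN CAPITAL LETTER A"
# ...and go back from name to character:
assert unicodedata.lookup("EURO SIGN") == "\u20ac"
print(f"u+{ord('€'):04x}", unicodedata.name("€"))
```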

Links (Unicode characters by class/category):
http://www.fileformat.info/info/unicode/category/index.htm
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Encodings of Unicode

There are a number of different encodings of Unicode. Approaches to encoding begin with the fact that 10ffffh can be represented in 21 bits i.e. 1 0000 1111 1111 1111 1111b (contrary to popular belief, 32 bits are not needed). A trivial fixed-length encoding would thus require at least 3 bytes per code point to encode the full Unicode character code.

Some encodings do not encode the full character code. In some cases, this is based on the view that the only character code is Unicode and earlier character codes are simply different encodings of Unicode. According to this view, ASCII and ISO-8859-n are trivial one-byte encodings that only encode a subset of Unicode (which of course is a superset of ASCII and ISO-8859-n when they are viewed as character codes).
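
A sketch in Python of what happens when a subset encoding meets a code point outside its range:

```python
# ASCII viewed as an encoding covers only code points 0h-7fh, so encoding
# a string containing € (u+20ac) fails with an error:
try:
    "h€llo".encode("ascii")
except UnicodeEncodeError as e:
    print(e)   # reports that u+20ac can't be encoded in ascii
```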

Fixed-Length Encodings

UTF-32 is a trivial 4-byte encoding of the full Unicode character code. UTF stands for Unicode Transformation Format.

UCS-4 is also a trivial 4-byte encoding of the full Unicode character code. In fact, today it is identical to UTF-32 and considered to be the canonical form for representation of characters in ISO 10646. Originally, the UCS was larger than Unicode and UCS-4 was a trivial encoding of it.

UCS-2 was a trivial 2-byte encoding of the BMP, i.e. the code points 0-65535 i.e. 000000h-00ffffh, for which only 16 bits are needed. It is now considered obsolete.

Variable-Length Encodings

Variable-length (non-trivial) encodings attempt to encode Unicode with less overhead than a fixed-length encoding.

UTF-16 is a variable-length encoding of Unicode that uses two or four bytes per code point depending on code point position. For code points in the BMP, it is a trivial 2-byte encoding identical to UCS-2.

UTF-16 encodes code points outside the BMP, i.e. in the range 10000h-10ffffh, as a 4-byte surrogate pair. It does this by first subtracting 10000h from the code point to "normalize" it to a number in the range 0h-fffffh, i.e. a 20-bit number. It then adds the high-order 10 bits to d800h to form the leading 2-byte surrogate, and the low-order 10 bits to dc00h to form the trailing 2-byte surrogate. The two 2-byte surrogates are thus in the non-overlapping ranges d800h-dbffh and dc00h-dfffh. This is the reason that Unicode reserves the code points d800h-dfffh in the BMP for UTF-16 encoding - if these code points were 2-byte trivially encoded like the rest of the BMP with UTF-16, there would be no way to distinguish them from 2-byte surrogates.
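
The surrogate-pair arithmetic above can be sketched directly in Python (using 😀, u+1f600 in plane 01, as the example character):

```python
# Encoding a supplementary-plane code point as a UTF-16 surrogate pair,
# following the steps described above.
cp = 0x1F600
normalized = cp - 0x10000              # 20-bit number in 0h-fffffh
lead  = 0xD800 + (normalized >> 10)    # high-order 10 bits
trail = 0xDC00 + (normalized & 0x3FF)  # low-order 10 bits
assert (lead, trail) == (0xD83D, 0xDE00)
# Matches Python's own UTF-16 encoder (big-endian, no BOM):
expected = lead.to_bytes(2, "big") + trail.to_bytes(2, "big")
assert "😀".encode("utf-16-be") == expected
```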

UTF-8 is a variable-length encoding of Unicode that uses one to four bytes per code point depending on code point position. A code point in the range 0h-7fh is encoded using one byte whose high-order bit is 0. UTF-8 is thus identical to ASCII for code points 0h-7fh, which is one of its advantages (ASCII bytes can simply be interpreted as UTF-8 bytes). Each code point above 7fh is encoded using a multi-byte sequence in which the leading byte has high-order "control" bits of 110, 1110 or 11110 in which the number of 1's indicates the number of bytes in the sequence. The trailing bytes always have high-order control bits of 10. All non-control bits are used to encode the code point.

UTF-8 is, in fact, the most popular way to encode Unicode due to its relationship to ASCII and its efficiency with latin characters. UTF-8's structure also means that there is no "endianness" and therefore no need for a byte order mark (BOM) character (more later). UTF-8 is also self-synchronizing, i.e. the start of a sequence can be detected in the middle of a stream without inspecting earlier bytes.

Encoding a code point in UTF-8:

   code point         required                  sequence
 from       to     number of bits   byte 1   byte 2   byte 3   byte 4
u+0000    u+007F         7          0xxxxxxx
u+0080    u+07FF        11          110xxxxx 10xxxxxx
u+0800    u+FFFF        16          1110xxxx 10xxxxxx 10xxxxxx
u+10000   u+10FFFF      21          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the € code point is 20ach or 0010 0000 1010 1100b which requires 16 bits; this requires a 3-byte UTF-8 sequence of 1110|0010 10|000010 10|101100 = e2h 82h ach
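
The same bit manipulation can be sketched in Python, matching the built-in UTF-8 encoder:

```python
# Hand-encoding u+20ac as a 3-byte UTF-8 sequence per the table above.
cp = 0x20AC                              # 0010 0000 1010 1100 (16 bits)
byte1 = 0b11100000 | (cp >> 12)          # 1110xxxx: high-order 4 bits
byte2 = 0b10000000 | ((cp >> 6) & 0x3F)  # 10xxxxxx: middle 6 bits
byte3 = 0b10000000 | (cp & 0x3F)         # 10xxxxxx: low-order 6 bits
assert bytes([byte1, byte2, byte3]) == b"\xe2\x82\xac"
assert "€".encode("utf-8") == b"\xe2\x82\xac"
```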

Decoding a UTF-8 sequence:

            sequence                      used          code point
byte 1   byte 2   byte 3   byte 4    number of bits   from       to      
0xxxxxxx                                   7         u+0000    u+007F   
110xxxxx 10xxxxxx                         11         u+0080    u+07FF   
1110xxxx 10xxxxxx 10xxxxxx                16         u+0800    u+FFFF   
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx       21         u+10000   u+10FFFF 
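
And the reverse, decoding by hand:

```python
# Decoding the 3-byte sequence e2h 82h ach back to a code point by
# stripping the control bits and concatenating the payload bits.
b1, b2, b3 = b"\xe2\x82\xac"
assert b1 >> 4 == 0b1110 and b2 >> 6 == 0b10 and b3 >> 6 == 0b10
cp = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
assert cp == 0x20AC
assert chr(cp) == "€"
```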

Endianness

Multi-byte encodings (in contrast to single-byte encodings like ASCII) must in some cases indicate endianness, i.e. the order of bytes within each byte sequence.

Little-endian (LE) means that low-order bytes precede high-order bytes as the bytes are read from lower/earlier to higher/later locations in memory, a file, or a communications stream (Intel CPUs, for example, use little-endian memory addressing). Big-endian (BE) is the reverse of little-endian.

The Unicode Byte Order Mark aka BOM character u+feff (or u+0000feff in 4-byte encodings) is sometimes used with these encodings to indicate the endianness of the bytes about to be read. ffh followed by feh (i.e. the low order byte first) indicates little-endian order. feh followed by ffh indicates big-endian order.
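
A sketch in Python showing the BOM at work (the standard codecs module exposes the BOM byte sequences as constants):

```python
import codecs

# The "utf-16" codec (no explicit endianness) reads the BOM to decide
# byte order; both sequences below decode to the same character.
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert b"\xff\xfe\xac\x20".decode("utf-16") == "€"  # little-endian
assert b"\xfe\xff\x20\xac".decode("utf-16") == "€"  # big-endian
```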

The Unicode Replacement Character

The Unicode code point u+fffd, displayed as a question mark inside a diamond (�), is the REPLACEMENT CHARACTER. It is used to represent an unknown character - for example, it is often displayed when a byte sequence is invalid for the encoding being used to decode it.
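
A sketch in Python: decoding invalid bytes with the "replace" error handler yields u+fffd:

```python
# ffh can never appear in valid UTF-8, so the "replace" error handler
# substitutes the replacement character u+fffd for it.
assert b"\xff".decode("utf-8", errors="replace") == "\ufffd"
assert b"h\xffllo".decode("utf-8", errors="replace") == "h\ufffdllo"
print(b"h\xffllo".decode("utf-8", errors="replace"))
```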

Unicode in Programming Languages

Some programming languages recognize Unicode encodings with regard to character strings (others are agnostic). Python 3, for example, stores each str object internally in a fixed-width form of 1, 2, or 4 bytes per code point, chosen per string based on its widest character (PEP 393, CPython 3.3+). These languages also typically support Unicode character escape sequences that look similar to the Unicode u+ notation. For example, Python 3 supports the forms \u«code point in 4 hex digits» and \U«code point in 8 hex digits» e.g. \u20ac. It also supports character names with \N{«character name»} e.g. \N{EURO SIGN} (case-insensitive). We can get a character's code point in decimal using the ord function.

>>> ord("€")
8364
>>> hex(ord("€"))
'0x20ac'
>>> 
>>> print("\u20ac")
€
>>> print("\U000020ac")
€
>>> print("\N{EURO SIGN}")
€

Be careful with terminology in Python 3, however. Python 3 uses the term encoding to mean converting an object of type str (a "string") to an object of type bytes (a "byte string") according to a particular encoding such as UTF-8 or UTF-32, and the term decoding to mean converting a byte string back to a string.

>>> "h€llo".encode("utf-8")
b'h\xe2\x82\xacllo'
>>> 
>>> b'h\xe2\x82\xacllo'.decode("utf-8")
'h€llo'
>>> 
>>> "h€llo".encode("utf-32-le")
b'h\x00\x00\x00\xac \x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'
>>> 
>>> b'h\x00\x00\x00\xac \x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00'.decode("utf-32-le")
'h€llo'

Note that to escape characters means to prefix them with a certain character in order to change their meaning. The purpose is either (1) to change special meaning to normal meaning (e.g. in rm my\ file in Bash, the escape character \ changes the special meaning of space as a Bash metacharacter to its normal meaning as a word separator); or (2) to change normal meaning to special meaning (e.g. in print("hello world\n") in Python, the escape character \ changes the normal meaning of n as an alphabetic character to its special meaning in Python of the linefeed character). An escape sequence is a sequence of characters beginning with an escape character. The word escape is used in the sense that the character "escapes" from its usual processing.
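
A small Python sketch of the difference between an escape sequence and an escaped (literal) backslash:

```python
# "\n" is an escape sequence for one character (linefeed, u+000a);
# "\\n" escapes the backslash itself, giving two ordinary characters.
assert len("\n") == 1 and ord("\n") == 0x0A
assert len("\\n") == 2
assert "\\n" == r"\n"   # raw strings disable backslash escapes
```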

And there we are. A few things about character codes and encodings.
