Bitwise explanation of UTF-8 and UTF-16

26 January 2002

Unicode.org is not very clear about the encoding of Unicode scalar values in UTF-8 or UTF-16. They refer to a corrigendum to a document that itself doesn't seem to be online. The corrigendum contains a lot of bureaucratic formalities, which do not seem very relevant to what I wanted to know: how does UTF-8 work?

Therefore I copied (without permission) the relevant part from https://www.unicode.org/versions/corrigendum1.html

Table 3.1. UTF-8 bit distribution
Scalar value	UTF-16	1st byte	2nd byte	3rd byte	4th byte
`00000000 0xxxxxxx`	`00000000 0xxxxxxx`	`0xxxxxxx`
`00000yyy yyxxxxxx`	`00000yyy yyxxxxxx`	`110yyyyy`	`10xxxxxx`
`zzzzyyyy yyxxxxxx`	`zzzzyyyy yyxxxxxx`	`1110zzzz`	`10yyyyyy`	`10xxxxxx`
`000uuuuu zzzzyyyy yyxxxxxx`	`110110ww wwzzzzyy 110111yy yyxxxxxx`	`11110uuu`	`10uuzzzz`	`10yyyyyy`	`10xxxxxx`

Where uuuuu = wwww + 1 (to account for addition of 10000₁₆)
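The rows of the table translate almost directly into shifts and masks. The following is a minimal sketch in Python, not taken from the corrigendum; the function names `utf8_encode` and `utf16_encode` are my own, and the range checks (rejecting surrogates and values above 10FFFF₁₆) are an assumption about what counts as a scalar value.

```python
def _check_scalar(scalar):
    # Assumption: a scalar value is 0..10FFFF hex, excluding the surrogate range.
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")


def utf8_encode(scalar):
    """Encode one scalar value as UTF-8, one branch per table row."""
    _check_scalar(scalar)
    if scalar < 0x80:        # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def utf16_encode(scalar):
    """Return the list of 16-bit code units for one scalar value."""
    _check_scalar(scalar)
    if scalar < 0x10000:     # a single 16-bit unit, value unchanged
        return [scalar]
    uuuuu = scalar >> 16     # the five top bits of the scalar
    wwww = uuuuu - 1         # uuuuu = wwww + 1
    high = 0xD800 | (wwww << 6) | ((scalar >> 10) & 0x3F)  # 110110ww wwzzzzyy
    low = 0xDC00 | (scalar & 0x3FF)                        # 110111yy yyxxxxxx
    return [high, low]


print(utf8_encode(0x20AC).hex())                  # e282ac
print([hex(u) for u in utf16_encode(0x1F600)])    # ['0xd83d', '0xde00']
```

The results can be compared with Python's own codecs, for instance `chr(0x20AC).encode("utf-8")`.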

For UTF-8, the table could be extended to 26-, 31-, 36- and 42-bit scalars, opening the possibility of encoding the dazzling number of over 4 trillion (exactly 2⁴² = 4 398 046 511 104) different characters. That's US trillions, or European billions. In other words, more than 4 million million characters. Even if in the course of history thousands of new languages like Chinese, Korean and Japanese were to develop, each with its own ideographic script, there would still be plenty of room.

Scalar value	1st byte	2nd byte	3rd byte	4th byte	5th byte	6th byte	7th byte	8th byte
`000000tt uuuuuuzz zzzzyyyy yyxxxxxx`	`111110tt`	`10uuuuuu`	`10zzzzzz`	`10yyyyyy`	`10xxxxxx`
`0stttttt uuuuuuzz zzzzyyyy yyxxxxxx`	`1111110s`	`10tttttt`	`10uuuuuu`	`10zzzzzz`	`10yyyyyy`	`10xxxxxx`
`0000ssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx`	`11111110`	`10ssssss`	`10tttttt`	`10uuuuuu`	`10zzzzzz`	`10yyyyyy`	`10xxxxxx`
`000000rr rrrrssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx`	`11111111`	`10rrrrrr`	`10ssssss`	`10tttttt`	`10uuuuuu`	`10zzzzzz`	`10yyyyyy`	`10xxxxxx`
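The extended rows follow one pattern: an n-byte sequence starts with n one-bits (followed by a zero bit while there is still room for one), and every further byte is `10` plus six payload bits. Below is a rough Python sketch of that pattern. It is purely hypothetical: the name `extended_utf8_encode` is made up for illustration, and sequences longer than four bytes are not valid UTF-8 as currently defined for Unicode.

```python
def payload_bits(n):
    """Payload bits available in an n-byte sequence (n from 2 to 8)."""
    lead_bits = max(0, 7 - n)          # bits left in the first byte
    return lead_bits + 6 * (n - 1)     # plus 6 bits per following byte


def extended_utf8_encode(value):
    """Encode a value of up to 42 bits with the shortest row that fits."""
    if value < 0 or value >= 1 << 42:
        raise ValueError("value does not fit in 42 bits")
    if value < 0x80:                   # 0xxxxxxx
        return bytes([value])
    n = 2
    while value >= 1 << payload_bits(n):
        n += 1                         # stops at n = 8 at the latest
    # First byte: n one-bits, then the top payload bits (none for n = 7, 8).
    prefix = (0xFF << (8 - n)) & 0xFF
    out = [prefix | (value >> (6 * (n - 1)))]
    # Following bytes: 10xxxxxx, most significant group first.
    for shift in range(6 * (n - 2), -1, -6):
        out.append(0x80 | ((value >> shift) & 0x3F))
    return bytes(out)


print([payload_bits(n) for n in range(2, 9)])     # [11, 16, 21, 26, 31, 36, 42]
print(2 ** 42)                                    # 4398046511104
print(extended_utf8_encode(0x200000).hex())       # f888808080
print(extended_utf8_encode((1 << 42) - 1).hex())  # ffbfbfbfbfbfbfbf
```

For values up to 21 bits the output is the same as that of the four rows of Table 3.1; the sketch does not reject surrogates.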

To me, as a sometimes rather bit-oriented person, this much is clear enough: it tells me all I wanted to know about UTF-8 encoding.

See also: Alan Wood, and Unicode Transformation Formats: UTF-8 & Co by Roman Czyborra, which also covers SCSU (Standard Compression Scheme for Unicode).

And here's the RFC that defines UTF-8. It says "The octet values FE and FF never appear", which means that my 36- and 42-bit scalars above cannot be encoded, because their first bytes would be `11111110` (FE) and `11111111` (FF). Not a big problem.
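That FE and FF are exactly the bytes left over can be checked against the byte patterns in the tables above, taking the rows up to six bytes (the 31-bit row). A small Python sketch, where the list `PATTERNS` is simply my transcription of those rows as (mask, value) pairs:

```python
# (mask, value) pairs: a byte b matches a pattern when (b & mask) == value.
PATTERNS = [
    (0x80, 0x00),  # 0xxxxxxx  single byte
    (0xC0, 0x80),  # 10xxxxxx  following byte
    (0xE0, 0xC0),  # 110xxxxx  first of 2
    (0xF0, 0xE0),  # 1110xxxx  first of 3
    (0xF8, 0xF0),  # 11110xxx  first of 4
    (0xFC, 0xF8),  # 111110xx  first of 5
    (0xFE, 0xFC),  # 1111110x  first of 6
]

# Collect every byte value that fits none of the patterns.
unused = [b for b in range(256)
          if not any((b & mask) == value for mask, value in PATTERNS)]

print([hex(b) for b in unused])   # ['0xfe', '0xff']
```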
See also UTF-8.
And this tells us how Ken Thompson invented UTF-8, cheered on by Rob Pike.