Unicode.org is not very clear about the encoding of Unicode scalar values in UTF-8 or UTF-16. They refer to a corrigendum to a document that itself doesn't seem to be online. The corrigendum contains a lot of bureaucratic formalities, which do not seem very relevant to what I wanted to know: how does UTF-8 work?
Therefore I copied (without permission) the relevant part from https://www.unicode.org/versions/corrigendum1.html
Scalar value | UTF-16 | 1st byte | 2nd byte | 3rd byte | 4th byte |
---|---|---|---|---|---|
00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | | | |
00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
Where uuuuu = wwww + 1 (to account for the addition of 10000₁₆, i.e. 0x10000)
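The table above translates almost line by line into shifts and masks. What follows is only a minimal sketch in Python, not anything taken from the corrigendum: the function names `utf8_encode` and `utf16_encode` are my own, and the second UTF-16 unit is assumed to have the usual low-surrogate form 110111yy yyxxxxxx.

```python
def _check_scalar(scalar: int) -> None:
    # Unicode scalar values are 0..0x10FFFF, excluding the surrogate range.
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (scalar,))


def utf8_encode(scalar: int) -> bytes:
    """Encode one scalar value, one branch per table row (hypothetical helper)."""
    _check_scalar(scalar)
    if scalar < 0x80:                       # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:                      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:                    # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    return bytes([0xF0 | (scalar >> 18),    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def utf16_encode(scalar: int) -> list:
    """Return the one or two 16-bit code units (hypothetical helper)."""
    _check_scalar(scalar)
    if scalar < 0x10000:
        return [scalar]
    uuuuu = scalar >> 16                    # top five bits of the scalar
    wwww = uuuuu - 1                        # because uuuuu = wwww + 1
    high = 0xD800 | (wwww << 6) | ((scalar >> 10) & 0x3F)   # 110110ww wwzzzzyy
    low = 0xDC00 | (scalar & 0x3FF)                         # 110111yy yyxxxxxx
    return [high, low]


print(utf8_encode(0x20AC).hex())                   # euro sign, a three-byte case
print([hex(u) for u in utf16_encode(0x1F600)])     # a surrogate pair
```

Subtracting 1 from uuuuu is the same as subtracting 0x10000 from the whole scalar, which is why wwww needs only four bits.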
For UTF-8, the table could be extended to 26-, 31-, 36- and 42-bit scalars, opening the possibility to encode the dazzling amount of over 4 trillion (exactly 2^42 = 4 398 046 511 104) different characters. That's US trillions, or European billions. In other words, more than 4 million million characters. Even if in the course of history thousands of new languages like Chinese, Korean and Japanese might develop, each with its own iconographic script, there would still be plenty of room.
Scalar value | 1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | 6th byte | 7th byte | 8th byte |
---|---|---|---|---|---|---|---|---|
000000tt uuuuuuzz zzzzyyyy yyxxxxxx | 111110tt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | | | |
0stttttt uuuuuuzz zzzzyyyy yyxxxxxx | 1111110s | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | | |
0000ssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx | 11111110 | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | |
000000rr rrrrssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx | 11111111 | 10rrrrrr | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx |
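The bit counts follow from the pattern: whatever the lead byte has left after its prefix, plus 6 bits per continuation byte. Here is a small sketch of this hypothetical extension in Python. The names `LEAD`, `payload_bits` and `encode_extended` are made up for illustration, and none of this is standard UTF-8 beyond the first four rows.

```python
# For an n-byte sequence: (lead byte prefix, payload bits left in the lead byte).
# Rows 5 to 8 follow the extended table; rows 7 and 8 leave no room in the lead byte.
LEAD = {
    1: (0x00, 7), 2: (0xC0, 5), 3: (0xE0, 4), 4: (0xF0, 3),
    5: (0xF8, 2), 6: (0xFC, 1), 7: (0xFE, 0), 8: (0xFF, 0),
}


def payload_bits(n: int) -> int:
    """Total scalar bits that fit in an n-byte sequence."""
    return LEAD[n][1] + 6 * (n - 1)


def encode_extended(value: int) -> bytes:
    """Encode with the shortest sequence of the extended table (illustration only)."""
    if value < 0:
        raise ValueError("negative value")
    for n in range(1, 9):
        if value < (1 << payload_bits(n)):
            prefix, _ = LEAD[n]
            lead = prefix | (value >> (6 * (n - 1)))
            # Continuation bytes, most significant 6-bit group first.
            tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
            return bytes([lead] + tail)
    raise ValueError("value needs more than 42 bits")


print([payload_bits(n) for n in range(1, 9)])   # 7, 11, 16, 21, 26, 31, 36, 42
print(2 ** payload_bits(8))                     # 4398046511104
print(encode_extended(2 ** 42 - 1).hex())       # ff followed by seven bf bytes
```

Unlike a real UTF-8 encoder, this sketch does not reject surrogates or values above 10FFFF₁₆, since the whole point is to play with the larger ranges.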
To me as a sometimes rather bit-oriented person, this much is clear enough; it tells me all I wanted to know about UTF-8 encoding.
See also: Alan Wood, and Unicode Transformation Formats: UTF-8 & Co by Roman Czyborra, which also covers SCSU (Standard Compression Scheme for Unicode).
And here's the RFC that defines UTF-8. It says "The octet values FE and FF never appear", which means that my 36- and 42-bit scalars above cannot be encoded. Not a big problem.
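To see what that sentence means for my two longest rows: their lead bytes are exactly FE and FF. A quick sketch in Python (the helper names `has_forbidden_octet` and `python_accepts` are my own) checks a byte string for those octets and asks Python's built-in UTF-8 decoder for a second opinion. Note that Python's decoder is stricter than the quoted sentence, since it also refuses anything beyond four bytes.

```python
FORBIDDEN = {0xFE, 0xFF}   # the two octets the RFC says never appear


def has_forbidden_octet(data: bytes) -> bool:
    """True if the byte string contains FE or FF anywhere."""
    return any(b in FORBIDDEN for b in data)


def python_accepts(data: bytes) -> bool:
    """True if Python's built-in decoder takes the bytes as valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One sample from each of the 36-bit and 42-bit rows: lead byte plus 6 or 7 continuation bytes.
seven = bytes([0xFE]) + bytes([0x80]) * 5 + bytes([0x81])
eight = bytes([0xFF]) + bytes([0x80]) * 7
euro = "\u20ac".encode("utf-8")   # an ordinary three-byte sequence

for sample in (seven, eight, euro):
    print(sample.hex(), has_forbidden_octet(sample), python_accepts(sample))
```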
See also UTF-8.
And this tells us how Ken Thompson invented UTF-8, cheered on by Rob Pike.