Bitwise explanation of UTF-8 and UTF-16

Unicode.org is not very clear about the encoding of Unicode scalar values in UTF-8 or UTF-16. They refer to a corrigendum to a document that itself doesn't seem to be online. The corrigendum contains a lot of bureaucratic formalities, which do not seem very relevant to what I wanted to know: how does UTF-8 work?

Therefore I copied (without permission) the relevant part from http://www.unicode.org/versions/corrigendum1.html


Table 3.1. UTF-8 bit distribution

Scalar value               | UTF-16                              | 1st byte | 2nd byte | 3rd byte | 4th byte
00000000 0xxxxxxx          | 00000000 0xxxxxxx                   | 0xxxxxxx |          |          |
00000yyy yyxxxxxx          | 00000yyy yyxxxxxx                   | 110yyyyy | 10xxxxxx |          |
zzzzyyyy yyxxxxxx          | zzzzyyyy yyxxxxxx                   | 1110zzzz | 10yyyyyy | 10xxxxxx |
000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx

Where uuuuu = wwww + 1 (to account for addition of 10000 hexadecimal)
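The rows of Table 3.1 translate almost directly into shifts and masks. Here is a minimal sketch in Python of how that might look. The function names utf8_encode and utf16_encode are my own and purely illustrative; they are not taken from the corrigendum.

```python
def _check_scalar(scalar):
    # Scalar values are 0..10FFFF, excluding the surrogate range D800..DFFF.
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")


def utf8_encode(scalar):
    """Encode one scalar value as UTF-8, one branch per row of Table 3.1."""
    _check_scalar(scalar)
    if scalar < 0x80:        # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def utf16_encode(scalar):
    """Return the list of 16-bit code units for one scalar value."""
    _check_scalar(scalar)
    if scalar < 0x10000:
        return [scalar]
    # Subtracting 0x10000 is the "wwww = uuuuu - 1" step; 20 bits remain.
    v = scalar - 0x10000
    return [0xD800 | (v >> 10),    # 110110ww wwzzzzyy
            0xDC00 | (v & 0x3FF)]  # 110111yy yyxxxxxx


print(utf8_encode(0x20AC).hex())                   # euro sign
print([hex(u) for u in utf16_encode(0x1F600)])     # a scalar above FFFF
```

For the euro sign (scalar 20AC) this yields the three bytes E2 82 AC, and for scalar 1F600 the surrogate pair D83D DE00.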

For UTF-8, the table could be extended to 26, 31, 36 and 42 bit scalars, which would make it possible to encode the dazzling number of over 4 trillion (exactly 2^42 = 4 398 046 511 104) different characters. That's US trillions, or European billions. In other words, more than 4 million million characters. Even if in the course of history thousands of new languages like Chinese, Korean and Japanese might develop, each with its own iconographic script, there would still be plenty of room.
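The bit counts 26, 31, 36 and 42 follow from simple arithmetic: every trailing byte carries 6 bits, and the lead byte carries whatever is left after its length prefix. A small Python sketch of that calculation (payload_bits is a name I made up for illustration):

```python
def payload_bits(n_bytes):
    """Number of scalar bits an n-byte sequence can carry in this scheme."""
    if n_bytes == 1:
        return 7                       # 0xxxxxxx
    # Bits left in the lead byte after its run of 1s (and the 0 that ends it,
    # if there is still room for one); the 7- and 8-byte lead bytes carry none.
    lead_bits = max(0, 7 - n_bytes)
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 9):
    bits = payload_bits(n)
    print(n, "bytes:", bits, "bits,", 2 ** bits, "values")
```

The last line printed corresponds to the 42-bit case, 2^42 = 4 398 046 511 104.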

Scalar value                                          | 1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | 6th byte | 7th byte | 8th byte
000000tt uuuuuuzz zzzzyyyy yyxxxxxx                   | 111110tt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx |          |          |
0stttttt uuuuuuzz zzzzyyyy yyxxxxxx                   | 1111110s | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx |          |
0000ssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx          | 11111110 | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx |
000000rr rrrrssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx | 11111111 | 10rrrrrr | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx
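To make the extended table concrete, here is a rough Python sketch that encodes values of up to 42 bits with the same pattern. This is only an illustration of my hypothetical extension, not valid UTF-8 for anything above the standard range; the names FORMS and extended_encode are invented for this example.

```python
# (maximum number of scalar bits, lead-byte prefix) for 1 to 8 bytes
FORMS = [(7, 0x00), (11, 0xC0), (16, 0xE0), (21, 0xF0),
         (26, 0xF8), (31, 0xFC), (36, 0xFE), (42, 0xFF)]


def extended_encode(value):
    """Encode a non-negative integer of at most 42 bits, shortest form first."""
    if value < 0:
        raise ValueError("value must be non-negative")
    for n, (bits, prefix) in enumerate(FORMS, start=1):
        if value < (1 << bits):
            # Lead byte: prefix plus the bits above the n - 1 trailing groups.
            lead = prefix | (value >> (6 * (n - 1)))
            # Trailing bytes: 10xxxxxx, most significant 6-bit group first.
            tail = [0x80 | ((value >> (6 * i)) & 0x3F)
                    for i in range(n - 2, -1, -1)]
            return bytes([lead] + tail)
    raise ValueError("value needs more than 42 bits")


print(extended_encode(2 ** 26 - 1).hex())   # largest 5-byte value
print(extended_encode(2 ** 42 - 1).hex())   # largest 8-byte value
```

For values up to 21 bits the sketch uses the same bit layout as Table 3.1, so for real scalar values it should agree with an ordinary UTF-8 encoder.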

To me, as a sometimes rather bit-oriented person, this much is clear enough: it tells me all I wanted to know about UTF-8 encoding.

See also: Alan Wood, and Unicode Transformation Formats: UTF-8 & Co by Roman Czyborra, which also covers SCSU (Standard Compression Scheme for Unicode).

And here's the RFC that defines UTF-8. It says "The octet values FE and FF never appear", which means that my 36 and 42 bit scalars above cannot be encoded. Not a big problem.
See also UTF-8.
And this tells us how Ken Thompson invented UTF-8, cheered on by Rob Pike.
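The RFC's remark about FE and FF is easy to check with Python's built-in UTF-8 codec. The sketch below is only a quick experiment; the helper names is_rejected and bytes_in_use are my own. It feeds the decoder sequences starting with the lead bytes my 7- and 8-byte forms would need, and collects every byte value the encoder produces over all scalar values.

```python
LEAD_7 = 0b11111110   # FE, lead byte of the hypothetical 7-byte form
LEAD_8 = 0b11111111   # FF, lead byte of the hypothetical 8-byte form


def is_rejected(data):
    """True if Python's UTF-8 decoder refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def bytes_in_use():
    """Set of all byte values that occur when encoding every scalar value."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:     # surrogates are not scalar values
            continue
        used.update(chr(cp).encode("utf-8"))
    return used


print(is_rejected(bytes([LEAD_7] + [0x80] * 6)))
print(is_rejected(bytes([LEAD_8] + [0x80] * 7)))

used = bytes_in_use()
print(0xFE in used, 0xFF in used)
```

Both sequences are refused by the decoder, and neither FE nor FF turns up in the encoder's output.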

