See below for details about UTF-8 encoding.
Unicode.org, in its charts section, offers useful links to the Unicode code charts. The only problem is that the code range of each subset is not listed there, so if you already have a rough or precise idea of the code you're looking for, their set of links is difficult to use. Therefore, I present similar links here, which do show the starting point of each code range.
See also this font tester.
0000 Basic Latin
0080 Latin-1 Supplement
0100 Latin Extended-A
0180 Latin Extended-B
0250 IPA Extensions
02B0 Spacing Modifier Letters
0300 Combining Diacritical Marks
0370 Greek and Coptic
0400 Cyrillic
0500 Cyrillic Supplement
0530 Armenian
0590 Hebrew
0600 Arabic
0700 Syriac
0780 Thaana
0900 Devanagari
0980 Bengali
0A00 Gurmukhi
0A80 Gujarati
0B00 Oriya
0B80 Tamil
0C00 Telugu
0C80 Kannada
0D00 Malayalam
0D80 Sinhala
0E00 Thai
0E80 Lao
0F00 Tibetan
1000 Myanmar
10A0 Georgian
1100 Hangul Jamo
1200 Ethiopic
13A0 Cherokee
1400 Unified Canadian Aboriginal Syllabics
1680 Ogham
16A0 Runic
1700 Tagalog
1720 Hanunoo
1740 Buhid
1760 Tagbanwa
1780 Khmer
1800 Mongolian
1900 Limbu
1950 Tai Le
19E0 Khmer Symbols
1D00 Phonetic Extensions
1E00 Latin Extended Additional
1F00 Greek Extended
2000 General Punctuation
2070 Superscripts and Subscripts
20A0 Currency Symbols
20D0 Combining Marks for Symbols
2100 Letterlike Symbols
2150 Number Forms
2190 Arrows
2200 Mathematical Operators
2300 Miscellaneous Technical
2400 Control Pictures
2440 Optical Character Recognition
2460 Enclosed Alphanumerics
2500 Box Drawing
2580 Block Elements
25A0 Geometric Shapes
2600 Miscellaneous Symbols
2700 Dingbats
27C0 Miscellaneous Mathematical Symbols-A
27F0 Supplemental Arrows-A
2800 Braille Patterns
2900 Supplemental Arrows-B
2980 Miscellaneous Mathematical Symbols-B
2A00 Supplemental Mathematical Operators
2B00 Miscellaneous Symbols and Arrows
2E80 CJK Radicals Supplement
2F00 Kangxi Radicals
2FF0 Ideographic Description Characters
3000 CJK Symbols and Punctuation
3040 Hiragana
30A0 Katakana
3100 Bopomofo
3130 Hangul Compatibility Jamo
3190 Kanbun
31A0 Bopomofo Extended
31F0 Katakana Phonetic Extensions
3200 Enclosed CJK Letters and Months
3300 CJK Compatibility
3400 CJK Unified Ideographs Extension A
4DC0 Yijing Hexagram Symbols
4E00 CJK Unified Ideographs (5MB)
A000 Yi Syllables
A490 Yi Radicals
AC00 Hangul Syllables (7MB)
D800 High Surrogates
DC00 Low Surrogates
E000 Private Use Area
F900 CJK Compatibility Ideographs
FB00 Alphabetic Presentation Forms
FB50 Arabic Presentation Forms-A
FE00 Variation Selectors
FE20 Combining Half Marks
FE30 CJK Compatibility Forms
FE50 Small Form Variants
FE70 Arabic Presentation Forms-B
FF00 Halfwidth and Fullwidth Forms
FFF0 Specials
10000 Linear B Syllabary
10080 Linear B Ideograms
10100 Aegean Numbers
10300 Old Italic
10330 Gothic
10380 Ugaritic
10400 Deseret
10450 Shavian
10480 Osmanya
10800 Cypriot Syllabary
1D000 Byzantine Musical Symbols
1D100 Musical Symbols
1D300 Tai Xuan Jing Symbols
1D400 Mathematical Alphanumeric Symbols
20000 CJK Unified Ideographs Extension B
2F800 CJK Compatibility Ideographs Supplement
E0000 Tags
E0100 Variation Selectors Supplement
F0000 Supplementary Private Use Area-A
100000 Supplementary Private Use Area-B
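Since the list gives only starting points, finding the block for a given code comes down to picking the last start that does not exceed it. Here is a minimal Python sketch of that lookup, using only the first few rows of the list above. The names `BLOCK_STARTS`, `EXCERPT_END` and `block_for` are made up for this illustration, and a code that falls in an unassigned gap between two blocks would simply be attributed to the preceding block.

```python
from bisect import bisect_right

# Excerpt of the list above: (first code of the block, block name).
BLOCK_STARTS = [
    (0x0000, "Basic Latin"),
    (0x0080, "Latin-1 Supplement"),
    (0x0100, "Latin Extended-A"),
    (0x0180, "Latin Extended-B"),
    (0x0250, "IPA Extensions"),
    (0x02B0, "Spacing Modifier Letters"),
    (0x0300, "Combining Diacritical Marks"),
    (0x0370, "Greek and Coptic"),
    (0x0400, "Cyrillic"),
    (0x0500, "Cyrillic Supplement"),
    (0x0530, "Armenian"),
]
# Start of the next block in the full list (Hebrew); the excerpt stops here.
EXCERPT_END = 0x0590


def block_for(code_point):
    """Return the name of the last block that starts at or below code_point."""
    if not 0 <= code_point < EXCERPT_END:
        raise ValueError("code point outside the excerpt")
    starts = [start for start, _ in BLOCK_STARTS]
    index = bisect_right(starts, code_point) - 1
    return BLOCK_STARTS[index][1]


print(block_for(0x03B1))     # Greek small letter alpha
print(block_for(ord("é")))   # U+00E9
```

Extending the excerpt to the full list only requires adding the remaining rows and raising `EXCERPT_END` accordingly.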
Unicode.org is not very clear about the encoding of Unicode scalar values in UTF-8 or UTF-16. They refer to a corrigendum to a document that itself doesn't seem to be online. The corrigendum contains a lot of bureaucratic formalities, which do not seem very relevant to what I wanted to know: how does UTF-8 work?
Therefore I copied (without permission) the relevant part from http://www.unicode.org/versions/corrigendum1.html
| Scalar value | UTF-16 | 1st byte | 2nd byte | 3rd byte | 4th byte |
|---|---|---|---|---|---|
| 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
Where uuuuu = wwww + 1 (to account for the addition of 10000₁₆)
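The table above translates almost directly into shifts and masks. What follows is a minimal Python sketch under that reading, not an official implementation; the function names `check_scalar`, `utf8_bytes` and `utf16_units` are my own. Subtracting 10000₁₆ before splitting the value into two surrogates is the same step as the "uuuuu = wwww + 1" relation, seen from the other side.

```python
def check_scalar(scalar):
    """Reject anything that is not a Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {scalar:#x}")


def utf8_bytes(scalar):
    """Encode one scalar value, one branch per row of the table."""
    check_scalar(scalar)
    if scalar < 0x80:        # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def utf16_units(scalar):
    """Return the UTF-16 code units (one, or a surrogate pair)."""
    check_scalar(scalar)
    if scalar < 0x10000:
        return [scalar]
    offset = scalar - 0x10000              # uuuuu - 1 = wwww
    return [0xD800 | (offset >> 10),       # 110110ww wwzzzzyy
            0xDC00 | (offset & 0x3FF)]     # 110111yy yyxxxxxx


# Compare against Python's built-in codecs for one sample per table row.
for cp in (0x41, 0xE9, 0x20AC, 0x1D11E):
    assert utf8_bytes(cp) == chr(cp).encode("utf-8")
    units = b"".join(u.to_bytes(2, "big") for u in utf16_units(cp))
    assert units == chr(cp).encode("utf-16-be")
    print(f"U+{cp:04X}",
          " ".join(f"{b:02X}" for b in utf8_bytes(cp)),
          " ".join(f"{u:04X}" for u in utf16_units(cp)))
```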
For UTF-8, the table could be extended to 26-, 31-, 36- and 42-bit scalars, opening the possibility of encoding the dazzling number of over 4 trillion (exactly 4 398 046 511 104, which is 2 to the power 42) different characters. That's US trillions, or European billions. In other words, more than 4 million million characters. Even if in the course of history thousands of new languages like Chinese, Korean and Japanese might develop, each with its own iconographic script, there would still be plenty of room.
| Scalar value | 1st byte | 2nd byte | 3rd byte | 4th byte | 5th byte | 6th byte | 7th byte | 8th byte |
|---|---|---|---|---|---|---|---|---|
| 000000tt uuuuuuzz zzzzyyyy yyxxxxxx | 111110tt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | | | |
| 0stttttt uuuuuuzz zzzzyyyy yyxxxxxx | 1111110s | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | | |
| 0000ssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx | 11111110 | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx | |
| 000000rr rrrrssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx | 11111111 | 10rrrrrr | 10ssssss | 10tttttt | 10uuuuuu | 10zzzzzz | 10yyyyyy | 10xxxxxx |
To me, as a sometimes rather bit-oriented person, this much is clear enough: it tells me all I wanted to know about UTF-8 encoding.
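The extended table follows one regular pattern: every continuation byte carries 6 bits, and the lead byte carries whatever is left over (2, 1, 0 and 0 bits for the 5-, 6-, 7- and 8-byte forms). A small Python sketch of that pattern follows. It implements the thought experiment on this page, not standard UTF-8, and the name `extended_utf8` is made up for the purpose.

```python
def extended_utf8(value):
    """Encode an integer of up to 42 bits using the extended byte layout."""
    if not 0 <= value < 1 << 42:
        raise ValueError("value does not fit in 42 bits")
    if value < 0x80:
        return bytes([value])          # single byte: 0xxxxxxx
    # Find the shortest form whose payload can hold the value:
    # 11, 16, 21, 26, 31, 36 and 42 bits for 2 to 8 bytes.
    for length in range(2, 9):
        payload_bits = 6 * (length - 1) + max(0, 7 - length)
        if value < 1 << payload_bits:
            break
    # Lead byte prefix: `length` one-bits (followed by a zero if it fits).
    lead_prefix = (0xFF << (8 - length)) & 0xFF
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))   # 10xxxxxx, lowest bits first
        value >>= 6
    # Whatever remains of `value` goes into the free bits of the lead byte.
    return bytes([lead_prefix | value] + tail[::-1])


assert 2 ** 42 == 4_398_046_511_104          # the count quoted above
for n in (0x20AC, 1 << 21, 1 << 26, 1 << 31, (1 << 42) - 1):
    encoded = extended_utf8(n)
    print(f"{n:#x}: {len(encoded)} bytes,",
          " ".join(f"{b:02X}" for b in encoded))
```

For values up to 10FFFF₁₆ (outside the surrogate range) this produces the same bytes as ordinary UTF-8, which is a quick way to check the sketch against Python's own encoder.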
See also: Alan Wood, and Unicode Transformation Formats: UTF-8 & Co by Roman Czyborra, which also covers SCSU (Standard Compression Scheme for Unicode).
And here's the RFC that defines UTF-8. It says "The octet values FE and FF never appear", which means that my 36- and 42-bit scalars above cannot be encoded. Not a big problem.
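The quoted statement is easy to check by brute force, at least for the encoder that ships with Python (which I am assuming here follows the RFC). This short sketch encodes every Unicode scalar value once and collects the byte values that occur.

```python
# Every Unicode scalar value: U+0000..U+10FFFF minus the surrogate range.
scalars = (cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
encoded = "".join(map(chr, scalars)).encode("utf-8")

# Iterating over a bytes object yields integers, so this is a set of octets.
used = set(encoded)

print(0xFE in used, 0xFF in used)   # prints: False False
print(f"highest octet value used: {max(used):#04x}")
```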
See also UTF-8.
And this tells us how Ken Thompson invented UTF-8, cheered on by Rob Pike.