
Links to Unicode code charts

See below for details about UTF-8 encoding.

In its charts section, Unicode.org offers useful links to the Unicode code charts. The only problem is that the code range of each subset is not listed there, so if you already have a rough or precise idea of the code you're looking for, their set of links is difficult to use. Therefore I present similar links here, each showing the starting point of its code range.
See also this font tester.


0000 Basic Latin
0080 Latin-1 Supplement
0100 Latin Extended-A
0180 Latin Extended-B
0250 IPA Extensions
02B0 Spacing Modifier Letters
0300 Combining Diacritical Marks
0370 Greek and Coptic
0400 Cyrillic
0500 Cyrillic Supplement
0530 Armenian
0590 Hebrew
0600 Arabic
0700 Syriac
0780 Thaana
0900 Devanagari
0980 Bengali
0A00 Gurmukhi
0A80 Gujarati
0B00 Oriya
0B80 Tamil
0C00 Telugu
0C80 Kannada
0D00 Malayalam
0D80 Sinhala
0E00 Thai
0E80 Lao
0F00 Tibetan
1000 Myanmar
10A0 Georgian
1100 Hangul Jamo
1200 Ethiopic
13A0 Cherokee
1400 Unified Canadian Aboriginal Syllabics
1680 Ogham
16A0 Runic
1700 Tagalog
1720 Hanunoo
1740 Buhid
1760 Tagbanwa
1780 Khmer
1800 Mongolian
1900 Limbu
1950 Tai Le
19E0 Khmer Symbols
1D00 Phonetic Extensions
1E00 Latin Extended Additional
1F00 Greek Extended
2000 General Punctuation
2070 Superscripts and Subscripts
20A0 Currency Symbols
20D0 Combining Diacritical Marks for Symbols
2100 Letterlike Symbols
2150 Number Forms
2190 Arrows
2200 Mathematical Operators
2300 Miscellaneous Technical
2400 Control Pictures
2440 Optical Character Recognition
2460 Enclosed Alphanumerics
2500 Box Drawing
2580 Block Elements
25A0 Geometric Shapes
2600 Miscellaneous Symbols
2700 Dingbats
27C0 Miscellaneous Mathematical Symbols-A
27F0 Supplemental Arrows-A
2800 Braille Patterns
2900 Supplemental Arrows-B
2980 Miscellaneous Mathematical Symbols-B
2A00 Supplemental Mathematical Operators
2B00 Miscellaneous Symbols and Arrows
2E80 CJK Radicals Supplement
2F00 Kangxi Radicals
2FF0 Ideographic Description Characters
3000 CJK Symbols and Punctuation
3040 Hiragana
30A0 Katakana
3100 Bopomofo
3130 Hangul Compatibility Jamo
3190 Kanbun
31A0 Bopomofo Extended
31F0 Katakana Phonetic Extensions
3200 Enclosed CJK Letters and Months
3300 CJK Compatibility
3400 CJK Unified Ideographs Extension A
4DC0 Yijing Hexagram Symbols
4E00 CJK Unified Ideographs (5MB)
A000 Yi Syllables
A490 Yi Radicals
AC00 Hangul Syllables (7MB)
D800 High Surrogates
DC00 Low Surrogates
E000 Private Use Area
F900 CJK Compatibility Ideographs
FB00 Alphabetic Presentation Forms
FB50 Arabic Presentation Forms-A
FE00 Variation Selectors
FE20 Combining Half Marks
FE30 CJK Compatibility Forms
FE50 Small Form Variants
FE70 Arabic Presentation Forms-B
FF00 Halfwidth and Fullwidth Forms
FFF0 Specials
10000 Linear B Syllabary
10080 Linear B Ideograms
10100 Aegean Numbers
10300 Old Italic
10330 Gothic
10380 Ugaritic
10400 Deseret
10450 Shavian
10480 Osmanya
10800 Cypriot Syllabary
1D000 Byzantine Musical Symbols
1D100 Musical Symbols
1D300 Tai Xuan Jing Symbols
1D400 Mathematical Alphanumeric Symbols
20000 CJK Unified Ideographs Extension B
2F800 CJK Compatibility Ideographs Supplement
E0000 Tags
E0100 Variation Selectors Supplement
F0000 Supplementary Private Use Area-A
100000 Supplementary Private Use Area-B



UTF-8

Unicode.org is not very clear about the encoding of Unicode scalar values in UTF-8 or UTF-16. They refer to a corrigendum to a document that itself doesn't seem to be online. The corrigendum contains a lot of bureaucratic formalities, which do not seem very relevant to what I wanted to know: how does UTF-8 work?

Therefore I copied (without permission) the relevant part from http://www.unicode.org/versions/corrigendum1.html


Table 3.1. UTF-8 bit distribution
Scalar value                 UTF-16                                1st byte  2nd byte  3rd byte  4th byte
00000000 0xxxxxxx            00000000 0xxxxxxx                     0xxxxxxx
00000yyy yyxxxxxx            00000yyy yyxxxxxx                     110yyyyy  10xxxxxx
zzzzyyyy yyxxxxxx            zzzzyyyy yyxxxxxx                     1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   110110ww wwzzzzyy 110111yy yyxxxxxx   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx

Where uuuuu = wwww + 1 (to account for the addition of 10000 hexadecimal)
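For those who prefer code to bit patterns, here is a minimal Python sketch of the four rows of Table 3.1, plus the UTF-16 surrogate split from the last row. The function names utf8_encode and utf16_surrogates are my own, chosen for illustration; they do not come from the corrigendum.

```python
def utf8_encode(scalar):
    """Encode one Unicode scalar value, row by row as in Table 3.1."""
    if scalar < 0 or scalar > 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x80:        # 0xxxxxxx
        return bytes([scalar])
    if scalar < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (scalar >> 6),
                      0x80 | (scalar & 0x3F)])
    if scalar < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (scalar >> 12),
                      0x80 | ((scalar >> 6) & 0x3F),
                      0x80 | (scalar & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (scalar >> 18),
                  0x80 | ((scalar >> 12) & 0x3F),
                  0x80 | ((scalar >> 6) & 0x3F),
                  0x80 | (scalar & 0x3F)])


def utf16_surrogates(scalar):
    """Split a scalar in 10000..10FFFF (hex) into (high, low) surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only scalars above FFFF need surrogates")
    v = scalar - 0x10000     # this subtraction is why uuuuu = wwww + 1
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)
```

As a quick check, utf8_encode(0x20AC) (the euro sign) yields the three bytes E2 82 AC, and utf16_surrogates(0x1D11E) yields D834 and DD1E.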

For UTF-8, the table could be extended to 26, 31, 36 and 42 bit scalars, opening the possibility of encoding the dazzling number of over 4 trillion (exactly 4 398 046 511 104, which is 2 to the power 42) different characters. That's US trillions, or European billions: in other words, more than 4 million million characters. Even if in the course of history thousands of new languages like Chinese, Korean and Japanese were to develop, each with its own ideographic script, there would still be plenty of room.

Scalar value                                            1st byte  2nd byte  3rd byte  4th byte  5th byte  6th byte  7th byte  8th byte
000000tt uuuuuuzz zzzzyyyy yyxxxxxx                     111110tt  10uuuuuu  10zzzzzz  10yyyyyy  10xxxxxx
0stttttt uuuuuuzz zzzzyyyy yyxxxxxx                     1111110s  10tttttt  10uuuuuu  10zzzzzz  10yyyyyy  10xxxxxx
0000ssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx            11111110  10ssssss  10tttttt  10uuuuuu  10zzzzzz  10yyyyyy  10xxxxxx
000000rr rrrrssss sstttttt uuuuuuzz zzzzyyyy yyxxxxxx   11111111  10rrrrrr  10ssssss  10tttttt  10uuuuuu  10zzzzzz  10yyyyyy  10xxxxxx
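The extended scheme follows one regular pattern, which can be sketched in Python as below. This is only an illustration of my extrapolated table, not standard UTF-8: the names payload_bits and extended_encode are hypothetical, and the 5 to 8 byte forms are not something a conforming decoder would accept.

```python
def payload_bits(n):
    """Number of scalar bits that fit in an n-byte sequence (n = 1..8)."""
    if n == 1:
        return 7
    # The lead byte keeps 7 - n free bits (none for n = 7 or 8);
    # each of the n - 1 trailing bytes carries 6 bits.
    return max(7 - n, 0) + 6 * (n - 1)


def extended_encode(value):
    """Encode value with the shortest sequence of the extended table."""
    if not 0 <= value < (1 << 42):
        raise ValueError("value does not fit in 42 bits")
    if value < 0x80:
        return bytes([value])
    # Pick the smallest byte count whose payload can hold the value.
    n = next(k for k in range(2, 9) if value < (1 << payload_bits(k)))
    prefix = (0xFF << (8 - n)) & 0xFF          # n leading 1 bits
    lead = prefix | (value >> (6 * (n - 1)))
    tail = [0x80 | ((value >> (6 * i)) & 0x3F) for i in range(n - 2, -1, -1)]
    return bytes([lead] + tail)
```

For 5, 6, 7 and 8 bytes, payload_bits gives 26, 31, 36 and 42, matching the rows above, and 1 << 42 is the 4 398 046 511 104 mentioned earlier. For values up to 10FFFF (hex) the result agrees with ordinary UTF-8.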

To me, as a sometimes rather bit-oriented person, this much is clear enough: it tells me all I wanted to know about UTF-8 encoding.

See also: Alan Wood, and Unicode Transformation Formats: UTF-8 & Co by Roman Czyborra, which also covers SCSU (Standard Compression Scheme for Unicode).

And here's the RFC that defines UTF-8. It says "The octet values FE and FF never appear", which means that my 36 and 42 bit scalars above cannot be encoded. Not a big problem.
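
To see why that sentence rules out the two longest forms, it helps to list the possible first bytes per sequence length. The following is a small Python sketch; lead_byte_range is a name I made up for this purpose, and the 7 and 8 byte rows are my own extrapolation rather than anything the RFC defines.

```python
def lead_byte_range(n):
    """Lowest and highest possible first byte of an n-byte sequence (n = 2..8)."""
    prefix = (0xFF << (8 - n)) & 0xFF      # n leading 1 bits
    free_bits = max(7 - n, 0)              # scalar bits left in the lead byte
    return prefix, prefix | ((1 << free_bits) - 1)


for n in range(2, 9):
    low, high = lead_byte_range(n)
    print(n, f"{low:02X}..{high:02X}")
```

The 6-byte form tops out at FD, while the 7-byte and 8-byte forms would need FE and FF as their first byte, exactly the two values the RFC excludes.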
See also UTF-8.
And this tells us how Ken Thompson invented UTF-8, cheered on by Rob Pike.