UTF8 in context

12 April 2020

Over 11.5 years ago, in September 2008, I got briefly interested in the pronunciation of the Danish language. I’m still fascinated by it, but I won’t attempt to learn it. Too difficult.

Wikipedia has a very detailed description. It contains IPA, i.e. symbols from the International Phonetic Alphabet. I am rather well versed in that alphabet, but many of the less usual symbols, especially when it comes to diacritics, I do not know by heart. So I wanted to look them up. But Googling them is difficult, and so is comparing them visually: they are rather small, and several of them look quite similar.

What I usually do in such a situation: find out what exactly a page says, what the underlying code is, at the byte level if necessary. In the past, in older Unixes, and in MSDOS and 16- and 32-bit Windows versions with Unix tools installed on them, my idiom was:
od -ch:
octal dump, show characters (c) and their hexadecimal encoding (h). In recent years, that changed: it no longer shows the hex code of each character, but that of 16-bit "small words" or something. Not practical. To get the old behaviour, the new incantation is
od -c -tx1,
i.e. show characters (-c), and output type (-t) hexadecimal in units of 1 byte (x1).

I don’t like it, but so be it. Anyway, the hexadecimal codes still aren’t clarifying, because there is a multilevel encoding going on. The IPA is a symbol set, so in a sense an encoding. In Wikipedia and most other websites nowadays, it is encoded in the much larger and universal encoding Unicode, which has room for at most 17*(2^16) = 1114112 integer numbers, each of which can in theory (and some also in practice) represent a symbol from one of the world’s languages and scripts, both taken in a wide sense of the word. So all the IPA symbols are also in there, somewhere.

The Unicode integer numbers, the so-called scalars, are in turn encoded again, in UTF8, which turns each of them into a sequence of one, two, three or at most four eight-bit bytes.
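To make that last step concrete, here is a minimal sketch in C of how one scalar becomes one to four bytes. It is not taken from utftools; the function name utf8_encode and its interface are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar as UTF-8 (hypothetical helper, not from utftools).
 * Writes 1 to 4 bytes into out and returns the number of bytes written,
 * or 0 if the value is not a scalar (a surrogate, or above 0x10FFFF). */
size_t utf8_encode(uint32_t scalar, unsigned char out[4])
{
    if (scalar < 0x80) {                /* 0xxxxxxx */
        out[0] = (unsigned char)scalar;
        return 1;
    }
    if (scalar < 0x800) {               /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (scalar >> 6));
        out[1] = (unsigned char)(0x80 | (scalar & 0x3F));
        return 2;
    }
    if (scalar < 0x10000) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (scalar >= 0xD800 && scalar <= 0xDFFF)
            return 0;                   /* surrogates are not scalars */
        out[0] = (unsigned char)(0xE0 | (scalar >> 12));
        out[1] = (unsigned char)(0x80 | ((scalar >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (scalar & 0x3F));
        return 3;
    }
    if (scalar <= 0x10FFFF) {           /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (scalar >> 18));
        out[1] = (unsigned char)(0x80 | ((scalar >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((scalar >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (scalar & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

Called with scalar 0x2C8 (the IPA primary stress mark), this gives the two bytes cb 88; plain ASCII such as 0x5B stays a single byte.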

So even if the underlying byte codes are available, the reverse steps of decoding are necessary to see what it says. Computers can do that much better than humans, and that’s why I wrote the little program that this article is about.
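The reverse step, from bytes back to a scalar, could look roughly like the sketch below. Again this is an illustration under my own assumptions, not the code of the program itself: the name utf8_decode and the choice to return 0 for malformed input are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available
 * (hypothetical helper). Stores the scalar in *scalar and returns the
 * sequence length (1 to 4), or 0 if the bytes are not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *scalar)
{
    /* smallest scalar that legitimately needs a sequence of each length */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* sequence is cut off */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a 10xxxxxx continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_value[len])
        return 0;                   /* overlong encoding */
    if (v >= 0xD800 && v <= 0xDFFF)
        return 0;                   /* surrogate */
    if (v > 0x10FFFF)
        return 0;                   /* beyond the Unicode range */
    *scalar = v;
    return len;
}
```

By hand, for the two bytes cc af: the lead byte cc contributes its low five bits (0x0C), the continuation byte af its low six bits (0x2F), and (0x0C << 6) | 0x2F = 0x32F.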

The C source code is here for download, and here for easier perusal. Use is made of some UTF tool functions (utftools) which I wrote, and published here. There is even a Makefile, but it only works if you have the same directory structure that I have, which you probably haven’t. I should maybe put this on GitHub or something, as a library, but I don’t know how GitHub works, and don’t feel like finding out. Lazy, easily distracted and too full of other plans.

Instead of explaining how the program works (I hardly understand that myself anymore, as I experienced when fixing a nasty bug recently), I will show examples of its output.

OK, so we were on about Danish phonology. Or I was, anyway. At the end of the aforementioned Wikipedia article, there is this example phrase:
Nordenvinden og solen kom engang i strid om, hvem af dem der var den stærkeste.

Its pronunciation is given as:
[ˈnoɐ̯ɐnˌve̝nˀn̩ ʌ ˈsoˀl̩n kʰʌm e̝ŋˈkɑŋˀ i ˈstʁiðˀ ˈʌmˀ ˈvemˀ ˈæ pm̩ tɑ vɑ tn̩ ˈstɛɐ̯kəstə]. (Source: Nina Grønnum, 1998, page 104.)

Yeah, right. What does all that mean? We can find out by copying this text, running my program on the command line, pasting the text, and ending the input with a newline and a ctrl-d. The result is:

000000 0x5b-cb-88-6e 0x00005b: [.no...n.ve.n.n. . .so.l.n k..m e...k...
000001 0xcb-88-6e-6f 0x0002c8: .no...n.ve.n.n. . .so.l.n k..m e...k...
000003 0x6e-6f-c9-90 0x00006e: no...n.ve.n.n. . .so.l.n k..m e...k... i
000004 0x6f-c9-90-cc 0x00006f: o...n.ve.n.n. . .so.l.n k..m e...k... i
000005 0xc9-90-cc-af 0x000250: ...n.ve.n.n. . .so.l.n k..m e...k... i .
000007 0xcc-af-c9-90 0x00032f: ..n.ve.n.n. . .so.l.n k..m e...k... i .s
000009 0xc9-90-6e-cb 0x000250: .n.ve.n.n. . .so.l.n k..m e...k... i .st
000011 0x6e-cb-8c-76 0x00006e: n.ve.n.n. . .so.l.n k..m e...k... i .st.
000012 0xcb-8c-76-65 0x0002cc: .ve.n.n. . .so.l.n k..m e...k... i .st.i
000014 0x76-65-cc-9d 0x000076: ve.n.n. . .so.l.n k..m e...k... i .st.i.
000015 0x65-cc-9d-6e 0x000065: e.n.n. . .so.l.n k..m e...k... i .st.i..
000016 0xcc-9d-6e-cb 0x00031d: .n.n. . .so.l.n k..m e...k... i .st.i..
000018 0x6e-cb-80-6e 0x00006e: n.n. . .so.l.n k..m e...k... i .st.i.. .
000019 0xcb-80-6e-cc 0x0002c0: .n. . .so.l.n k..m e...k... i .st.i.. ..
000021 0x6e-cc-a9-20 0x00006e: n. . .so.l.n k..m e...k... i .st.i.. ..m
000022 0xcc-a9-20-ca 0x000329: . . .so.l.n k..m e...k... i .st.i.. ..m.
000024 0x20-ca-8c-20 0x000020:  . .so.l.n k..m e...k... i .st.i.. ..m.
000025 0xca-8c-20-cb 0x00028c: . .so.l.n k..m e...k... i .st.i.. ..m. .
000027 0x20-cb-88-73 0x000020:  .so.l.n k..m e...k... i .st.i.. ..m. .v
000028 0xcb-88-73-6f 0x0002c8: .so.l.n k..m e...k... i .st.i.. ..m. .ve
000030 0x73-6f-cb-80 0x000073: so.l.n k..m e...k... i .st.i.. ..m. .vem
000031 0x6f-cb-80-6c 0x00006f: o.l.n k..m e...k... i .st.i.. ..m. .vem.
000032 0xcb-80-6c-cc 0x0002c0: .l.n k..m e...k... i .st.i.. ..m. .vem.
000034 0x6c-cc-a9-6e 0x00006c: l.n k..m e...k... i .st.i.. ..m. .vem. .
000035 0xcc-a9-6e-20 0x000329: .n k..m e...k... i .st.i.. ..m. .vem. ..
000037 0x6e-20-6b-ca 0x00006e: n k..m e...k... i .st.i.. ..m. .vem. ..
000038 0x20-6b-ca-b0 0x000020:  k..m e...k... i .st.i.. ..m. .vem. .. p
000039 0x6b-ca-b0-ca 0x00006b: k..m e...k... i .st.i.. ..m. .vem. .. pm
000040 0xca-b0-ca-8c 0x0002b0: ..m e...k... i .st.i.. ..m. .vem. .. pm.
000042 0xca-8c-6d-20 0x00028c: .m e...k... i .st.i.. ..m. .vem. .. pm.
000044 0x6d-20-65-cc 0x00006d: m e...k... i .st.i.. ..m. .vem. .. pm. t
000045 0x20-65-cc-9d 0x000020:  e...k... i .st.i.. ..m. .vem. .. pm. t.
000046 0x65-cc-9d-c5 0x000065: e...k... i .st.i.. ..m. .vem. .. pm. t.
000047 0xcc-9d-c5-8b 0x00031d: ...k... i .st.i.. ..m. .vem. .. pm. t. v
000049 0xc5-8b-cb-88 0x00014b: ..k... i .st.i.. ..m. .vem. .. pm. t. v.
000051 0xcb-88-6b-c9 0x0002c8: .k... i .st.i.. ..m. .vem. .. pm. t. v.
000053 0x6b-c9-91-c5 0x00006b: k... i .st.i.. ..m. .vem. .. pm. t. v. t
000054 0xc9-91-c5-8b 0x000251: ... i .st.i.. ..m. .vem. .. pm. t. v. tn
000056 0xc5-8b-cb-80 0x00014b: .. i .st.i.. ..m. .vem. .. pm. t. v. tn.
000058 0xcb-80-20-69 0x0002c0: . i .st.i.. ..m. .vem. .. pm. t. v. tn.
000060 0x20-69-20-cb 0x000020:  i .st.i.. ..m. .vem. .. pm. t. v. tn. .
000061 0x69-20-cb-88 0x000069: i .st.i.. ..m. .vem. .. pm. t. v. tn. .s
000062 0x20-cb-88-73 0x000020:  .st.i.. ..m. .vem. .. pm. t. v. tn. .st
000063 0xcb-88-73-74 0x0002c8: .st.i.. ..m. .vem. .. pm. t. v. tn. .st.
000065 0x73-74-ca-81 0x000073: st.i.. ..m. .vem. .. pm. t. v. tn. .st..
000066 0x74-ca-81-69 0x000074: t.i.. ..m. .vem. .. pm. t. v. tn. .st...
000067 0xca-81-69-c3 0x000281: .i.. ..m. .vem. .. pm. t. v. tn. .st...k
000069 0x69-c3-b0-cb 0x000069: i.. ..m. .vem. .. pm. t. v. tn. .st...k.
000070 0xc3-b0-cb-80 0x0000f0: .. ..m. .vem. .. pm. t. v. tn. .st...k.s
000072 0xcb-80-20-cb 0x0002c0: . ..m. .vem. .. pm. t. v. tn. .st...k.st
000074 0x20-cb-88-ca 0x000020:  ..m. .vem. .. pm. t. v. tn. .st...k.st.
000075 0xcb-88-ca-8c 0x0002c8: ..m. .vem. .. pm. t. v. tn. .st...k.st.]
000077 0xca-8c-6d-cb 0x00028c: .m. .vem. .. pm. t. v. tn. .st...k.st.].
000079 0x6d-cb-80-20 0x00006d: m. .vem. .. pm. t. v. tn. .st...k.st.].
000080 0xcb-80-20-cb 0x0002c0: . .vem. .. pm. t. v. tn. .st...k.st.].
000082 0x20-cb-88-76 0x000020:  .vem. .. pm. t. v. tn. .st...k.st.].
000083 0xcb-88-76-65 0x0002c8: .vem. .. pm. t. v. tn. .st...k.st.].
000085 0x76-65-6d-cb 0x000076: vem. .. pm. t. v. tn. .st...k.st.].
000086 0x65-6d-cb-80 0x000065: em. .. pm. t. v. tn. .st...k.st.].
000087 0x6d-cb-80-20 0x00006d: m. .. pm. t. v. tn. .st...k.st.].
000088 0xcb-80-20-cb 0x0002c0: . .. pm. t. v. tn. .st...k.st.].
000090 0x20-cb-88-c3 0x000020:  .. pm. t. v. tn. .st...k.st.].
000091 0xcb-88-c3-a6 0x0002c8: .. pm. t. v. tn. .st...k.st.].
000093 0xc3-a6-20-70 0x0000e6: . pm. t. v. tn. .st...k.st.].
000095 0x20-70-6d-cc 0x000020:  pm. t. v. tn. .st...k.st.].
000096 0x70-6d-cc-a9 0x000070: pm. t. v. tn. .st...k.st.].
000097 0x6d-cc-a9-20 0x00006d: m. t. v. tn. .st...k.st.].
000098 0xcc-a9-20-74 0x000329: . t. v. tn. .st...k.st.].
000100 0x20-74-c9-91 0x000020:  t. v. tn. .st...k.st.].
000101 0x74-c9-91-20 0x000074: t. v. tn. .st...k.st.].
000102 0xc9-91-20-76 0x000251: . v. tn. .st...k.st.].
000104 0x20-76-c9-91 0x000020:  v. tn. .st...k.st.].
000105 0x76-c9-91-20 0x000076: v. tn. .st...k.st.].
000106 0xc9-91-20-74 0x000251: . tn. .st...k.st.].
000108 0x20-74-6e-cc 0x000020:  tn. .st...k.st.].
000109 0x74-6e-cc-a9 0x000074: tn. .st...k.st.].
000110 0x6e-cc-a9-20 0x00006e: n. .st...k.st.].
000111 0xcc-a9-20-cb 0x000329: . .st...k.st.].
000113 0x20-cb-88-73 0x000020:  .st...k.st.].
000114 0xcb-88-73-74 0x0002c8: .st...k.st.].
000116 0x73-74-c9-9b 0x000073: st...k.st.].
000117 0x74-c9-9b-c9 0x000074: t...k.st.].
000118 0xc9-9b-c9-90 0x00025b: ...k.st.].
000120 0xc9-90-cc-af 0x000250: ..k.st.].
000122 0xcc-af-6b-c9 0x00032f: .k.st.].
000124 0x6b-c9-99-73 0x00006b: k.st.].
000125 0xc9-99-73-74 0x000259: .st.].
000127 0x73-74-c9-99 0x000073: st.].
000128 0x74-c9-99-5d 0x000074: t.].
000129 0xc9-99-5d-0a 0x000259: .].
000131 0x5d-0a-00-0a 0x00005d: ].
000132 0x0a-00-00-0a 0x00000a: .

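What we see here, on each line, is: the offset of the character in the input, counted in bytes (decimal); the four bytes starting at that offset (hexadecimal); the Unicode scalar that the UTF8 sequence at that offset decodes to (hexadecimal); and, after the colon, the text from that point on, up to 40 characters, with everything that is not printable ASCII shown as a dot.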

In a text like this, which contains normal ASCII letters with some more difficult symbols interspersed, this way of showing things gives you context, so you can see where a symbol is located. For example, to find out what that little thingy under the turned a in the last word is, you look for the “st” and “k” near the end, and see that the codes between them are Unicode scalars 25b, 250, 32f. With the help of my Unicode code page index, code pages IPA Extensions and Combining Diacritical Marks, and Wikipedia on the IPA, we can find what they mean: 25b ɛ is an open-mid front unrounded vowel, 250 ɐ is near-open central unrounded, and 32f X̯ means semivowel, or non-syllabic, that is, the vowel plays the role of a consonant. Likewise, the sign that appears under two of the e symbols is 31d X̝, meaning “vowel raising or closing”.
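Knowing which code page to open for a given scalar is just a range lookup. The sketch below covers only the Unicode blocks that occur in the examples of this article; the struct, the table and the function name block_name are invented for illustration and are not part of the program.

```c
#include <stddef.h>
#include <stdint.h>

/* One Unicode block: first and last scalar, and the block's name. */
struct block {
    uint32_t first, last;
    const char *name;
};

/* Only the blocks needed for this article's examples, not the full list. */
static const struct block blocks[] = {
    { 0x0000, 0x007F, "Basic Latin" },
    { 0x0080, 0x00FF, "Latin-1 Supplement" },
    { 0x0100, 0x017F, "Latin Extended-A" },
    { 0x0250, 0x02AF, "IPA Extensions" },
    { 0x02B0, 0x02FF, "Spacing Modifier Letters" },
    { 0x0300, 0x036F, "Combining Diacritical Marks" },
    { 0xA980, 0xA9DF, "Javanese" },
};

/* Return the name of the block containing scalar, or a placeholder
 * string if the scalar falls outside this small table. */
const char *block_name(uint32_t scalar)
{
    size_t i;

    for (i = 0; i < sizeof blocks / sizeof blocks[0]; i++) {
        if (scalar >= blocks[i].first && scalar <= blocks[i].last)
            return blocks[i].name;
    }
    return "(not in this table)";
}
```

With this, 25b and 250 land in IPA Extensions, 32f and 31d in Combining Diacritical Marks, and the stress mark 2c8 in Spacing Modifier Letters.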

Hence the name, utf8cntx: UTF8 in context, that is, UTF8 decoded and displayed in the context of the surrounding old-fashioned ASCII, if present.
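For readers who want to experiment, the output format shown above can be approximated with the sketch below. It is reconstructed from the output only, not from the actual source: the names (format_line, scalar_at, seq_len), the 40-character context width, the zero padding past the end of the input and the lenient decoding without validity checks are all assumptions of mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CONTEXT_CHARS 40   /* assumed width of the context column */
#define LINE_BUF      96   /* enough for one formatted line plus NUL */

/* Sequence length announced by lead byte b; 1 for anything that is not
 * a valid lead byte, so that the caller always makes progress. */
static size_t seq_len(unsigned char b)
{
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Lenient decode of the sequence at buf[pos]; continuation bytes are
 * not checked. Stores the number of bytes consumed in *used. */
static uint32_t scalar_at(const unsigned char *buf, size_t len, size_t pos,
                          size_t *used)
{
    size_t n = seq_len(buf[pos]), i;
    uint32_t v;

    if (pos + n > len)
        n = 1;                          /* cut off at the end of the input */
    v = (n == 1) ? buf[pos] : (buf[pos] & (0xFFu >> (n + 1)));
    for (i = 1; i < n; i++)
        v = (v << 6) | (buf[pos + i] & 0x3Fu);
    *used = n;
    return v;
}

/* Format one output line for the character at byte offset pos (pos < len).
 * Returns the offset of the next character. */
size_t format_line(const unsigned char *buf, size_t len, size_t pos,
                   char out[LINE_BUF])
{
    unsigned b[4];
    size_t used, n, p, k, i;
    uint32_t v = scalar_at(buf, len, pos, &used);
    int w;

    for (i = 0; i < 4; i++)             /* four bytes, zero-padded at the end */
        b[i] = (pos + i < len) ? buf[pos + i] : 0;
    w = sprintf(out, "%06zu 0x%02x-%02x-%02x-%02x 0x%06lx: ",
                pos, b[0], b[1], b[2], b[3], (unsigned long)v);

    /* context: one output character per decoded character, with a dot
     * for everything that is not printable ASCII */
    for (p = pos, k = 0; p < len && k < CONTEXT_CHARS; k++) {
        uint32_t c = scalar_at(buf, len, p, &n);
        out[w++] = (c >= 0x20 && c < 0x7F) ? (char)c : '.';
        p += n;
    }
    out[w] = '\0';
    return pos + used;
}
```

To dump a whole buffer, start at offset 0 and keep calling format_line with the offset it returned, printing each line with puts, until the offset reaches the length of the input.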

In a text that consists almost entirely of non-ASCII, the advantage of context no longer holds, but the program can still be useful for decoding UTF8. For example, I recently got interested in the Javanese script (why? don’t ask), and wanted to know how exactly a sample phrase in Javanese is spelled. There is one here from the Universal Declaration of Human Rights. Just the first two words, saben uwong: ꦱꦧꦼꦤꦸꦮꦺꦴꦁ, written without a space in between. The encoding is:

000000 0xea-a6-b1-ea 0x00a9b1: ..........
000003 0xea-a6-a7-ea 0x00a9a7: .........
000006 0xea-a6-bc-ea 0x00a9bc: ........
000009 0xea-a6-a4-ea 0x00a9a4: .......
000012 0xea-a6-b8-ea 0x00a9b8: ......
000015 0xea-a6-ae-ea 0x00a9ae: .....
000018 0xea-a6-ba-ea 0x00a9ba: ....
000021 0xea-a6-b4-ea 0x00a9b4: ...
000024 0xea-a6-81-0a 0x00a981: ..

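From that, I can see, again consulting info from Wikipedia and Unicode.org, that the phrase is spelled with the letters sa (A9B1) and ba (A9A7), the vowel sign pepet (A9BC), the letter na (A9A4) carrying the vowel sign suku (A9B8) for the ‘u’, the letter wa (A9AE) with the vowel signs taling (A9BA) and tarung (A9B4), which together write ‘o’, and finally the sign cecak (A981) for the closing ‘ng’.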

An alternative way to write this, not usually applied in practice, is with spaces between the words: ꦱꦧꦼꦤ꧀ ꦈꦮꦺꦴꦁ. As a result, a pangkon (A9C0) is necessary (and it appears visually, instead of making the often modified next consonant appear under the previous), and the vowel ‘u’ is written as an independent sign (A988). The encoding is:

000000 0xea-a6-b1-ea 0x00a9b1: ..... ......
000003 0xea-a6-a7-ea 0x00a9a7: .... ......
000006 0xea-a6-bc-ea 0x00a9bc: ... ......
000009 0xea-a6-a4-ea 0x00a9a4: .. ......
000012 0xea-a7-80-20 0x00a9c0: . ......
000015 0x20-ea-a6-88 0x000020:  ......
000016 0xea-a6-88-ea 0x00a988: ......
000019 0xea-a6-ae-ea 0x00a9ae: .....
000022 0xea-a6-ba-ea 0x00a9ba: ....
000025 0xea-a6-b4-ea 0x00a9b4: ...
000028 0xea-a6-81-0a 0x00a981: ..

Well, Javanese script is not the subject of this article, so it ends here.


20 April 2020

A much improved version of utf8cntx.c is utfcntxt.c (download)