30 July and . Continued from the previous.
The program was inspired by, based upon, and simplified from earlier word extractors I wrote in 2011 and 2013. Those worked for ASCII and for single-byte extensions such as ISO-8859-1 or Windows codepage 1252. This new program, however, expects UTF-8 in the files it examines. That is the only encoding it supports, nothing else.
siworin-wordsep.c is here.
This made it necessary to replace a lot of standard C types, library functions, and fprintf formats with their wide-character equivalents, as follows:
| for single-byte characters | for wide characters |
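As an illustration of the kind of replacement meant here, the following sketch lists some standard correspondences in a comment and then shows the same loop written twice, once per column. The function names count_alpha and count_alpha_wide are hypothetical and do not come from siworin-wordsep.c; the listed pairs are standard C, but they are not necessarily the same selection as in the table.

```c
#include <assert.h>
#include <ctype.h>   /* isalpha */
#include <stddef.h>  /* size_t */
#include <wchar.h>   /* wchar_t, wint_t */
#include <wctype.h>  /* iswalpha */

/*
 * single-byte              wide
 * -----------              ----
 * char                     wchar_t
 * int (can hold EOF)       wint_t (can hold WEOF)
 * 'a', "abc"               L'a', L"abc"
 * getc, putc               getwc, putwc
 * isalpha, toupper         iswalpha, towupper
 * strlen                   wcslen
 * "%c", "%s"               "%lc", "%ls"
 */

/* Count the alphabetic characters in a single-byte string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* The same loop for a wide string: only the types, the literal
   and the classification function change. */
size_t count_alpha_wide(const wchar_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != L'\0'; s++)
        if (iswalpha((wint_t)*s))
            n++;
    return n;
}
```

The structure of both functions is identical, which is the point: the algorithm stays, the names change.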
You must set a locale, by a function call such as:
I chose to set American English and UTF-8 explicitly, although that is also what my operating system and shell (bash) set by default. What really matters here is not the language, but the encoding. It has to be UTF-8 or else the program won’t work as intended.
Leaving out the call to setlocale is not an option.
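A minimal sketch of such a call, with a check that the resulting encoding really is UTF-8. The helper name use_utf8_locale is hypothetical, and nl_langinfo(CODESET) is a POSIX facility rather than standard C, so this assumes a POSIX system. Whether a locale such as en_US.UTF-8 is installed depends on the machine.

```c
#include <assert.h>
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>    /* setlocale, LC_ALL */
#include <string.h>    /* strcmp */

/* Try to select the named locale and report whether its encoding
   is UTF-8. Returns 1 if so, 0 if the locale is not available or
   uses some other encoding. */
int use_utf8_locale(const char *name)
{
    assert(name != NULL);
    if (setlocale(LC_ALL, name) == NULL)
        return 0;   /* locale not available on this system */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program could call use_utf8_locale("en_US.UTF-8") at the start of main and stop with an error message if the result is 0. Passing "" instead of a name takes the locale from the environment.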
In Dutch we say “Een ezel stoot zich in ’t gemeen / geen twee keer aan dezelfde steen” (literally: a donkey does not, as a rule, bump into the same stone twice; I find the English equivalent “Once bitten, twice shy”, but that is literally quite different; an “ezel” is an ass, a donkey, or a stupid person). However, I did: I had already forgotten my own article, although I wrote it little over one year ago.
As a result, I got an end-of-file as soon as any non-ASCII character occurred in the text. I don’t know why, but that is what happened in the test.
This wide-character feature that I use in this little program is actually quite cool. Wide characters should not be confused with multibyte characters, because that is quite a different concept, although the two are related. Multibyte is what you have in a file, encoded for example in UTF-8.
UTF-8 multibyte is variable length. A character can be one byte long (pure 7-bit ASCII), or two bytes (the Latin script with diacritics, IPA, Greek, Cyrillic, Armenian, Hebrew, Arabic, etc.), three bytes (Samaritan, scripts of India, Thai and other Southeast Asian scripts, Chinese, Japanese, Korean, etc. etc.), or even four bytes (Linear B, lots of other historic scripts, musical notation, lots of special symbols, emoticons, etc. etc. etc.).
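The length rule can be made concrete with a small encoder. This is a sketch of the standard UTF-8 encoding scheme, not code from siworin-wordsep.c (which leaves this work to the C library); the name utf8_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>  /* size_t */

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is
   a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* up to U+07FF: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* up to U+FFFF: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* up to U+10FFFF: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+0041 (A) takes one byte, U+00E9 (é) two, U+4E2D (中) three, and U+10000, the first Linear B syllable, four.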
Once read from a file, multibyte characters are turned into wide characters, a data type wide enough to hold any Unicode character. Usually they are 32-bit integers, but don’t rely on that. The advantage of wide characters over multibyte characters is that wide characters have fixed length storage, so you can have arrays of them, count lengths, calculate a distance expressed in characters instead of bytes, let a pointer or index step through an array, etc.
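The difference between counting bytes and counting characters can be sketched with the standard function mbsrtowcs, which converts a multibyte string to wide characters according to the current locale. The helper name char_count is hypothetical. With a null destination, mbsrtowcs only counts.

```c
#include <assert.h>
#include <stddef.h>  /* size_t */
#include <string.h>  /* memset */
#include <wchar.h>   /* mbsrtowcs, mbstate_t */

/* Number of characters (not bytes) in the multibyte string s, as
   decoded under the current locale. Returns (size_t)-1 if s is not
   valid in that locale's encoding. */
size_t char_count(const char *s)
{
    mbstate_t state;
    const char *p = s;

    assert(s != NULL);
    memset(&state, 0, sizeof state);        /* initial conversion state */
    return mbsrtowcs(NULL, &p, 0, &state);  /* NULL destination: count only */
}
```

Once a UTF-8 locale has been set, a string holding just “é” gives 1 here, whereas strlen gives 2 for the same string.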
In fact wide characters behave very much like the traditional char type (with an int-like variant if a variable should be able to hold a distinct end-of-file value), but with many more possible character values. The equivalent is wchar_t, which also has a variant wint_t that is able to hold a unique end-of-file value that does not equal any valid character.
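The classic read loop carries over unchanged in shape. In this sketch, which uses a hypothetical function name and is not taken from siworin-wordsep.c, the result of getwc is stored in a wint_t so that it can be compared with WEOF.

```c
#include <assert.h>
#include <stdio.h>  /* FILE */
#include <wchar.h>  /* getwc, wint_t, WEOF */

/* Count the characters that can be read from a stream. The result
   of getwc is kept in a wint_t, so that WEOF stays distinct from
   every valid character. */
long count_stream_chars(FILE *fp)
{
    wint_t c;
    long n = 0;

    assert(fp != NULL);
    while ((c = getwc(fp)) != WEOF)
        n++;
    return n;
}
```

Note that getwc also returns WEOF when it meets bytes that are invalid in the current locale's encoding. Calling feof(fp) after the loop tells a real end of file apart from such a premature stop.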
Usually multibyte is used in files, and wide is used in memory. But either can be used in both.
Note that in my siworin-wordsep.c I wrote « […] » […] single quote is not an ASCII character. The L indicates it is a long type, that is, a wide character. But I also wrote the ASCII single quote as « […] » (see lines 87 and 138). I could have written […] to make this a wide character too, but that is not required, because the compiler will do the necessary conversion (in C terms, a cast) automatically where needed. Here the compiler knows what to do, because the variable […] is of type […] p is a pointer […]
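A small sketch of mixing the two kinds of literal in one comparison. This is a hypothetical function, not lines 87 and 138 of siworin-wordsep.c. It assumes, as is the case with glibc, that wchar_t values are Unicode code points.

```c
#include <assert.h>
#include <stddef.h>  /* size_t */
#include <wchar.h>   /* wchar_t */

/* Count apostrophes, curly or straight, in a wide string. */
size_t count_apostrophes(const wchar_t *s)
{
    const wchar_t *p;
    size_t n = 0;

    assert(s != NULL);
    for (p = s; *p != L'\0'; p++)
        if (*p == L'\u2019'   /* wide literal: RIGHT SINGLE QUOTATION MARK */
            || *p == '\'')    /* narrow literal, converted by the compiler */
            n++;
    return n;
}
```

Both comparisons compile without an explicit cast: the narrow literal '\'' is converted to a common integer type together with *p before the comparison is made.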
If you use the right types and the right library functions for wide characters, traditional algorithms and best practices known from working with simple one-byte characters can remain essentially the same. So they will also work for Unicode, and therefore for any script in the world that is, has ever been, or will ever be. Really great stuff.
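As an example of a traditional algorithm carried over, here is the familiar in-word/out-of-word state machine for counting words, written for wide characters. It is a simplified sketch with a hypothetical name, and it assumes that a word is a maximal run of alphabetic characters; the rules in siworin-wordsep.c may well differ.

```c
#include <assert.h>
#include <stddef.h>  /* size_t */
#include <wchar.h>   /* wchar_t, wint_t */
#include <wctype.h>  /* iswalpha */

/* Count the words in a wide string, where a word is taken to be a
   maximal run of alphabetic characters. */
size_t count_words(const wchar_t *s)
{
    size_t n = 0;
    int in_word = 0;

    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        if (iswalpha((wint_t)*s)) {
            if (!in_word)
                n++;          /* first letter of a new word */
            in_word = 1;
        } else {
            in_word = 0;
        }
    }
    return n;
}
```

Which characters iswalpha accepts depends on the locale that has been set, so with a UTF-8 locale the same loop also handles letters outside ASCII.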
Wide character programming is only necessary if you need to examine strings or texts in detail, for example if you need to know their lengths in characters (not bytes), if you are interested in whether a character is alphabetic, numeric, punctuation etc., or if in an editor-like program you need to move a cursor from one character to the next, etc.
If however you are dealing with strings, or lines or files of text without looking into them, just passing them on; or looking at them superficially based on parts that are guaranteed to be plain ASCII (e.g. finding a delimiter character), you can afford to use the old and traditional single-byte approach, even with texts that are in fact multibyte.
This is what I do in the sorting, combining (siworin-makelst.c), and searching phases of my ‘Simple word indexer’ algorithm.
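Why the single-byte approach is safe for an ASCII delimiter: in UTF-8 every byte of a multibyte sequence has its high bit set, so a byte with an ASCII value can never occur inside one. The sketch below uses a hypothetical helper and a tab as an example delimiter; it is not taken from siworin-makelst.c.

```c
#include <assert.h>
#include <stddef.h>  /* NULL */
#include <string.h>  /* strchr */

/* Return a pointer to the text after the first occurrence of delim
   in line, or NULL if delim does not occur. delim must be an ASCII
   character; the rest of the line may be any UTF-8 text. */
const char *field_after(const char *line, char delim)
{
    const char *p;

    assert(line != NULL);
    p = strchr(line, delim);
    return p != NULL ? p + 1 : NULL;
}
```

The bytes before and after the delimiter are passed on untouched, so no locale and no wide-character conversion is needed for this step.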
Combining the two approaches, using the same data, can
work correctly and conveniently, if done carefully.
More about the word extractor here.