30 July. Continued from the previous.
The program was inspired by, based upon, and simplified from earlier word extractors I wrote in 2011 and 2013. Those worked for ASCII and for single-byte extensions of it, such as ISO-8859-1 or Windows codepage 1252. This new program, however, expects UTF-8 in the files it examines. That is also the only encoding it supports; nothing else.
siworin-wordsep.c is here.
This made it necessary to replace a number of standard C types, library functions, and fprintf formats with their wide-character equivalents, as follows (typical pairs):

| for single-byte characters | for wide characters |
| char | wchar_t |
| int (can hold EOF) | wint_t (can hold WEOF) |
| getc, ungetc | getwc, ungetwc |
| isalpha, isspace, … | iswalpha, iswspace, … |
| %s, %c in fprintf formats | %ls, %lc |
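As a sketch of what such a replacement looks like in practice (the function and variable names here are mine, for illustration, not the actual siworin code), compare a byte-oriented and a wide-oriented version of the same little routine:

```c
#include <ctype.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Byte version: counts alphabetic bytes with the classic ctype.h call. */
size_t count_alpha_bytes(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* Wide version: same algorithm, but every type, literal, and library
   function swapped for its wide-character equivalent. */
size_t count_alpha_wide(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; s++)
        if (iswalpha((wint_t)*s))
            n++;
    return n;
}
```

The structure of the loop is untouched; only the types and function names change. For purely ASCII input, «count_alpha_bytes("word")» and «count_alpha_wide(L"word")» both return 4.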
You must set a locale, by a function call such as «setlocale(LC_ALL, "en_US.UTF-8")». I chose to set explicit American English and UTF-8, although that is also what my operating system and shell (bash) set by default. What really matters here is not the language, but the encoding: it has to be UTF-8, or else the program won’t work as intended. Omitting the call to setlocale is not an option.
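A minimal sketch of that initialization, with a fallback (the locale name en_US.UTF-8 is taken from the description above; this is an illustration, not the siworin source, which is linked above):

```c
#include <locale.h>

/* Returns 1 if a locale could be set, 0 otherwise.  "en_US.UTF-8"
   matches the description above (American English, UTF-8), but any
   *.UTF-8 locale would do: only the encoding matters here. */
int init_utf8_locale(void)
{
    if (setlocale(LC_ALL, "en_US.UTF-8") != NULL)
        return 1;
    /* Fall back to the environment's locale; the caller should still
       verify that it really is UTF-8. */
    return setlocale(LC_ALL, "") != NULL;
}
```

Calling something like this before any wide-character I/O is what makes getwc() decode UTF-8 correctly.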
In Dutch we say “Een ezel stoot zich in ’t gemeen / geen twee keer aan dezelfde steen” (literally: an ass does not, as a rule, bump into the same stone twice; the English equivalent I find is “Once bitten, twice shy”, but that is literally quite different; an “ezel” is an ass, a donkey, or a stupid person). I did, however: I had already forgotten my own article, although I wrote it little over a year ago.
As a result, though at first I did not understand why, the test produced an end-of-file as soon as any non-ASCII character occurred in the text.
This wide-character feature that I use in this little program is actually quite cool. Wide characters should not be confused with multibyte characters, because that is quite a different concept, although the two are related. Multibyte is what you have in a file, encoded for example in UTF-8.
UTF-8 multibyte is variable length. A character can be one byte long (pure 7-bits ASCII), or two bytes (the Latin script with diacritics, IPA, Greek, Cyrillic, Armenian, Hebrew, Arabic, etc.), three bytes (Samaritan, scripts of India, Thai and other Southeast Asian scripts, Chinese, Japanese, Korean, etc. etc.), or even four bytes (Linear B, lots of other historic scripts, musical notation, lots of special symbols, emoticons, etc. etc. etc.).
Once read from a file, multibyte characters are turned into wide characters, a data type wide enough to hold any Unicode character. Usually they are 32-bit integers, but don’t rely on that. The advantage of wide characters over multibyte characters is that wide characters have fixed-length storage, so you can have arrays of them, count lengths, calculate a distance expressed in characters instead of bytes, let a pointer or index step through an array, etc.
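For instance, under the assumption of a UTF-8 locale, the standard mbstowcs() converts a multibyte string to wide characters, after which lengths count characters rather than bytes (a sketch with an invented helper name, not siworin code):

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Length of a multibyte string in characters (not bytes), obtained
   by converting it to wide characters; returns (size_t)-1 on a bad
   or unsupported byte sequence. */
size_t length_in_chars(const char *mb)
{
    wchar_t buf[256];             /* assumes short inputs, for brevity */
    return mbstowcs(buf, mb, 256);
}
```

In a UTF-8 locale, «length_in_chars("ezel\xc3\xa9")» is 5 while «strlen()» of the same string is 6: the é occupies two bytes but is one character.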
In fact wide characters behave very much like the traditional char type (with an int-like variant for when a variable should be able to hold a distinct end-of-file value), but with many more possible character values. The equivalent is wchar_t, which also has a variant wint_t that is able to hold a unique end-of-file value, WEOF, which does not equal any valid character.
Usually multibyte is used in files, and wide characters in memory, but either can occur in either place.
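The classic read loop then carries over almost literally from getc(); the only subtlety is that the loop variable must be a wint_t, so that WEOF stays distinguishable from every valid character (again a sketch, assuming a UTF-8 locale has already been set):

```c
#include <stdio.h>
#include <wchar.h>

/* Count the characters on a wide-oriented stream.  c is a wint_t,
   not a wchar_t, for exactly the reason int is used with getc():
   it must be able to hold WEOF as well as every valid character. */
long count_stream_chars(FILE *fp)
{
    long n = 0;
    wint_t c;
    while ((c = getwc(fp)) != WEOF)
        n++;
    return n;
}
```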
Note that in my siworin-wordsep.c I wrote «L'’'»: that single quote is not an ASCII character. The L indicates it is a long type, that is, a wide character. But I also wrote the ASCII single quote as «'» (see lines 87 and 138). I could have written «L'\''» to make this a wide character too, but that is not required, because the compiler will do the necessary conversion (or in C terms: cast) automatically where needed. Here the compiler knows what to do, because the variable involved is of type «wchar_t»: «p» is a pointer to wide characters.
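A small illustration of that implicit conversion (hypothetical code, not siworin’s): an unprefixed ASCII character constant and an L-prefixed one behave identically when compared against a wchar_t:

```c
#include <wchar.h>

/* Recognize both the ASCII apostrophe and the typographic one.
   '\'' has type int in C, but the compiler converts it to the wide
   type for the comparison, so the L prefix is optional for
   characters that exist in ASCII. */
int is_apostrophe(wchar_t c)
{
    return c == '\'' || c == L'\x2019';   /* ASCII ' or U+2019 ’ */
}
```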
If you use the right types, and the right library functions for wide characters, traditional algorithms and best practices known from working with simple one-byte characters can remain essentially the same. So they will also work for Unicode, and thus for any script in the world that is, has ever been, or will ever be. Really great stuff.
Wide character programming is only necessary if you need to examine strings or texts in detail, for example if you need to know their lengths in characters (not bytes), if you are interested in whether a character is alphabetic, numeric, punctuation etc., or if in an editor-like program you need to move a cursor from one character to the next, etc.
If however you are dealing with strings, or lines or files of text without looking into them, just passing them on; or looking at them superficially based on parts that are guaranteed to be plain ASCII (e.g. finding a delimiter character), you can afford to use the old and traditional single-byte approach, even with texts that are in fact multibyte.
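The reason the byte-level approach stays safe with UTF-8 is that every byte of a multibyte sequence has its high bit set, so a search for an ASCII delimiter can never match inside one. A sketch (the colon-delimited layout here is invented for illustration):

```c
#include <string.h>

/* Return the text after the first ':' in a line, or NULL if there is
   none.  Safe on UTF-8 input even without any locale or wide-character
   machinery: no byte of a UTF-8 multibyte sequence equals ':'. */
const char *after_colon(const char *line)
{
    const char *p = strchr(line, ':');
    return p ? p + 1 : NULL;
}
```

For example, «after_colon("cl\xc3\xa9:value")» returns "value", even though the key contains a two-byte é.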
This is what I do in the sorting, combining (siworin-makelst.c), and searching phases of my ‘Simple word indexer’ algorithm.
Combining the two approaches, using the same data, can
work correctly and conveniently, if done carefully.
More about the word extractor here.