Simple word indexer (17)

Continued from a previous article.

Positioning to a character (4)

Not a severe bug?

In my bug report I wrote:

When doing an fseek to a position just after the start of a valid UTF-8 character, after that character itself has been read with fgetwc just before, fseek will get into an infinite loop.

So the endless loop in fseek happened on invalid after valid. Not likely to occur in real life. Now, however, I also get the loop on valid after invalid. The reverse situation: trying to recover from an ISO 8859-1 byte in a stream that was supposed to contain UTF-8.

The earlier situation happened in a test program that I wrote, trying to assess possible problems in the real application. That made the bug seem less severe. The new case in which I encountered the infinite loop, however, was when improving the real program, my local search engine, called Siworin. I tried to make it more inclusive, and more robust, in the presence of invalid characters.

I’ll explain how the new situation came about. Then I’ll talk about valid/invalid and invalid/valid, clarified by a new self-contained demonstration of the GNU glibc bug, siworin17.c.

Why handle non-UTF-8?

I designed my search engine for UTF-8 only. Almost all of my website is now in that encoding, with the exception of some historic material, not written by me, that I prefer to keep untouched and unchanged.

But Siworin is also more or less suitable for indexing and searching MBX mailbox files produced by Eudora. Eudora has a built-in search facility, which sequentially goes through all the mailboxes, or a selection of them. Even specifying that selection is slow. The searching itself is fast considering the size of my archive of over 20 years, yet one search can easily take minutes. That’s too long to be practical.

I’ve been using Eudora for a very long time, first version 1.4, then 2.2, now 7.1.0.9. But even that last one dates from 2006. There is no support for UTF-8, everything is still in ISO 8859-1. I looked at Mozilla Thunderbird several times, didn’t like it, and didn’t find an import facility for existing mailboxes. So I’ll keep using Eudora, now in wine under Linux Mint. I’ll also keep my e-mail archive.

Not only when indexing old mailboxes can invalid UTF-8 occur, but also when inadvertently handling JPG picture files. Or ZIP files. Or executables. Of course it makes no sense to do that, but if it just happens to happen, the word extractor and search engine should behave in a sensible way, and certainly not get into an endless loop.

My current solution: replace the offending character with a dash, which is deemed to be part of a word if it is somewhere in the middle, so many words with accented 8859 characters remain recognisable, and can be found.

Why first unnoticed?

In an earlier version of my word separator siworin-wordsep.c, I tested on function fgetwc returning WEOF (‘Wide character End Of File’), to terminate reading from the current file. So any time something accented in ISO 8859-1 occurred in an e-mail boxfile (not frequent in Dutch, but it does happen), the rest of that file was simply skipped, and word extraction continued with the next file.

I noticed because I got very few hits on a key word for which the built-in Eudora search did find many. I isolated the problem, and found the cause.

The return value WEOF of fgetwc is ambiguous: it can mean end-of-file, but also any kind of error, including a wide character conversion error under the current locale. To quote from the manual page:

If the end of stream is reached, or if ferror(stream) becomes true, it returns WEOF. If a wide-character conversion error occurs, it sets errno to EILSEQ and returns WEOF.
[...]
The fgetwc() function returns the next wide-character from the stream, or WEOF. In the event of an error, errno is set to indicate the cause.

Better fgetwc error handling

So instead of testing on return value WEOF, I tested feof(). At first, that was the only change I made. But that caused an infinite loop. In my own program, not in a standard library function, so under my own responsibility. I assumed that the stdio library, so also fgetwc, when encountering a single byte encoding that isn’t UTF-8, would consider that byte consumed, so the next reading attempt would continue after that byte.

But it isn’t so. And understandably so. I do know, in my specific situation and with the data I am using, that when it isn’t UTF-8, it’s probably ISO 8859-1 or -15 or Windows code page 1252. But a general C library cannot know that, and would not be correct in assuming it. That’s because if it isn’t the UTF-8 that according to the locale it should be, it could be anything, like UTF-16, BIG5 Chinese, or some Japanese encoding like JIS X 0212 or Shift JIS, etc. etc. So how many bytes to skip, how many to consider consumed? That’s undefined.

So when the library expects UTF-8, but sees an invalid byte, it considers that byte still unread. If you read again, you get the same result, over and over. An endless loop.

To solve this, you need to skip over the invalid character, or actually, over the invalid byte, to see if the next byte perhaps starts a valid character again. The invalid character should be treated as something harmless. In my program, I turn it into a dash, ‘-’, so the word I am trying to detect is supposed to continue.

I test on ferror(). If there is an error, but errno is not equal to EILSEQ, it must be some unknown error, and I terminate the program with a message.

Otherwise, the question is how to skip. First, I did an
fseek(fp, 1, SEEK_CUR);
But is that reliable? Is the current stream position, which SEEK_CUR is relative to, well defined after encountering an invalidly encoded character? Better:
fseek(fpi, beforeread + 1, SEEK_SET);
where beforeread is a byte position counter I maintain myself (to avoid a slow ftell()): beforeread is the position the stream was in just before the last call of fgetwc.

However, in both cases, fseek under Linux Mint 20.1 with GNU glibc 2.31 gets into an infinite loop. That is the bug I reported before. In the same situation under FreeBSD 12.2, this works well. This makes the bug more serious than I first thought, because it prevents me from doing what every responsible programmer should always do: make a program robust against invalid input, even against utter garbage.

After some pondering, I found what Germans call an Umgehung, a work-around: call fgetc to skip a byte. That however is a rather strange thing to do, as it means mixing wide-character mode and the traditional byte-oriented way of dealing with a stream. I quote from the manual page for function fwide (a function I do not use and do not need; but its description is nevertheless interesting):

When mode is zero, the fwide() function determines the current orientation of stream. It returns a positive value if stream is wide-character oriented, that is, if wide-character I/O is permitted but char I/O is disallowed. It returns a negative value if stream is byte oriented – that is, if char I/O is permitted but wide-character I/O is disallowed. It returns zero if stream has no orientation yet; in this case the next I/O operation might change the orientation (to byte oriented if it is a char I/O operation, or to wide-character oriented if it is a wide-character I/O operation).
    
Once a stream has an orientation, it cannot be changed and persists until the stream is closed.

So what I’m doing, char I/O on a stream that has become wide-character oriented when I did the first fgetwc, isn’t even allowed, and shouldn’t work. But it does. And what else can I do, if the better way causes a loop?

By the way, what I quoted is from Linux Mint; FreeBSD 12.2 does not describe this limitation, it only says:

If the orientation of stream has already been determined, fwide() leaves it unchanged.

See function HandleInvalidChar() in source file siworin-wordsep.c for the way I programmed it.

Self-contained bug demo

There is also a minimal self-contained demonstration program, similar to the previous two, numbers 14 and 15.

First, it creates its own input file, called in. It contains the Portuguese word for ‘blessing’, bênção. It has three non-ASCII characters in it, two of which (‘ç’ and ‘ã’) occur in succession. At some point I thought it made a difference whether there is just one such character, or two adjacent ones. But it doesn’t.

The first string I build the file from is in UTF-8. Then, encoding the bytes in octal, I also write it as ISO 8859-1, ISO 8859-15, or Windows code page CP1252. In this case it makes no difference which of the three exactly.

The file is closed, and opened in read mode. If the compiled program is run without a command line argument, it does the fgetc-trick, and it works correctly. If however there is a command line argument, e.g. ‘loop’, the program uses fseek, which loops indefinitely under Linux, but not under FreeBSD.

This should be fixed in glibc

Conclusion: this GNU glibc bug should really be repaired, urgently, because there should be a decent way to make a program, even if designed for UTF-8, robust against other encodings, and against any kind of garbage whatsoever. Endless loops are never acceptable, much less in a standard library function.

With the current state of affairs, making my program robust is only possible in a weird and unrecommended, possibly even illegal way.


To the next article on the same sub-subject.