Continued from a previous article.
In my bug report I wrote:
“When doing an fseek to a position just after the start of a valid UTF-8 character, after that character itself has been read with fgetwc just before, fseek will get into an infinite loop.”
So the endless loop in fseek happened on invalid after valid. Not likely to occur in real life. Now however, I also get the loop on valid after invalid. The reverse situation. Trying to recover from an ISO 8859-1 byte in a stream that was supposed to contain UTF-8.
The earlier situation happened in a test program that I wrote, trying to assess possible problems in the real application. That made the bug seem less severe. The new case in which I encountered the infinite loop, however, was when improving the real program, my local search engine, called Siworin. I tried to make it more inclusive, and more robust, in the presence of invalid characters.
I’ll explain how the new situation came about. Then I’ll talk about valid/invalid and invalid/valid, clarified by a new self-contained demonstration of the GNU glibc bug, siworin17.c.
I designed my search engine for UTF-8 only. Almost all of my website is now in that encoding, with the exception of some historic material, not written by me, that I prefer to keep untouched and unchanged.
But Siworin is also more or less suitable for indexing and searching the MBX mailbox files created by Eudora. Eudora has a built-in search facility, which sequentially goes through all the mailboxes, or a selection of them. Even specifying that selection is slow. The searching itself is fast considering the size of my archive of over 20 years, yet one search can easily take minutes. That’s too long to be practical.
I’ve been using Eudora for a very long time, first version 1.4, then 2.2, now 7.1.0.9. But even that last one dates from 2006. There is no support for UTF-8, everything is still in ISO 8859-1. I looked at Mozilla Thunderbird several times, didn’t like it, and didn’t find an import facility for existing mailboxes. So I’ll keep using Eudora, now in wine under Linux Mint. I’ll also keep my e-mail archive.
Not only when indexing old mailboxes can invalid UTF-8 occur, but also when inadvertently handling JPG picture files. Or ZIP files. Or executables. Of course it makes no sense to do that, but if it just happens to happen, the word extractor and search engine should behave in a sensible way, and certainly not get into an endless loop.
My current solution: replace the offending character with a dash, which is deemed to be part of a word if it is somewhere in the middle, so many words with accented 8859 characters remain recognisable, and can be found.
In an earlier version of my word separator siworin-wordsep.c, I tested whether function fgetwc returned WEOF (‘Wide character End Of File’), to terminate reading from the current file. So any time something accented in ISO 8859-1 occurred in an e-mail box file (not frequent in Dutch, but it does happen), the rest of that file was simply skipped, and word extraction continued with the next file.
I noticed because I got very few hits on a key word for which the built-in Eudora search did find many. I isolated the problem, and found the cause.
The return value WEOF of fgetwc is ambiguous: it can mean end-of-file, but also any kind of error, including a wide-character conversion error under the current locale. To quote from the manual page:
“If the end of stream is reached, or if ferror(stream) becomes true, it returns WEOF. If a wide-character conversion error occurs, it sets errno to EILSEQ and returns WEOF. [...] The fgetwc() function returns the next wide-character from the stream, or WEOF. In the event of an error, errno is set to indicate the cause.”
So instead of testing on the return value WEOF, I tested feof().
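In schematic form (a hedged sketch only, with a hypothetical handle() standing in for the real word extraction; this is not the literal Siworin code), the change looks like this:

    wint_t wc;

    /* Before: the loop stopped on WEOF, which also covers conversion
       errors, so a single ISO 8859-1 byte made the rest of the file
       be skipped. */
    while ((wc = fgetwc(fpi)) != WEOF)
        handle(wc);

    /* After: only a genuine end of file ends the loop. */
    for (;;)
    {
        wc = fgetwc(fpi);
        if (wc == WEOF)
        {
            if (feof(fpi))
                break;    /* real end of file */
            continue;     /* an error; dealt with as described below */
        }
        handle(wc);
    }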
At first, that was the only change I made. But that caused an infinite loop. In my own program, not in a standard library function, so under my own responsibility. I assumed that the stdio library, so also fgetwc, when encountering a single byte that isn’t valid UTF-8, would consider that byte consumed, so the next reading attempt would continue after that byte.
But it isn’t so. And understandably so. I do know, in my specific situation and with the data I am using, that when it isn’t UTF-8, it’s probably ISO 8859-1 or -15 or Windows code page 1252. But a general C library cannot know that, and would not be correct in assuming it. That’s because if it isn’t the UTF-8 that according to the locale it should be, it could be anything, like UTF-16, BIG5 Chinese, or some Japanese encoding like JIS X 0212 or Shift JIS, etc. etc. So how many bytes to skip, how many to consider consumed? That’s undefined.
So when the library expects UTF-8, but sees an invalid byte, it considers that byte still unread. If you read again, you get the same result, over and over. An endless loop.
To solve this, you need to skip over the invalid character, or actually, over the invalid byte, to see if the next byte perhaps contains a valid character again. The invalid character should be treated as something harmless. In my program, I turn it into a dash, ‘-’, so the word I am trying to detect is supposed to continue.
I test on ferror(). If there is an error, but errno is not equal to EILSEQ, it must be some unknown error, and I terminate the program with a message.
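In outline (again just a sketch; the message wording is invented for illustration, and strerror() needs <string.h>):

    if (ferror(fpi) && errno != EILSEQ)
    {
        /* Not a conversion error, so something unexpected: give up. */
        fprintf(stderr, "Read error: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    /* Otherwise errno == EILSEQ: an invalid byte under the UTF-8 locale,
       which has to be skipped. */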
Otherwise, the question is how to skip. First, I did an
fseek(fp, 1, SEEK_CUR);
But is that reliable? Is the current stream position, which SEEK_CUR is relative to, well defined after encountering an invalidly encoded character? Better:
fseek(fpi, beforeread + 1, SEEK_SET);
where beforeread is a byte position counter I maintain myself (to avoid a slow ftell()): beforeread is the position the stream was in just before the last call of fgetwc.
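Pieced together from these fragments (a reconstruction, not the actual Siworin code; wctomb() from <stdlib.h> and MB_LEN_MAX from <limits.h> are merely one conceivable way to count how many bytes a successfully read character occupied), the second attempt looks roughly like this in context:

    long beforeread = 0;    /* byte offset just before the next fgetwc() */
    char buf[MB_LEN_MAX];
    wint_t wc;

    for (;;)
    {
        wc = fgetwc(fpi);
        if (wc == WEOF)
        {
            if (feof(fpi))
                break;                  /* real end of file */
            if (ferror(fpi) && errno == EILSEQ)
            {
                clearerr(fpi);
                fseek(fpi, beforeread + 1, SEEK_SET);  /* skip the invalid byte */
                beforeread++;
                continue;
            }
            exit(EXIT_FAILURE);         /* unexpected error */
        }
        beforeread += wctomb(buf, (wchar_t) wc);  /* bytes this character took */
        /* ... normal word extraction with wc ... */
    }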
However, in both cases, fseek under Linux Mint 20.1 and GNU glibc 2.31 gets into an infinite loop. That is the bug I reported before. In the same situation under FreeBSD 12.2, this works well. This makes the bug more serious than I first thought, because it prevents me from doing what every responsible programmer should always do: make a program robust against invalid input, even against utter garbage.
After some pondering, I found what Germans call an Umgehung, a work-around: call fgetc to skip a byte. That however is a rather strange thing to do, as it means mixing wide-character mode and the traditional byte-oriented way of dealing with a stream. I quote from the manual page for function fwide (a function I do not use and do not need; but its description is nevertheless interesting):
“When mode is zero, the fwide() function determines the current orientation of stream. It returns a positive value if stream is wide-character oriented, that is, if wide-character I/O is permitted but char I/O is disallowed. It returns a negative value if stream is byte oriented – that is, if char I/O is permitted but wide-character I/O is disallowed. It returns zero if stream has no orientation yet; in this case the next I/O operation might change the orientation (to byte oriented if it is a char I/O operation, or to wide-character oriented if it is a wide-character I/O operation). Once a stream has an orientation, it cannot be changed and persists until the stream is closed.”
So what I’m doing, char I/O on a stream that has become wide-character oriented when I did the first fgetwc, isn’t even allowed, and shouldn’t work. But it does. And what else can I do, if the better way causes a loop?
By the way, what I quoted is from Linux Mint; FreeBSD 12.2 does not describe this limitation, it only says:
“If the orientation of stream has already been determined, fwide() leaves it unchanged.”
See function HandleInvalidChar() in source file siworin-wordsep.c for the way I programmed it.
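For readers who prefer not to open that file, the idea in rough outline (a hedged sketch only, not the literal HandleInvalidChar(); variable names follow the earlier fragments):

    /* After fgetwc() returned WEOF with ferror() set and errno == EILSEQ:
       consume exactly one byte with fgetc(), i.e. byte I/O on a stream
       that is already wide-character oriented. Strictly speaking that is
       not allowed, but in practice it works. */
    clearerr(fpi);       /* clear the error so reading can continue */
    (void) fgetc(fpi);   /* swallow the offending byte */
    wc = L'-';           /* treat the invalid byte as a dash, part of a word */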
There is also a minimal self-contained demonstration program, similar to the previous two, numbers 14 and 15. First, it creates its own input file, called in. It contains the Portuguese word for ‘blessing’, bênção. It has three non-ASCII characters in it, two of which (‘ç’ and ‘ã’) occur in succession. At some point I thought it made a difference whether there is just one such character, or two adjacent ones. But it doesn’t.
The first string I build the file from is in UTF-8. Then, encoding the bytes in octal, I also write it as ISO 8859-1, ISO 8859-15, or Windows code page 1252. In this case it makes no difference which of the three exactly.
The file is closed, and opened in read mode. If the compiled program is run without a command line argument, it does the fgetc trick, and it works correctly. If however there is a command line argument, e.g. ‘loop’, the program uses fseek, which loops indefinitely under Linux, but not under FreeBSD.
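Reconstructed from that description (so a sketch under those assumptions, not the published demonstration program; the locale name in setlocale() may need adjusting to a UTF-8 locale available on the system), such a program could look like this:

    #include <errno.h>
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int main (int argc, char **argv)
    {
        FILE *fp;
        wint_t wc;
        int use_fseek = (argc > 1);   /* any argument, e.g. "loop", selects fseek */

        setlocale(LC_CTYPE, "en_US.UTF-8");

        /* Create the input file "in": the word bênção, first in UTF-8, then
           with the bytes in octal as ISO 8859-1 (identical to ISO 8859-15
           and CP1252 for these characters). */
        fp = fopen("in", "w");
        if (fp == NULL) { perror("fopen"); return EXIT_FAILURE; }
        fprintf(fp, "b\303\252n\303\247\303\243o\n");   /* UTF-8 */
        fprintf(fp, "b\352n\347\343o\n");               /* ISO 8859-1 */
        fclose(fp);

        /* Reopen in read mode and read wide characters. */
        fp = fopen("in", "r");
        if (fp == NULL) { perror("fopen"); return EXIT_FAILURE; }

        for (;;)
        {
            long before = ftell(fp);   /* position just before this fgetwc() */
            wc = fgetwc(fp);

            if (wc == WEOF)
            {
                if (feof(fp))
                    break;                                /* real end of file */
                if (ferror(fp) && errno == EILSEQ)
                {
                    clearerr(fp);
                    if (use_fseek)
                        fseek(fp, before + 1, SEEK_SET);  /* loops under glibc 2.31 */
                    else
                        (void) fgetc(fp);                 /* the work-around */
                    putwchar(L'-');       /* the invalid byte becomes a dash */
                    continue;
                }
                perror("fgetwc");
                return EXIT_FAILURE;
            }
            putwchar(wc);
        }
        fclose(fp);
        return EXIT_SUCCESS;
    }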
Conclusion: this GNU glibc bug should really be repaired, urgently, because there should be a decent way to make a program, even if designed for UTF-8, robust against other encodings, and against any kind of garbage whatsoever. Endless loops are never acceptable, much less in a standard library function.
With the current state of affairs, making my program robust is only possible in a weird and unrecommended, possibly even illegal way.
To the next article on the same sub-subject.
Copyright © 2021 by R. Harmsen, all rights reserved.