20–. Continued from the previous.
As described in the previous episode, this is about pointing into a file that contains UTF-8 text data. From the known start of a word (which of course we know starts with a valid character), we want to go back some number of bytes. Because UTF-8 text can contain characters that are 1, 2, 3 or 4 bytes long, it cannot be known in advance whether that new position is also the start of a valid character. It might just as well point to the 2nd, 3rd or 4th byte of it. Not a problem, as this is detectable, by trying to read a character from that position.
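A minimal sketch of that detection, done on a byte buffer in memory rather than on a stream. The names utf8_len_at and utf8_align_back are made up for this illustration and do not come from siworin14.c. The check is deliberately simplified: it looks only at the lead byte and the continuation bytes, and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/*
 * Length in bytes (1..4) of the UTF-8 sequence that starts at buf[pos],
 * or 0 if no well-formed sequence starts there, for example because
 * buf[pos] is a continuation byte or the sequence is cut short.
 * Simplified: overlong forms and surrogates are not rejected.
 */
size_t utf8_len_at(const unsigned char *buf, size_t len, size_t pos)
{
    size_t need, i;
    unsigned char b;

    if (pos >= len)
        return 0;
    b = buf[pos];
    if (b < 0x80)
        need = 1;                       /* 0xxxxxxx: plain ASCII */
    else if ((b & 0xE0) == 0xC0)
        need = 2;                       /* 110xxxxx */
    else if ((b & 0xF0) == 0xE0)
        need = 3;                       /* 1110xxxx */
    else if ((b & 0xF8) == 0xF0)
        need = 4;                       /* 11110xxx */
    else
        return 0;                       /* 10xxxxxx or invalid lead byte */

    if (len - pos < need)
        return 0;                       /* sequence runs past the buffer */
    for (i = 1; i < need; i++)
        if ((buf[pos + i] & 0xC0) != 0x80)
            return 0;                   /* expected a continuation byte */
    return need;
}

/* Step back from pos, at most 3 bytes, until a sequence starts there. */
size_t utf8_align_back(const unsigned char *buf, size_t len, size_t pos)
{
    size_t steps = 0;

    while (pos > 0 && steps < 3 && utf8_len_at(buf, len, pos) == 0) {
        pos--;
        steps++;
    }
    return pos;
}
```

With the 23 bytes of the test string used later in this article, utf8_len_at gives 2 at byte 1 (the lead byte of ‘á’) and 0 at byte 2 (its second byte), and utf8_align_back moves from byte 15 back to byte 13, where the first Georgian letter starts.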
However, before being able to do so, in a test program I wrote to assess the behaviour, the library function fseek went into an infinite loop. This bug was one of my reasons for opting for a more traditional, byte-oriented approach.
The endless loop occurred under:
In the release notes to version 2.32 and version 2.33, I noticed descriptions that may be akin to this bug, so I was hoping 2.33 would not have it. In the release notes to version 2.34 (which I did not test; UPDATE: I did!) I see nothing similar, so chances are slim that the bug has been fixed in that one.
My test program siworin14.c is in the subdirectory src. It defines a string containing the Portuguese for ‘there is’, Há, with an accented a that takes 2 bytes in UTF-8. Then follows the Greek word for ‘yes’, ναι, 3 characters of 2 bytes each; then the Georgian word for ‘no’, არა (ara), 3 characters of 3 bytes each.
First I test character-based and string-based conversions from multibyte to wide character. This works well under all OSes and libraries.
Then I write the string to a disk file named in
. That
file is closed and reopened in two different streams: one in character mode (using the traditional functions) and one in wide-character mode (using the more modern wide-character functions).
The fact that the same file is opened in two streams simultaneously is not the cause of the problem. I know because I also tested it with only the wide-character stream. (See also the simpler program described in the next episode.)
The program does an fseek to every byte position, whether a valid UTF-8 character starts there or not. The infinite loop in fseek starts at an invalid sequence, although it shouldn’t.
Correct output, obtained from FreeBSD 12.2, which does not have the bug,
is as follows:
retval = 1, cw = 'H'
retval = 13, chars = Há, ναι, არა
retval = 2, cw = 'á'
retval = 12, chars = á, ναι, არა
retval = -1, cw = 'á'
retval = -1, chars = á, ναι, არა
retval = 1, cw = ','
retval = 11, chars = , ναι, არა
retval = 1, cw = ' '
retval = 10, chars = ναι, არა
retval = 2, cw = 'ν'
retval = 9, chars = ναι, არა
retval = -1, cw = 'ν'
retval = -1, chars = ναι, არა
retval = 2, cw = 'α'
retval = 8, chars = αι, არა
retval = -1, cw = 'α'
retval = -1, chars = αι, არა
retval = 2, cw = 'ι'
retval = 7, chars = ι, არა
retval = -1, cw = 'ι'
retval = -1, chars = ι, არა
retval = 1, cw = ','
retval = 6, chars = , არა
retval = 1, cw = ' '
retval = 5, chars = არა
retval = 3, cw = 'ა'
retval = 4, chars = არა
retval = -1, cw = 'ა'
retval = -1, chars = არა
retval = -1, cw = 'ა'
retval = -1, chars = არა
retval = 3, cw = 'რ'
retval = 3, chars = რა
retval = -1, cw = 'რ'
retval = -1, chars = რა
retval = -1, cw = 'რ'
retval = -1, chars = რა
retval = 3, cw = 'ა'
retval = 2, chars = ა
retval = -1, cw = 'ა'
retval = -1, chars = ა
retval = -1, cw = 'ა'
retval = -1, chars = ა
retval = 1, cw = '?'
retval = 1, chars =

Pos 0, char 48-H, wide char 00000048-H
Pos 1, char c3-?, wide char 000000e1-á
Line 69, error 86 Illegal byte sequence
Pos 2, char a1-?, wide char ffffffff-?
Pos 3, char 2c-,, wide char 0000002c-,
Pos 4, char 20- , wide char 00000020-
Pos 5, char ce-?, wide char 000003bd-ν
Line 69, error 86 Illegal byte sequence
Pos 6, char bd-?, wide char ffffffff-?
Pos 7, char ce-?, wide char 000003b1-α
Line 69, error 86 Illegal byte sequence
Pos 8, char b1-?, wide char ffffffff-?
Pos 9, char ce-?, wide char 000003b9-ι
Line 69, error 86 Illegal byte sequence
Pos 10, char b9-?, wide char ffffffff-?
Pos 11, char 2c-,, wide char 0000002c-,
Pos 12, char 20- , wide char 00000020-
Pos 13, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 14, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 15, char 90-?, wide char ffffffff-?
Pos 16, char e1-?, wide char 000010e0-რ
Line 69, error 86 Illegal byte sequence
Pos 17, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 18, char a0-?, wide char ffffffff-?
Pos 19, char e1-?, wide char 000010d0-ა
Line 69, error 86 Illegal byte sequence
Pos 20, char 83-?, wide char ffffffff-?
Line 69, error 86 Illegal byte sequence
Pos 21, char 90-?, wide char ffffffff-?
Pos 22, char 0a-?, wide char 0000000a-?
Under GNU libc 2.31 and 2.33, after dealing with the plain ASCII ‘H’ and the multibyte ‘á’, the program never reaches line 68 to try to read the invalid UTF-8 follow-up byte on its own. Already at the fseek at line 63, it starts looping without end. Fans start kicking in, for fear of an overheated processor chip.
GNU’s debugger gdb let me debug the program even into the library. That way I found that the loop happens in source file libio/wfileops.c, function adjust_wide_data starting at line 547, in the do while loop. Notable source lines that I kept seeing were libio/wfileops.c line 576, libio/iofwide.c line 189, and iconv/skeleton.c line 399.
I don’t really see why there should be a loop in the first place. And why is iconv involved? I can imagine that when reading or writing characters, when filling a buffer, conversions between multibyte and wide chars need to take place, and that is what iconv is for. But fseek is just positioning a file pointer to a byte position, as a preparation for future (often imminent) reads or writes.
Isn’t fseek(stream, offset, whence) always the same as lseek(fileno(stream), offset, whence)? Going from man 3 C library functions to man 2 system calls?
But I know it’s easy to say that without having fully studied, or even done, the implementation of the whole of stdio. Things that seem simple, and are simple, can still become complicated when considering all the details. I know that from experience.
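One such detail is buffering: stdio usually reads a whole block ahead, so the offset the kernel keeps for the file descriptor and the logical position of the stream need not agree, and fseek has to reconcile the two. The following sketch makes that difference visible. It is POSIX-only (fileno and lseek are not ISO C), it uses an ordinary byte-oriented stream, and compare_offsets is a hypothetical helper written for this illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and lseek() */
#include <stdio.h>
#include <unistd.h>

/*
 * Writes n bytes (n >= 1) to a temporary file, reads one byte back
 * through stdio, and reports both the stream position (ftell) and the
 * file descriptor position (lseek). Returns 0 on success, -1 on failure.
 */
int compare_offsets(const char *data, size_t n,
                    long *stdio_off, long *kernel_off)
{
    FILE *fp = tmpfile();
    int ok;

    if (fp == NULL)
        return -1;
    ok = fwrite(data, 1, n, fp) == n        /* fill the file */
      && fflush(fp) == 0
      && fseek(fp, 0L, SEEK_SET) == 0       /* back to the beginning */
      && fgetc(fp) != EOF;                  /* consume exactly one byte */
    if (ok) {
        *stdio_off  = ftell(fp);
        *kernel_off = (long)lseek(fileno(fp), 0, SEEK_CUR);
        ok = *stdio_off >= 0 && *kernel_off >= 0;
    }
    fclose(fp);
    return ok ? 0 : -1;
}
```

After one fgetc, ftell reports 1, while lseek typically reports however many bytes stdio pulled into its buffer, which for a small file is most likely the whole file. This says nothing about the loop itself, only that fseek has more bookkeeping to do than a bare lseek.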
To end this article, an observation about ftell/fseek versus fgetpos/fsetpos. The first two functions work with an exact byte position, which can be awkward in a modern environment that can deal with multibyte characters of variable length. But the functions should nevertheless work correctly.
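As an illustration of working with exact byte positions, here is a minimal sketch of a byte-oriented way to land on a character boundary with fseek and fgetc only. The function name seek_char_start is made up for this example, and it assumes the stream is byte-oriented and contains UTF-8 text, where continuation bytes always have the bit pattern 10xxxxxx.

```c
#include <stdio.h>

/*
 * Positions fp at the start of the UTF-8 character that contains byte
 * offset pos, by stepping back over continuation bytes (10xxxxxx).
 * Returns the adjusted offset, or -1 on an I/O error or when no
 * character start is found within 3 bytes before pos.
 * fp must be a byte-oriented stream.
 */
long seek_char_start(FILE *fp, long pos)
{
    int steps, c;

    for (steps = 0; steps <= 3 && pos >= 0; steps++, pos--) {
        if (fseek(fp, pos, SEEK_SET) != 0)
            return -1;
        c = fgetc(fp);
        if (c == EOF)
            return -1;
        if ((c & 0xC0) != 0x80)         /* ASCII or lead byte: a start */
            return fseek(fp, pos, SEEK_SET) == 0 ? pos : -1;
    }
    return -1;
}
```

The sketch only looks at one byte per step and does not validate the sequence that follows, so it relies on the file being well-formed UTF-8.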
The latter two functions work with a data type fpos_t, the internal contents of which are system-defined, so the programmer should not make assumptions about them. fsetpos should only be called with an fpos_t value validly obtained from an fgetpos call. That ensures that we’re always starting at the start of a multibyte character. It is preferable to work like that whenever possible. But that isn’t always possible, as in my simple word indexer example that this series of articles is about.
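A minimal sketch of the fgetpos/fsetpos pattern: record a position for every character start during a forward scan, and later return only to recorded positions. To stay independent of locale settings it uses a byte-oriented stream and recognises character starts by their bit pattern, which is an assumption of this example; with a wide-character stream one would call fgetpos before each fgetwc instead. The names record_char_starts and first_byte_at are hypothetical.

```c
#include <stdio.h>

/*
 * Scans a byte-oriented stream from its beginning and stores, for each
 * UTF-8 character start, the fpos_t that fgetpos returned just before
 * that byte was read. Returns the number of positions stored (at most
 * max), or -1 on error.
 */
int record_char_starts(FILE *fp, fpos_t *starts, int max)
{
    int n = 0, c;
    fpos_t here;

    rewind(fp);
    for (;;) {
        if (fgetpos(fp, &here) != 0)
            return -1;
        c = fgetc(fp);
        if (c == EOF)
            break;
        if ((c & 0xC0) != 0x80 && n < max)  /* ASCII or lead byte */
            starts[n++] = here;
    }
    return ferror(fp) ? -1 : n;
}

/* Returns to a recorded start and reads the first byte found there. */
int first_byte_at(FILE *fp, const fpos_t *start)
{
    if (fsetpos(fp, start) != 0)
        return EOF;
    return fgetc(fp);
}
```

Because every fpos_t handed to fsetpos came from fgetpos at a character start, the stream can never end up in the middle of a character this way. The price is that positions have to be collected while reading forward; they cannot be computed from a byte offset.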
Copyright © 2021 by R. Harmsen, all rights reserved.