12 and . Continued from the previous.
The purpose of the word extractor is to record each word that is found in an HTML file, together with its location, i.e., the byte count just before having read the first character of the word. Whether a character is the start of a new word, can be determined only after reading and examining it. Then it’s too late. The position must have been noted before the read.
Originally, I simply called ftell
before
reading each character. For single byte characters, that
is quite fast: ftell
simply returns the
value of an internal counter that keeps track of the
current byte position in the stdio
buffer.
When I switched
to wide characters, suddenly the program became a lot
slower, from seconds to a full minute, with the fan audibly
struggling to avert processor overheating. This happened
under Linux Mint and Ubuntu Server, but not under FreeBSD.
Different C library implementation? Different compiler?
GNU
versus clang
?
By excluding possible causes step by step, I found out
with certainty that the cause was really in ftell
.
I wrote a little
test program
that for every character repeatedly called ftell
one thousand times.
By calling gettimeofday
before and after,
with microsecond resolution, I could determine how long it
took.
I found that in every 4 kB of input (4096 bytes), the time
needed for one call of ftell
increases from
about 100 ns to over 4 µs! Up by a factor 40!
With the start of a new block, every time the time suddenly
fell back to the low value. A fragment of the output from
the test program:
c=<, pos=4090 dur 4155 ns c=/, pos=4091 dur 4164 ns c=a, pos=4092 dur 4205 ns c=>, pos=4093 dur 4157 ns c=?, pos=4094 dur 4158 ns c=<, pos=4095 dur 4160 ns c=b, pos=4096 dur 4160 ns c=r, pos=4097 dur 109 ns c=>, pos=4098 dur 102 ns c=?, pos=4099 dur 104 ns c=<, pos=4100 dur 104 ns c=a, pos=4101 dur 106 ns c= , pos=4102 dur 106 ns c=h, pos=4103 dur 107 ns c=r, pos=4104 dur 108 ns c=e, pos=4105 dur 110 ns c=f, pos=4106 dur 110 ns
My attempt to explain this, is that the GNU implementation
of stdio
, when dealing with wide characters,
maintains a character position counter in the buffer, but
not a byte position counter. Because on disk, character length
in bytes can vary between 1 and 4, that is not the same thing.
Now every time someone wants to know the byte position,
not the character position, by calling ftell
,
the implementation apparently loops through half the characters
in the buffer, on average, calculating and counting their
length in bytes.
I tried to verify this by looking at the code, but I didn’t manage to spot it. Now 4 microseconds is still a very short time, but when this is done for millions of bytes that are read from files, it certainly becomes significant.
My solution was to avoid calling ftell
(still
there as a comment, search: beforeread = ftell(fpi)
in the
source file), and replace it with
wctomb
(wide character to multibyte character
conversion), which as a bonus returns the length in bytes,
which I use to update my own byte counter. Much more efficient.
Search: beforeread += wctomb(s, c);
. That indeed
made the program
fast again.
Update 27 August 2021: test program published, bug reported.
Copyright © 2021 by R. Harmsen, all rights reserved.