19 and , 19 September 2021. Continued from the previous.
When finding a word, using the index that has been built up, of course it isn’t enough to show the word and the name of the file it occurs in. We also want to see context.
That should be easy. We know exactly where the word starts. What could be simpler than skipping a few tens of bytes back, and a few tens forward, and show those bytes on either side of the highlighted search word?
Well, there’s a little problem there. Nowadays, a byte is not the same as a character anymore. A character in UTF-8 – and I expect that that is the only encoding to stick around, in the near and remote future and everything in between – can be one byte long, namely if it is plain ASCII. Otherwise, it can be two bytes, for accented Latin, and Greek, Russian, Hebrew, Arabic; three bytes, roughly for the scripts of India and East Asia (Chinese, Japanese, Korean); or four bytes, for rather specialised, unusual and ancient characters.
That is the short story, details are here.
Two bytes are needed from scalar code point 0080 (8–11 bits to encode), three UTF-8 bytes from 0800 (12–16 bits) and four starting with Unicode scalar 10000 (17–20 bits, and a little beyond). A total of 1114112 code points in theory (but some are reserved or have special purposes).
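To make those thresholds concrete, here is a minimal sketch in C. The function names utf8_length and utf8_encode are made up for this illustration and are not part of Siworin. The sketch assumes the input is a Unicode scalar value and returns 0 for anything else (surrogates, or values above 10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for a Unicode
 * scalar value, or 0 if cp is a surrogate or lies above 10FFFF. */
size_t utf8_length(uint32_t cp)
{
    if (cp < 0x80)                    return 1;  /* plain ASCII          */
    if (cp < 0x800)                   return 2;  /* 8-11 bits            */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates: invalid  */
    if (cp < 0x10000)                 return 3;  /* 12-16 bits           */
    if (cp <= 0x10FFFF)               return 4;  /* 17 bits and beyond   */
    return 0;                                    /* outside Unicode      */
}

/* Hypothetical helper: encode cp into out, which must have room for
 * 4 bytes. Returns the number of bytes written, 0 on invalid input. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t n = utf8_length(cp);

    assert(out != NULL);
    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    case 4:
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        break;
    default:
        break;  /* nothing written for invalid input */
    }
    return n;
}
```

Every byte after the first is a follow byte of the form 10xxxxxx carrying six bits of the code point, which is why the thresholds sit at 7, 11, 16 and 21 bits.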
So if you just pick a byte position in a file, it could be the start of a character, but just as well the second, third or fourth byte of an incomplete sequence, which is therefore invalid and uninterpretable. Likewise, by ending the sample just somewhere, at some byte count, it could be that one or more required bytes of a valid sequence are missing. A real-life example, from my 2010 Portuguese article Terra Poderosa (Powerful Earth):
“�gua pode fazer com que a água penetre no furo e expulse o petróleo.”
This comes from the more complete “pressão da água pode fazer com que”, but only the second of the two bytes required to encode the Portuguese letter ‘a with acute accent’ (Unicode e1, UTF-8 encoding c3-a1, all in hexadecimal) was read from the file.
The browser does not and cannot know what to do with hex a1 by itself, which is invalid and does not mean anything. So the browser instead displays the special Unicode character hex fffd, the so-called replacement character, �, described as: “used to replace an incoming character whose value is unknown or unrepresentable in Unicode”.
The proper way to solve this is to deal with characters, not bytes. Multibyte characters, wide characters. I wrote about that before. However, when I wrote a little test program to find out what happens when you fseek to an invalid character position, and read from there, I found there is a problem. A bug in the library. I will describe that in detail in the next episode.
And even before that, I had decided not to do it that way. Instead, I used a quick and not so dirty solution, which works with bytes without requiring characters:
At the start of the buffer, replace any UTF-8 follow bytes with spaces. A valid UTF-8 character always starts with a byte whose initial number of 1-bits indicates the character’s total length in bytes. If there is no such byte, and it also isn’t plain ASCII, we could only be dealing with an incomplete character sequence. Recognition: UTF-8 follow bytes start with the bits 10.
At the end of the buffer, I replace any non-ASCII with null-bytes. That deals with the case that fewer follow bytes are present than are required. But it possibly sacrifices a valid non-ASCII character. Because the size of the context to be shown is rather arbitrary, in my opinion that is acceptable.
KISS, keep it simple. Implemented in function ClearInvalid in siworin-displ.c.
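The two steps above could be sketched roughly as follows. This is not the actual ClearInvalid from siworin-displ.c. The name clear_partial_utf8 and its signature are made up for illustration, and the sketch assumes that "the end of the buffer" means the last character only: its follow bytes and its lead byte, whether the sequence is complete or not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the real ClearInvalid: neutralise
 * incomplete UTF-8 sequences at both ends of a context buffer. */
void clear_partial_utf8(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    /* Start: follow bytes (bit pattern 10xxxxxx) that have lost
     * their lead byte are replaced by spaces. */
    for (i = 0; i < len && (buf[i] & 0xC0) == 0x80; i++)
        buf[i] = ' ';

    /* End: walk back over trailing follow bytes, replacing them
     * by null bytes ... */
    i = len;
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)
        buf[--i] = '\0';

    /* ... and do the same to the lead byte (11xxxxxx) in front of
     * them, so that at most one character is sacrificed. */
    if (i > 0 && (buf[i - 1] & 0xC0) == 0xC0)
        buf[--i] = '\0';
}
```

Called on the raw bytes of the sample before they are printed, this turns the stray a1 of the Portuguese example into a space. The price is that a perfectly valid accented letter at the very end is thrown away too.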
I could however also have used my own UTF tools, which can test whether a UTF-8 sequence is a valid character. But I didn’t want more dependencies. It can probably also be done using the multibyte functions in the C stdlib.
Update 19 September 2021: I kept it simple and made it even simpler: I now just clear any non-whitespace at the start and end of the buffer to be displayed. That removes any invalid UTF-8, and it ensures only whole words are shown.
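A minimal sketch of that simpler variant, again under assumptions: the name clear_partial_words is hypothetical, whitespace is whatever isspace reports in the default C locale, and the cleared bytes become spaces at the start and null bytes at the end. The real code may differ on all three points.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: wipe the (possibly partial) first and last
 * word of a context buffer, so that only whole words remain. */
void clear_partial_words(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);

    /* Start: blank every byte before the first whitespace byte. */
    for (i = 0; i < len && !isspace(buf[i]); i++)
        buf[i] = ' ';

    /* End: clear every byte after the last whitespace byte. */
    for (i = len; i > 0 && !isspace(buf[i - 1]); i--)
        buf[i - 1] = '\0';
}
```

Because no whitespace byte is ever part of a multibyte UTF-8 sequence, cutting at whitespace cannot split a character. A buffer without any whitespace at all is blanked completely.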
Copyright © 2021 by R. Harmsen, all rights reserved.