14 and , 19 September 2021. Continued from the previous.
The standard C library functions lfind
,
lsearch
, bsearch
, and
qsort
are similar in that they operate
on an array. The caller should specify the start of
the array, the number of elements, and the size of
an element.
In my case of a simple word indexer, I do not have an array in core (core is an old-fashioned expression for internal working memory, or RAM), but disk files instead. I don’t want to read any disk files into memory, I prefer to rely on automatic buffering or disk caching by the operating system.
I soon found that even then I can use the standard
bsearch
. As a base pointer I pass a character
pointer that I did declare, but that doesn’t point to an array.
nmemb
is the number of lines in each of my index
files. One file has a fixed number of bytes in each line
(namely the length of a hexadecimal offset, plus a newline),
so I can efficiently calculate the number of lines by asking
fstat
for the number of bytes in the file.
As a size
, I specify sizeof char
,
which is effectively 1. See the functions
siworin_handle_srchwrds
and
siworin_calcnumwords
in
siworin-srchind.c
.
Now isn’t that dangerous? Passing a pointer to a non-existent
array? The bsearch
function will repeatedly
calculate a pointer to an array element based on that, and pass
it to the specified function for comparison with the key: is the
key greater than the element, less than the element, or equal to
it? See function siworin_compar
in source file
siworin-srchind.c
.
Yes, that is dangerous and illegal if you ever dereference such a pointer. That would almost certainly cause a segmentation fault or a similar catastrophic error. Because the pointer points to an array element that doesn’t exist, an element for which no storage has ever been allocated.
But I never do that. Instead of dereferencing the pointer,
i.e. access the element, I use the pointer only to calculate
the difference with the base pointer. In other words, although
it is a pointer that is passed to the comparison function, I
use it as an index number. With that index number, I can access
my disk file (first siworin.wordoffs
, then
siworin.words
), and make the comparison between the
key and the word found on disk.
Strange though it may seem, I think this is safe, and portable, i.e. it works correctly regardless of processor architecture.
Because I allocate siworin_fakebase
as a global variable, not an automatic
stack-based variable, it is automatically initialised to all zero
bits, so the pointer is NULL. Is there any architecture, in which
you cannot add an integer value up to a few hundred thousands,
or even millions, to a NULL pointer to obtain something that could
be a valid pointer? It doesn’t point to validly allocated memory,
but that is a different matter.
In fact, bsearch
adds a proportion of size
to base
, and in the comparison function, I subtract
base
from that result, the reverse operation. How
could that go wrong? I don’t see it.
So yes, the standard bsearch
can be
used to search an array in memory, but also a disk file.
On second thought (15 August), a pointer being NULL may be seen as a special status, meaning the pointer is invalid for dereferencing. Then doing pointer arithmetic with NULL is rather strange, and potentially dangerous, even if in practice it usually isn’t.
To improve this, instead of a pointer I now declare a character
array, initialised with an empty string, which of course then
consists of one character, a null byte. The address of that array
is then passed to bsearch
, and used to calculate
the index in the comparison function siworin_compar
.
This cast pointer points to legitimately allocated storage, even if it has a size of just one byte. So adding an integer to that seems more sensible and more responsible than adding it to NULL.
Test: it works flawlessly. But that doesn’t prove that it always will, in any environment. Yet, it is my opinion that this is safe and portable.
Copyright © 2021 by R. Harmsen, all rights reserved.