Simple word indexer (12)

14 and 19 September 2021. Continued from the previous.

Binary search on disk

The standard C library functions lfind, lsearch, bsearch, and qsort are similar in that they all operate on an array: the caller specifies the start of the array, the number of elements, and the size of one element.

In the case of my simple word indexer, I do not have an array in core (core is an old-fashioned expression for internal working memory, or RAM), but disk files instead. I don’t want to read those disk files into memory; I prefer to rely on automatic buffering and disk caching by the operating system.

I soon found that even then I can use the standard bsearch. As the base pointer I pass a character pointer that I did declare, but that doesn’t point to an array. nmemb is the number of lines in each of my index files. One file has a fixed number of bytes in each line (namely the length of a hexadecimal offset, plus a newline), so I can efficiently calculate the number of lines by asking fstat for the number of bytes in the file and dividing by the line width.

As a size, I specify sizeof char, so effectively 1. See functions siworin_handle_srchwrds and siworin_calcnumwords in siworin-srchind.c.

Now isn’t that dangerous? Passing a pointer to a non-existent array? The bsearch function will repeatedly calculate a pointer to an array element based on that, and pass it to the specified function for comparison with the key: is the key greater than the element, less than the element, or equal to it? See function siworin_compar in source file siworin-srchind.c.

Yes, that is dangerous and illegal if you ever dereference such a pointer: that would almost certainly cause a segmentation fault or a similar catastrophic error, because the pointer points to an array element that doesn’t exist, an element for which no storage has ever been allocated.

But I never do that. Instead of dereferencing the pointer, i.e. accessing the element, I use the pointer only to calculate the difference with the base pointer. In other words, although it is a pointer that is passed to the comparison function, I use it as an index number. With that index number, I can access my disk files (first siworin.wordoffs, then siworin.words), and make the comparison between the key and the word found on disk.

Strange though it may seem, I think this is safe, and portable, i.e. it works correctly regardless of processor architecture.

Because I declare siworin_fakebase as a global variable, not an automatic stack-based variable, it is automatically initialised, so the pointer starts out as NULL. Is there any architecture on which you cannot add an integer value of up to a few hundred thousand, or even millions, to a NULL pointer to obtain something that could be a valid pointer? It doesn’t point to validly allocated memory, but that is a different matter.

In fact, bsearch adds a multiple of size to base, and in the comparison function, I subtract base from that result, the reverse operation. How could that go wrong? I don’t see it.

So yes, the standard bsearch can be used to search an array in memory, but also a disk file.

An array, not a pointer

On second thought (15 August), a pointer being NULL may be seen as a special status, meaning the pointer is invalid for dereferencing. Then doing pointer arithmetic with NULL is rather strange, and potentially dangerous, even if in practice it usually isn’t.

To improve this, instead of a pointer I now declare a character array, initialised with an empty string, which of course then consists of one character, a null byte. The address of that array is then passed to bsearch, and used to calculate the index in the comparison function siworin_compar.

This pointer points to legitimately allocated storage, even if it has a size of just one byte. So adding an integer to it seems more sensible and more responsible than adding one to NULL.

Test: it works flawlessly. But that doesn’t prove that it always will, in any environment. Still, in my opinion this is safe and portable.

Next article