Simple word indexer (11)

14 August 2021. Continued from the previous.

Combining

The program siworin-makelst.c takes as input sorted lines, each of which contains a word plus the place where it occurs: file and byte-offset. The program writes a file, called siworin.words by default, which has one line per unique word, followed by one or more locations. An example of what this looks like was already given in the previous article.

The byte offset of each line written is recorded in a second file (default name is siworin.wordoffs), which has a fixed-length line for each word. The only thing in that line is the hexadecimal offset of the start of the line in the other file. This is meant to ease and simplify the searching algorithm: thus the word list can be thought of as an array.

It makes no sense to look up English words like ‘the’, ’and’ and ‘that’, or a Dutch, Portuguese, Spanish, French or Interlingua word such as ‘de’. Instead of working with a stop list, the program siworin-makelst.c counts and limits the number of occurrences, assuming that the more often a word occurs, the less interesting it becomes to look for it.

The hard limit is #defined as SIWORIN_MAX_OCCURR in siworin.h, so it is compiled in. This should perhaps come from a configuration file. Or it could be calculated as a percentage of the total number of words.

Once it is determined that a word has more occurrences than the maximum allowed, the previous locations have already been written. To undo those writes, the file pointer must be put back to the start of the current line, so the data for the next word will overwrite it. That data may be shorter than the maximum, in which case a remnant of data for the word to be suppressed may still be there. For that reason a POSIX-defined ftruncate is done.

At first I did that after every suppressed entry, but that proved slow under FreeBSD, although not under Linux Mint and Ubuntu Server. To make it fast under any OS, I decided to do the ftruncate only once, at the end of the output file. This is in fact enough. Anything else must have been already overwritten by data for earlier words.