Continued from the previous article.
The program siworin-makelst.c takes sorted lines as input, each of
which contains a word plus the place where it occurs: file and byte
offset. The program writes a file, called siworin.words by default,
which has one line per unique word, followed by one or more locations.
An example of what this looks like was already given in the previous
article.
The byte offset of each line written is recorded in a second file
(siworin.wordoffs by default), which has a fixed-length line for each
word. The only thing in that line is the hexadecimal offset of the
start of the corresponding line in the other file. This is meant to
simplify the search algorithm: because every record has the same
length, the word list can be treated as an array, and the n-th word
can be reached with a single seek.
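The array idea can be sketched as follows. This is only an illustration, not the code from the siworin sources: the record length (8 hexadecimal digits plus a newline), the function names nth_word_line and find_word, and the assumption that each word line starts with the word followed by whitespace and is sorted in strcmp order are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: each offsets record is 8 hex digits plus '\n'. */
#define OFFS_RECLEN 9

/* Read the n-th line (0-based) of the words file into buf.
   Returns 0 on success, -1 on failure. */
static int nth_word_line(FILE *offs, FILE *words, long n,
                         char *buf, size_t bufsize)
{
    char rec[OFFS_RECLEN + 1];
    long pos;

    assert(offs != NULL && words != NULL && buf != NULL);

    /* Fixed-length records: record n starts at byte n * OFFS_RECLEN. */
    if (fseek(offs, n * OFFS_RECLEN, SEEK_SET) != 0)
        return -1;
    if (fread(rec, 1, OFFS_RECLEN, offs) != OFFS_RECLEN)
        return -1;
    rec[OFFS_RECLEN] = '\0';

    /* The record holds the hexadecimal offset into the words file. */
    pos = strtol(rec, NULL, 16);
    if (fseek(words, pos, SEEK_SET) != 0)
        return -1;
    return fgets(buf, (int)bufsize, words) ? 0 : -1;
}

/* Binary search for key among nwords lines.
   Returns the index of the matching line, or -1 if not found. */
static long find_word(FILE *offs, FILE *words, long nwords, const char *key)
{
    long lo = 0, hi = nwords - 1;
    char line[4096];

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int cmp;

        if (nth_word_line(offs, words, mid, line, sizeof line) != 0)
            return -1;
        line[strcspn(line, " \t\n")] = '\0';   /* keep only the word */
        cmp = strcmp(key, line);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;
}
```

With both files open for reading, find_word(offs, words, nwords, "banana") would return the index of the line for that word, and nth_word_line can then fetch the full line with its locations.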
It makes no sense to look up English words like ‘the’, ‘and’
and ‘that’, or a Dutch, Portuguese, Spanish, French or Interlingua
word such as ‘de’. Instead of working with a stop list, the
program siworin-makelst.c counts and limits the number of occurrences,
on the assumption that the more often a word occurs, the less
interesting it is to search for.
The hard limit is #defined as SIWORIN_MAX_OCCURR in siworin.h, so it
is compiled in. It should perhaps come from a configuration file
instead, or it could be calculated as a percentage of the total number
of words.
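The percentage idea could look roughly like this. It is a sketch of a possible variant, not existing siworin code: the function name max_occurrences is hypothetical, and the value 500 is only a placeholder used when siworin.h is not included, not the value from the real header.

```c
#include <assert.h>

/* Placeholder only; the real value is defined in siworin.h. */
#ifndef SIWORIN_MAX_OCCURR
#define SIWORIN_MAX_OCCURR 500
#endif

/* Derive the occurrence limit from the total number of words.
   Falls back to the compiled-in limit when the total is unknown,
   and never returns less than 1. */
static long max_occurrences(long total_words, double percent)
{
    long limit;

    assert(percent >= 0.0);

    if (total_words <= 0)
        return SIWORIN_MAX_OCCURR;
    limit = (long)(total_words * percent / 100.0);
    return limit < 1 ? 1 : limit;
}
```

The drawback of this variant is that the total must be known before the first word is written, which means an extra pass over the input or a count delivered by an earlier step.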
Once it is determined that a word has more occurrences than the
maximum allowed, its earlier locations have already been written.
To undo those writes, the file pointer must be put back to the start
of the current line, so that the data for the next word will overwrite
it. That data may be shorter than what was already written, in which
case a remnant of the data for the suppressed word may still be there.
For that reason a POSIX-defined ftruncate is done.
At first I did that after every suppressed entry, but that proved
slow under FreeBSD, although not under Linux Mint and Ubuntu Server.
To make it fast under any OS, I decided to do the ftruncate only once,
when the output file is complete. This is in fact enough: any remnant
before the final write position has already been overwritten by the
data for the words written after the suppressed one.
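A minimal sketch of the rewind-and-truncate-once technique follows. It is not the actual siworin-makelst.c: the function name make_word_list, the MAXWORD buffer size and the simplified input format (one "word location" pair per line, separated by whitespace) are assumptions, and the writing of the offsets file is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAXWORD 256   /* hypothetical maximum field length */

/* Read sorted "word location" lines from in and write one line per
   word to out, dropping words with more than max_occurr locations.
   Returns 0 on success, -1 on error. */
static int make_word_list(FILE *in, FILE *out, long max_occurr)
{
    char word[MAXWORD], loc[MAXWORD], current[MAXWORD] = "";
    long line_start = 0, count = 0;
    int suppressed = 1;   /* no line is open before the first word */

    assert(in != NULL && out != NULL);

    /* The field width 255 matches MAXWORD - 1. */
    while (fscanf(in, "%255s %255s", word, loc) == 2) {
        if (strcmp(word, current) != 0) {
            if (!suppressed)
                fputc('\n', out);      /* close the previous word's line */
            strcpy(current, word);
            line_start = ftell(out);   /* remember where this line starts */
            fputs(word, out);
            count = 0;
            suppressed = 0;
        }
        if (suppressed)
            continue;                  /* skip the remaining locations */
        if (++count > max_occurr) {
            /* Too frequent: rewind so the next word overwrites this line. */
            if (fseek(out, line_start, SEEK_SET) != 0)
                return -1;
            suppressed = 1;
        } else {
            fprintf(out, " %s", loc);
        }
    }
    if (!suppressed)
        fputc('\n', out);

    /* A single truncation at the final write position removes any
       remnant of a suppressed word that reached beyond it. */
    if (fflush(out) != 0)
        return -1;
    return ftruncate(fileno(out), ftell(out));
}
```

The fflush before ftruncate matters: ftruncate works on the file descriptor, so everything buffered by stdio has to be written out first.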
Copyright © 2021 by R. Harmsen, all rights reserved.