Simple word indexer (4)

Continued from the previous part.

My own

That night, the 2nd of July 2021, I was so fed up with the whole situation (local search engines that don’t install properly, don’t work properly, or work for a few years but not on all platforms, and that are not properly maintained) that I took the brave step: I decided to write one myself. It couldn’t be that hard, if I kept it simple.

I had some algorithmic ideas that I thought would make it quite feasible, and also fast and efficient.

Design criteria:

  1. Keep it simple.

  2. Support only Unicode UTF-8. No other encodings like ISO 8859-1 or Windows 1252.

  3. Do only minimal parsing for HTML and XHTML, so that what is inside < > is not included. No further interpretation.

  4. Don’t support entities such as &alpha; for a Greek α, or &uuml; for a German ü. Only one exception: &shy;, which I use for my hyphenation strategy – see this (in Dutch).

  5. Assume that the word index for the whole website will be completely rebuilt each time. No support for adding just one new HTML file, or reflecting the changes in an edited file.

  6. Support only static HTML files, no JavaScript, PHP, ASP, etc. Siworin collects its words from the inside, from files; not from the outside, from web pages.

  7. Make it fast enough, assuming modern hardware (up to 10 years old), including the Raspberry Pi. If a sequential search, such as one done by (e)grep, is fast enough, then that is fine, even though it is very inefficient.

  8. Support only Unix (Linux and FreeBSD). If it happens to work on Windows, or it is easily portable to Windows, that’s nice. But it’s not a design goal, and I won’t test it myself.

  9. Use internal working memory sparingly; do not assume gigabytes of it are available. It is better to use simple text files and rely on automatic disk caching.

  10. C (the old standard C89) is good enough for everything. No C++, no more modern extensions of C. Well, on closer inspection, I do use wide characters, so C95, and I never use the “dangerous C89 language features” that were officially removed with C99.
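To make criteria 3 and 4 concrete, here is a minimal sketch in C89 style of what such minimal parsing could look like. It is not the actual Siworin code: the function name strip_markup is made up for this illustration, and it rests on two assumptions of mine, namely that &shy; is simply dropped so that a hyphenated word is kept whole, and that a tag is removed without leaving a separator behind. All other bytes, including multi-byte UTF-8 sequences and unsupported entities such as &alpha;, are copied unchanged.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only (not taken from Siworin).
   Copies src to dst, skipping everything from '<' up to and including
   the next '>', and dropping the entity "&shy;". Other entities are
   not interpreted and are copied literally.
   dst must have room for at least strlen(src) + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t strip_markup(const char *src, char *dst)
{
    size_t n = 0;
    int in_tag = 0;

    while (*src != '\0') {
        if (in_tag) {
            /* Inside < >: ignore everything, leave the tag at '>'. */
            if (*src == '>')
                in_tag = 0;
            src++;
        } else if (*src == '<') {
            in_tag = 1;
            src++;
        } else if (strncmp(src, "&shy;", 5) == 0) {
            /* Assumption: the soft hyphen is dropped entirely. */
            src += 5;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called on the input <p>hyphen&shy;ation</p>, this leaves just the word hyphenation in dst. Because the function works on bytes and only ever compares against ASCII characters, it needs no knowledge of UTF-8: the bytes of a multi-byte character never coincide with '<', '>' or '&'.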

Now on to a description of the algorithm.