Automatic site index (word separator)

22 November 2011

Introduction

As promised, I’ll explain the word separation program I use to generate the keyword index for my website.

Below is the source code. Longer comments are in the next section, with clickable links to and from them.

Source code

/* Author: Ruud Harmsen.
   Copyright © 2011, all rights reserved.
*/
/* Input (stdin): a list of paths to files, e.g. HTML files.
   Output (stdout): words in the input, one word per line,
      all converted to lowercase.

   A word is a sequence of at least one alphabetic character,
   followed by at least one alpha or digit. (So there's a minimum
   length of 2). Single quote ' and hyphen are also included.
   "deroff -w" does this too, but it doesn't support ISO-8859-1
   accented characters. Moreover, it isn't available in all
   Unix versions, e.g. FreeBSD doesn't have it.
 */

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

static int ExtractWords (FILE *fpi, FILE *fpo);

int main (int argc, char **argv)
{
   char filepath[129];
   FILE *fp;

   while (fgets(filepath, sizeof filepath, stdin) == filepath)
   {
      char *p;

      // Note 1
      if ((p = strrchr(filepath, '\n')) != NULL)
         *p = '\0';
      if ((fp = fopen(filepath, "rt")) == NULL)
         continue;

      ExtractWords(fp, stdout);

      fclose(fp);
   }

   return 0;
}

static int ExtractWords (FILE *fpi, FILE *fpo)
{
   // Note 2
   int inword = 0;
   int incmmd = 0;
   int c;
   unsigned char wordbuf[81];
   unsigned char *w = wordbuf;

   while (c = getc(fpi), c != EOF)
   {
      // Note 2
      if (c == '<')
         incmmd = 1;
      else if (c == '>')
         incmmd = 0;

      // Note 2
      if (!incmmd && !inword && (isalpha(c) || (c >= 0xc0 && c <= 0xff)))
      {
         inword = 1;
         w = wordbuf;
      }
      // Note 3, note 7
      else if (inword &&
         !isalpha(c) && !isdigit(c) && c != '\'' &&
         c != '-' && !(c >= 0xc0 && c <= 0xff))
      {
         if (w > wordbuf + 1)
         {
            *w = '\0';

            // Note 4
            /* Make the whole word lowercase */
            for (w = wordbuf; *w; w++)
            {
               // Note 5
               /* tolower and toupper do work for ASCII, but not
                  for ISO-8859-1 accented letters. Flipping the third
                  bit from the left does.

                  // Note 6
                  Exception: the German sharp s is in the range for
                  uppercase and should not change, or it would
                  become a dotted y.
               */
               if (!(*w & 0x20) && *w != 0xDF)
                  *w = *w ^ 0x20;
            }
            fprintf(fpo, "%s\n", wordbuf);
         }
         inword = 0;
         w = wordbuf;
      }

      if (inword)
      {
         /* Keep one byte free for the terminating '\0'. */
         if ((size_t)(w - wordbuf) < sizeof wordbuf - 1)
            *w++ = c;
      }
   }
   return 0;
}

Some explanations:

Note 1: The line-reading function fgets also stores the newline character at the end of the line. Here, the lines are supposed to contain file paths. To use these for opening the files, the newline character must be removed.
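The newline removal can also be written as a small helper. This is only a sketch: the name chomp is hypothetical and not part of the program above. It uses strcspn, so a carriage return left over from a list made on Windows is cut off too, which the original code does not do.

```c
#include <string.h>

/* Hypothetical helper: cut the string at the first '\r' or '\n', so a
   line read by fgets can be passed to fopen as a path.
   Returns the length of the resulting string. */
size_t chomp(char *line)
{
    size_t len = strcspn(line, "\r\n");  /* index of first CR/LF, or of '\0' */

    line[len] = '\0';
    return len;
}
```

A line without any newline (the last line of a list, or a path longer than the buffer) is left unchanged.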
Note 2: Here I don’t do pattern matching or anything like that, but I use state variables. This may be an old-fashioned, error-prone, goto-like programming style, but in my experience it works well if you do it in a straightforward manner and you keep it simple.
Does object-oriented programming have anything to do with this? If so: I’m not very fond of it. I think it is often much ado about nothing. That was my opinion in 1990 (thread Are 'friends' really necessary ??), and I haven’t learnt anything new since then. ☺
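To show the state-variable style in isolation, here is a minimal sketch that only skips HTML tags. The function strip_tags and its parameters are hypothetical. It works on a string instead of a file, and its single state variable intag plays the role of incmmd in the program above.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, leaving out everything between
   '<' and '>' (both included). One state variable, intag, remembers
   whether we are inside a tag. dst has room for dstsize bytes; the
   result is always '\0'-terminated. Returns the number of characters
   written, not counting the '\0'. */
size_t strip_tags(const char *src, char *dst, size_t dstsize)
{
    int intag = 0;
    size_t n = 0;

    if (dstsize == 0)
        return 0;

    for (; *src != '\0'; src++)
    {
        if (*src == '<')
            intag = 1;                  /* a tag starts here */
        else if (*src == '>' && intag)
            intag = 0;                  /* the tag ends; skip the '>' too */
        else if (!intag && n < dstsize - 1)
            dst[n++] = *src;            /* ordinary text outside a tag */
    }
    dst[n] = '\0';
    return n;
}
```

Like the original, this knows nothing about comments, scripts or a '>' inside an attribute value. That is the price of keeping it simple.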
Note 3: I explicitly test for the range in ISO‑8859‑1 – hex C0 to FF – that holds the accented letters of all the European languages I ever write in.
It works. But now that I write this article I wonder if this is the right approach. I started these articles after discovering that Unix sort uses locale. Does that mean sort has been programmed explicitly to take that into account? Or is that done in the string comparison and ctype.h functions in the C library? Do á, ë, õ etc. fall in the category ‘alphabetic character’ for isalpha, if the appropriate locale has been activated?
The man page says they do. I should have tested and used that. Perhaps I will, later.
If it works, can it also be made to work under Windows?
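The question about isalpha and locale can be tried out with a few lines of C. The sketch below is not part of the program above. The helper isalpha_in_locale is hypothetical, and the locale name passed to it is an assumption in every case, because locale names differ between systems (locale -a lists the ones installed).

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: report whether isalpha() accepts byte c when
   LC_CTYPE is set to the locale called name.
   Returns 1 or 0, or -1 if the locale cannot be selected.
   The previous LC_CTYPE setting is restored before returning. */
int isalpha_in_locale(const char *name, unsigned char c)
{
    char saved[128];
    const char *cur = setlocale(LC_CTYPE, NULL);   /* query only */
    int result;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);      /* copy: later setlocale calls may overwrite it */

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;           /* this system doesn't have that locale */

    result = isalpha(c) != 0;
    setlocale(LC_CTYPE, saved);
    return result;
}
```

In the default "C" locale the call isalpha_in_locale("C", 0xE1) gives 0 for á. With an ISO-8859-1 locale, for example a name like "en_US.ISO8859-1" where that exists, the result would be expected to be 1, but that has not been verified here.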
Note 4: I don’t want to distinguish lowercase and uppercase in my keyword index. So I convert everything to lowercase here.
Note 5: Here too, perhaps I was giving myself a hard time without any need. Does locale cover this too? Automatically, without explicitly programming it? That would be great.
Note 6: A recently discovered bug: the only case where an uppercase character in the ISO‑8859‑1 range for accented letters, hex C0 to DF, does not correspond to the lowercase character in the range E0 to FF, is the German letter ß.
This is understandable, because it only exists as lowercase anyway. (So why is it in the range for uppercase then?) It never occurs at the beginning of a German word, so it never occurs at the beginning of a German noun, name or sentence either. These are the three possible reasons for using uppercase.
The only exception would be writing a word in all uppercase (example: writing groß as GROSS), but then, as you see, ß is replaced by SS.
Before I fixed that bug, I had some rather strange words in the index: ausschlieÿlich, auÿer, einigermaÿen, gröÿe, heiÿen, muÿ, regelmäÿig, unregelmäÿige, weiÿ. And by publishing this article, they will return, even with the bug fixed!
Contrary to popular belief, by the way, this strange character ‘dotted y’, ÿ, is NOT used in Dutch (which happens to be my native language, so I should know). It does occur, sporadically, in French geographic names (example: L'Haÿ-les-Roses) and surnames (Eugène Ysaÿe).
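The lowercase conversion, with the exception for ß, can be isolated in one function. This is a sketch under assumptions: the name latin1_tolower is hypothetical, and unlike the loop in the program above it tests explicit ranges instead of flipping every byte that has bit 0x20 clear. It also leaves hex D7 alone, the multiplication sign, which would otherwise turn into the division sign (hex F7). That case is an addition of this sketch.

```c
/* Hypothetical helper: lowercase one ISO-8859-1 byte.
   Uppercase letters sit exactly 0x20 below their lowercase forms,
   both in ASCII (A to Z) and in the accented range (hex C0 to DE).
   Left unchanged on purpose:
     0xDF, sharp s, already lowercase (0xFF would be dotted y);
     0xD7, multiplication sign, not a letter (0xF7 is the division sign). */
unsigned char latin1_tolower(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') ||
        (c >= 0xC0 && c <= 0xDE && c != 0xD7))
        return (unsigned char)(c | 0x20);
    return c;
}
```

In the lowercase loop, the line *w = *w ^ 0x20 and its test could then be replaced by *w = latin1_tolower(*w).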
Note 7: I also accept single quotes as the second or subsequent character of a word. Perhaps that was because some deroff man pages mention that. I can’t remember and I never thought of including ampersands.
It does make sense for English possessives (‘John's’), some Dutch plurals (‘pagina's’) and maybe phonetic transcriptions with the ' as a stress marker.
My implementation also treats things like ‘I'd’, ‘he'll’ and ‘it's’ as one word. It is questionable whether that is correct.
A bug in my implementation: if a word is enclosed in single quotes (example: 'word'), it will be extracted as word', so the first quote is removed but the second one isn't.
Inconsistency: curly quotes are not treated the same as single ASCII quotes, i.e. my implementation treats them as not belonging to a word at all.
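One possible fix for the 'word' bug is to cut off single quotes at the end of a word before printing it. This is a sketch of that idea, not what the program above does. The helper trim_trailing_quotes is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: remove single quotes from the end of a
   '\0'-terminated word, so that "word'" becomes "word" while
   "John's" stays as it is. Returns the new length. */
size_t trim_trailing_quotes(char *word)
{
    size_t len = strlen(word);

    while (len > 0 && word[len - 1] == '\'')
        word[--len] = '\0';
    return len;
}
```

In ExtractWords it would be called right after the terminating '\0' is written, with a cast because wordbuf is unsigned char. The minimum length of 2 would then have to be checked on the returned length, because a word like a' shrinks to a single letter.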

Addition 15 July 2021:
I now also publish an enhanced version of this program, dated 22 December 2013, with support for Esperanto HTML-entities. That version was made redundant when I started using Hyperestraier in October 2021, because from then on I could use its word index also for my own index page. That 2013 word separator will in turn be superseded by a new and simpler version that supports only UTF-8, as part of my project Simple indexer.