Automatic site index (generate index)

22–23 November 2011

Introduction

As promised, I'll explain the program genindex, which is used for generating a keyword index. Genindex reads the sorted list of unique words used in the web site and transforms it into a frameset with lists of clickable links, so that web site visitors can search for any keyword with a single click.

Below is the source code. Longer comments are in the next section, with clickable links to and from them.

Source code


/* Author: Ruud Harmsen. Copyright © 2011, all rights reserved. */

/*
   This program reads the list of words made by wordsep.c
   (specify file name as first command line argument) and turns
   them into hypertext links, which invoke CGI program s.cgi
   (short name because it has to be mentioned so often), which
   in turn will bring up a PicoSearch or Swish-e window with
   this search word prefilled.

   Genindex also divides the complete list into
   NumberOfIndexFiles alphabetic ranges, and balances the number
   of entries in them. It also generates a frame file which
   shows the ranges next to one another, and the PicoSearch
   window below.
*/

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define NumberOfIndexFiles 8

int main (int argc, char **argv)
{
   char wordbuf[81];
   FILE *fpn = NULL;
   FILE *fpi = NULL;
   FILE *fpo = NULL;
   int CurrentLetter = -1;
   long words_per_file;
   long words = 0;
   int numfiles = 0;
   int i;

   if (argc > 1)
   {
      fpi = fopen(argv[1], "rt");
   }
   else
   {
      fprintf(stderr, "Usage: %s file-with-words\n", argv[0]);
      return 1;
   }

   if (!fpi)
   {
      fprintf(stderr, "Cannot open %s for reading\n", argv[1]);
      return 2;
   }

   fpn = fopen("index.htm", "wt");
   if (!fpn)
   {
      fprintf(stderr, "Cannot open index.htm for writing\n");
      return 3;
   }

   fprintf(fpn, "<html><head><title>Site index</title></head>\n");
   fprintf(fpn, "<FRAMESET rows=\"70%%,30%%\" "
      "framespacing=0 frameborder=1 border=2>\n");
   fprintf(fpn, "\t<FRAMESET cols=\"");
   for (i = 0; i < NumberOfIndexFiles; i++)
   {
      fprintf(fpn, "%d%%,", (int)(0.5 + 100.0 / NumberOfIndexFiles));
   }

   /* Revoke that last comma */
   fseek(fpn, -1, SEEK_CUR);

   fprintf(fpn, "\" framespacing=0 frameborder=1 border=4>\n");

   // Note 1
   while (fgets(wordbuf, sizeof wordbuf, fpi) == wordbuf)
   {
      words++;
   }

   // Note 2
   words_per_file = (long)(.5 + words / (float)NumberOfIndexFiles);
   /* Correction because new file starts at new letter */
   words_per_file -= words / 26.0 / 2.5;

   fseek(fpi, 0L, SEEK_SET);

   words = words_per_file + 1;
   while (fgets(wordbuf, sizeof wordbuf, fpi) == wordbuf)
   {
      char *p;
      int c = wordbuf[0];

      if ((p = strrchr(wordbuf, '\n')) != NULL)
         *p = '\0';

      // Note 3
      if (words > words_per_file && numfiles < NumberOfIndexFiles)
      {
         // Note 4
         /* Detect first letter change, ignore accented letters */
         if (CurrentLetter == -1 ||
            (c >= 'a' && c <= 'z' && c != CurrentLetter))
         {
            char outname[20];

            numfiles++;
            if (fpo)
            {
               fprintf(fpo, "</body></html>\n");
               fclose(fpo);
            }

            // Note 4
            if (!(c >= 'a' && c <= 'z'))
               c = 'a';

            words -= words_per_file;
            sprintf(outname, "index-%c.htm", c);
            fpo = NULL;
            fpo = fopen(outname, "wt");
            if (!fpo)
            {
               fprintf(stderr, "Cannot open %s for writing\n",
                  outname);
               continue;
            }
            fprintf(fpo, "<html><head><title>Index %c</title>\n",
               toupper(c));
            fprintf(fpo,
               "<META NAME=\"ROBOTS\" "
                  "CONTENT=\"NOINDEX, NOFOLLOW\">\n"
               "<meta HTTP-EQUIV=\"Content-Type\" "
                  "CONTENT=\"text/html; charset=ISO-8859-1\">"
               "<link rel=stylesheet type=\"text/css\" "
                  "href=\"../cgi-bin/colschem.cgi?url=index.css\">"
               "</head><body>\n");
            fprintf(fpn,
               "\t\t<FRAME NAME=%s SCROLLING=auto SRC=%s>\n",
               outname, outname);
         }
      }
      else
      {
         words++;
      }

      fprintf(fpo, "<A TARGET=s "
         "HREF=\"../cgi-bin/s.cgi?%s\">%s</A><br>\n",
         wordbuf, wordbuf);

      // Note 4
      if (c >= 'a' && c <= 'z')
         CurrentLetter = c;
   }

   if (fpo)
      fclose(fpo);
   if (fpi)
      fclose(fpi);

   fprintf(fpn, "\t</FRAMESET>\n");
   fprintf(fpn, "\t<FRAME NAME=s SCROLLING=auto "
      "SRC=\"../cgi-bin/s.cgi?\">\n");
   fprintf(fpn, "</FRAMESET>\n</HTML>\n");
   if (fpn)
      fclose(fpn);

   return 0;
}

Some explanations

  1. First I read through the whole file that contains the sorted word list, just to count how many words there are. Not very efficient, but simple and clear.
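    For large word lists, this counting pass can be done more cheaply by reading fixed-size blocks and counting newline characters, instead of parsing lines with fgets. This is a sketch of my own, not code from genindex; the function name count_lines is an assumption. It gives the same result as the fgets loop as long as every word in the list ends in a newline:

    ```c
    #include <stdio.h>

    /* Count '\n' characters by reading the file in blocks. */
    long count_lines(FILE *fp)
    {
        char buf[4096];
        size_t n, i;
        long lines = 0;

        while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
            for (i = 0; i < n; i++)
                if (buf[i] == '\n')
                    lines++;
        return lines;
    }

    int main(void)
    {
        /* Demonstration on a temporary in-memory word list. */
        FILE *fp = tmpfile();

        if (!fp)
            return 1;
        fputs("alpha\nbeta\ngamma\n", fp);
        rewind(fp);
        printf("%ld\n", count_lines(fp));   /* prints 3 */
        fclose(fp);
        return 0;
    }
    ```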

  2. This calculation isn’t optimal. Neither is the whole algorithm that divides the words over the index files containing the search links. Sometimes the last file is much smaller than most of the others, sometimes much bigger.

    The total number of words is known in advance, and it has been decided beforehand how many HTML files will be generated. So it should be easy.

    A complication is that I want each file to start with the first entry for one of the 26 letters of the Latin alphabet. That means that as soon as I decide that one file is full enough, I can’t immediately stop writing to it and go on to the next file. I’ll have to go on with the current file, until a new alphabet letter comes along in the input file. So I always write more entries to a file than planned.

    The formulas in the program are an attempt to compensate for this, but as said, it doesn’t work very well.

    A better algorithm should count how many more lines were written than theoretically planned, and then take that into account for future decisions.

    It is similar to the algorithm that can be used for calculating amounts, involving multiplication and rounding, in the horizontal lines of a table, in such a way that the totals in the vertical direction are also guaranteed to be correct.

    It can be done, I’ve done it before. I’ll write about it later and then apply the principle here to achieve a better result.
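    The better algorithm described above can be sketched as follows. This is my own illustration of the principle, not code from genindex, and the numbers are invented for the example: instead of giving every file a fixed quota, compute the ideal cumulative total after each file, rounded to the nearest integer. Any excess written to one file then automatically shrinks the quota of the next, and the grand total always comes out exactly right:

    ```c
    #include <stdio.h>

    #define TOTAL 1003   /* total number of words (known in advance) */
    #define FILES 8      /* number of index files */

    int main(void)
    {
        int written = 0;   /* entries written so far, all files together */
        int f;

        for (f = 1; f <= FILES; f++)
        {
            /* Ideal cumulative total after file f, rounded to
               nearest. Rounding errors cannot accumulate, because
               each target is computed from the exact fraction. */
            int target = (int)(0.5 + (double)TOTAL * f / FILES);
            int size = target - written;

            written = target;
            printf("file %d gets %d entries, running total %d\n",
                   f, size, written);
        }
        /* The running total is guaranteed to end at TOTAL. */
        return 0;
    }
    ```

    The same cumulative-target trick handles the letter-boundary complication: when a file runs over its target because a new letter hasn’t come along yet, the next target is simply reached sooner.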

  3. The extra test on the condition
    numfiles < NumberOfIndexFiles
    serves as an emergency brake. Without it, due to the current imperfections of the division algorithm, it could happen that more files would be generated than NumberOfIndexFiles. Because the frameset in the generated index.htm has no entry for such an extra file, part of the keyword index would become unreachable.

    So now, with the emergency brake, if this situation occurs, the last file is simply made bigger than the previous ones, instead of generating an extra file.

  4. This handling works almost perfectly. The only known (and tested) flaw is the following situation:

    Suppose a new file starts with the keywords with initial letter i, and the previous one ends with those that start with an h. Then if no entries exist with i' or i- at the start, but there are entries like ì, í, î or ï, these will appear before ‘ia’, as a result of sorting with the locale en_US.ISO8859-1.

    (Because the site is multilingual, sorting for a particular language wouldn’t be right, so I chose American English, although when writing in English, my preference is British English.)

    This sort order, combined with the way the program considered here decides on the break to a new file, results in ì, í, î or ï NOT appearing before ‘ia’, but instead after ‘hz’ in the previous file.

    This is clearly incorrect. But because in practice the problem isn’t likely to occur (I needed quite some manipulation to make it testable), I decided not to fix this bug.
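    The flaw can be reproduced in isolation. The sketch below is my own test harness, not part of genindex; the word list and the budget of two entries per file are invented for the demonstration. It applies the same break rule as genindex: a new file may only start when the first byte changes to a different letter in the range a–z. An entry whose first byte is not in a–z, such as the accented ì (byte 0xEC in ISO 8859-1), can therefore never trigger the break, and stays in the previous file even when that file’s budget is already exceeded:

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* Sorted as by en_US.ISO8859-1: the single accented
           letter sorts with i, immediately before "ia". */
        const char *words[] = { "hy", "hz", "\xEC" /* ì */, "ia", "ib" };
        int n = sizeof words / sizeof words[0];
        int file = 1;
        int CurrentLetter = 'h';
        int budget = 2, in_file = 0;
        int i;

        for (i = 0; i < n; i++)
        {
            int c = (unsigned char)words[i][0];

            /* Same rule as genindex: only an unaccented a-z letter
               different from the current one may start a new file. */
            if (in_file >= budget && c >= 'a' && c <= 'z'
                && c != CurrentLetter)
            {
                file++;
                in_file = 0;
            }
            in_file++;
            if (c >= 'a' && c <= 'z')
                CurrentLetter = c;
            printf("file %d: %s\n", file, words[i]);
        }
        /* The entry ì lands in file 1, after "hz", although
           in sort order it belongs immediately before "ia". */
        return 0;
    }
    ```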

