Automatic site index (script)

The script

As promised, I'll reveal some details of how I automatically and periodically generate the keyword index for my website. I do that by running a shell script using crontab. The crontab file starts with

SHELL=/bin/sh
TZ=:Europe/Amsterdam
and the relevant entry reads:
47 2,11,15 * * * /usr/local/apache2/htdocs/tools/index/gen

So I use the most basic of Unix shells, the Bourne shell, and set a TZ variable so the times can be defined in local Western European time.

At minute 47 of the hours 2, 11 and 15 (i.e. 3 pm), I want the script executed.

The script is:


# Note 1
cd /usr/local/apache2/htdocs/tools/index/
# Note 2
mv words wordswas
# Note 3
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE

# Note 4
find ../.. -type f -a \
    \( -name '*.stm' -o -name '*.htm' -o -name '*.inc' \) |
    grep -v -f stoplist |
# Note 5
    ./wordsep > wordsun ; wc wordsun
# Note 6
sort -u < wordsun > words
# Note 7
./genindex words

# Note 8
rm ../../index/index-?.htm
mv index*.htm ../../index

# Note 9
diff wordswas words


Some explanations:

  1. I set the working directory to the directory where the script is and where supporting executables are. So there is no separation between programs and data here. Perhaps it is better to keep them apart. In this case it doesn't seem to matter much.

    The script and programs are in a directory under the document root. Normally, you wouldn’t do that. E.g. CGI scripts (i.e. programs and scripts that visitors can activate, directly or indirectly, from the website they visit) are in a cgi-bin directory that is not in the tree where the website documents (i.e. HTML files etc.) are.

    Although here these files are in the directory tree for the web pages, they still cannot be accessed from the outside, because the directory is only readable for its owner, not for its group or anybody else. A sketch of how that can be arranged follows after these notes.

  2. The mv command is connected with the diff command further on and will be explained when we arrive there.

  3. The LC_COLLATE variable is set to get the sorting right for character set ISO‑8859‑1. This was discussed before. A small demonstration of the effect follows after these notes.

  4. I use Unix’ standard utility find to create a list of the names of files from which I want to extract words. Of course, other sites, if they did something similar, might need different or extra file name extensions. E.g. I never call an HTML file *.html, but always just *.htm. But other people do use html extensions.

    The grep with the stoplist filters out some parts of the directory tree that are present but that I don’t want included in the resulting keyword index. A hypothetical example of such a stoplist follows after these notes.

  5. The wordsep program I wrote myself in C. I'll explain it in a separate web page. A rough approximation with standard utilities follows after these notes.

    I use an intermediate file so as to be able to count the rough result, the total number of words extracted from the site. It is currently almost 470,000 words or 2.8 million characters.

    This is different, of course, from the total number of different words – i.e., the number that will be included in the index – which at the moment is almost 39,000 words in about 360,000 characters.

    The same effect can be achieved, more elegantly, with a tee command, as follows:
    ./wordsep | tee wordsun | sort -u > words
    wc wordsun

    Writing the count to stdout in the shell script means that cron will e-mail that to the owner of the crontab, i.e., me.

    The total word count isn't really important of course, so when fed up with this info, I'll probably simplify this part of the script to just:
    ./wordsep | sort -u > words

  6. Unix sort (man page) with the ‘-u’ option sorts the input uniquely, i.e. every word occurs only once, each on a line of its own.

  7. I wrote genindex myself in C. I'll explain it in a separate web page. It creates several HTML files to be used in a frameset file that it also creates. A very rough sketch of the general idea follows after these notes.

  8. The old index files are removed, because in the new situation, the names may be different (although they are usually the same). Next, the resulting index files are moved to where the website visitors can see and use them.

  9. This last step is unnecessary, but funny enough to keep in. It compares the file with unique words created in the previous run, and kept under a different name (‘wordswas’), with the same file created in the current run.

    The output of the diff command (if any) is mailed to me as the owner of the crontab. So when I put a new web page (or several) online, a few hours later I receive an e-mail showing me the words used in it that I had never used before anywhere on the site.

    Any words no longer used anywhere are also indicated in that e-mail.

    In addition to being interesting and fascinating (in my view, anyway), it helped me catch typing errors a few times already, errors that I had missed when doing the usual (but sometimes inadvertently skipped) spelling check.
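
To round things off, here are the sketches referred to in the notes above. None of them show the real programs or files; they only illustrate the ideas.

For note 1: making a directory accessible to its owner only, so that the web server cannot serve anything from it, comes down to something like the following (the exact mode chosen may of course differ):

# rwx for the owner, nothing for the group or for others
chmod 700 /usr/local/apache2/htdocs/tools/index
# should now show drwx------
ls -ld /usr/local/apache2/htdocs/tools/index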
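
For note 3: a small demonstration of why the collating order matters, using a throw-away file and assuming an ISO‑8859‑1 locale is installed (the script above relies on one being available):

# \351 is the ISO-8859-1 byte for e-acute
printf 'echo\nzone\n\351cole\n' > /tmp/collate-demo
# C collation sorts by byte value: the accented word ends up after 'zone'
LC_COLLATE=C sort /tmp/collate-demo
# ISO-8859-1 collation should place it in among the other e-words
LC_COLLATE="en_US.ISO8859-1" sort /tmp/collate-demo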
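
For note 4: the real stoplist is not shown here, but a made-up one, just to give the idea, could look like this (written to a separate example file, stoplist.example). Each line is a pattern that grep -v -f matches against the path names produced by find, and any matching path is dropped:

# A hypothetical stoplist; the patterns are invented for this example
cat > stoplist.example <<'EOF'
/index/
/test/
copyright\.inc
EOF
# Quick check of the effect: the path under /test/ is filtered out
printf '../../dir/page.htm\n../../test/draft.htm\n' | grep -v -f stoplist.example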
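
For note 5: wordsep itself is a C program, to be described separately. Purely to give an impression, a very rough stand-in built from standard utilities could look like this; it reads file names on standard input, just as wordsep does in the pipeline above, and writes every run of letters on a line of its own. Accented letters, HTML tags and entities get no special treatment here.

# read each file named on stdin, split its contents into one word per line
while read f
do
    cat "$f"
done |
    tr -cs 'A-Za-z' '\n' |
    grep -v '^$'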
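
For note 7: genindex is also a program of my own in C. The index-?.htm names in note 8 show that the file names differ in just one character; suppose, purely for the sake of illustration, that there is one page per initial letter. A shell sketch of that splitting idea, leaving out the frameset file and everything else the real program does, could look like this:

# not the real genindex: one crude page per initial letter, filled from 'words'
for letter in a b c d e   # ...and so on through z
do
    {
        echo "<html><body><h1>$letter</h1><ul>"
        grep -i "^$letter" words | sed 's|.*|<li>&</li>|'
        echo '</ul></body></html>'
    } > "index-$letter.htm"
done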


23 November 2012

See also this update.