20 November 2011
SHELL=/bin/sh
TZ=:Europe/Amsterdam
and the relevant entry reads:
47 2,11,15 * * * /usr/local/apache2/htdocs/tools/index/gen
So I use the most basic of Unix shells, the Bourne shell, and set the TZ variable so that the times can be specified in local time for the Netherlands (Central European Time).
I want the script to run at 47 minutes past the hour, in hours 2, 11 and 15 (i.e. 2 am, 11 am and 3 pm).
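To make the fields concrete: the first field of the entry is the minute, the second is the hour, and the three asterisks stand for every day of the month, every month and every day of the week. The shell function below only illustrates that matching rule; it is not how cron itself is implemented, and the name matches_schedule is made up for this sketch.

```shell
# matches_schedule HOUR MINUTE
# Succeeds when the given local time matches the entry "47 2,11,15 * * *".
matches_schedule() {
    [ "$2" -eq 47 ] || return 1      # minute field: 47
    case "$1" in
        2|11|15) return 0 ;;         # hour field: 2,11,15
        *)       return 1 ;;
    esac
}

# Try a few sample times, given as "HOUR MINUTE".
for t in "2 47" "11 47" "15 47" "15 46" "14 47"; do
    set -- $t
    if matches_schedule "$1" "$2"; then
        echo "$1:$2 runs"
    else
        echo "$1:$2 skipped"
    fi
done
```

Only the first three sample times are reported as runs; the other two miss either the minute or the hour.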
The script is:
# Note 1
cd /usr/local/apache2/htdocs/tools/index/
# Note 2
mv words wordswas
# Note 3
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
# Note 4
find ../.. -type f -a \
   \( -name '*.stm' -o -name '*.htm' -o -name '*.inc' \) |
grep -v -f stoplist |
# Note 5
./wordsep > wordsun ; wc wordsun
# Note 6
sort -u < wordsun > words
# Note 7
./genindex words
# Note 8
rm ../../index/index-?.htm
mv index*.htm ../../index
# Note 9
diff wordswas words
I set the working directory to the directory where the script is and where supporting executables are. So there is no separation between programs and data here. Perhaps it is better to keep them apart. In this case it doesn't seem to matter much.
The script and programs are in a directory under the document root. Normally, you wouldn’t do that. E.g. cgi-scripts (i.e. programs and scripts that visitors can activate, directly or indirectly, from the website they visit) are in a cgi-bin directory that is not in the tree where the website documents (i.e. html files etc.) are.
Although these files are in the directory tree for the web pages, they still cannot be accessed from the outside, because the directory is readable only by its owner, not by its group or anybody else.
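A minimal sketch of such a directory, assuming mode 700 is what is meant (the exact mode is not stated here) and that the web server runs as a different user than the owner. The temporary directory is only a stand-in for the real tools directory.

```shell
# Hypothetical stand-in for the tools directory under the document root.
dir=$(mktemp -d)
chmod 700 "$dir"                 # rwx for the owner, nothing for group and others
ls -ld "$dir" | cut -c1-10       # first ten characters: file type and permission bits
rm -rf "$dir"
```

With these permission bits, any process that is neither the owner nor root cannot list or enter the directory, so the files in it cannot be served.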
I use Unix’ standard utility find to create a list of the names of files from which I want to extract words. Of course, other sites, if they did something similar, might need different or extra file name extensions. E.g. I never call an HTML file *.html, but always just *.htm. But other people do use html extensions.
The grep with the stoplist filters out some parts of the directory tree that are present but that I don’t want included in the resulting keyword index.
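The combination of find and grep can be tried out safely on a throwaway directory tree. The sketch below wraps the two commands in a function; the name list_pages, the sample files and the stoplist pattern are all made up for the illustration, and the real stoplist is not shown in this article.

```shell
# list_pages ROOT STOPLIST
# Print the candidate page files under ROOT, dropping every path that
# matches one of the patterns (one per line) in the file STOPLIST.
list_pages() {
    find "$1" -type f \
        \( -name '*.stm' -o -name '*.htm' -o -name '*.inc' \) |
    grep -v -f "$2"
}

# Small demonstration tree (hypothetical file names).
demo=$(mktemp -d)
mkdir -p "$demo/site/private"
touch "$demo/site/a.htm" "$demo/site/b.txt" "$demo/site/private/c.htm"
echo '/site/private/' > "$demo/stoplist"
list_pages "$demo/site" "$demo/stoplist"
rm -rf "$demo"
```

Only a.htm is listed: b.txt has the wrong extension, and c.htm is under a directory named in the stoplist.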
I use an intermediate file so as to be able to count the rough result, the total number of words extracted from the site. It is currently almost 470,000 words or 2.8 million characters.
This is different, of course, from the total number of different words – i.e., the number that will be included in the index – which at the moment is almost 39,000 words in about 360,000 characters.
The same effect can be achieved, more elegantly, with the tee command, as follows:
./wordsep | tee wordsun | sort -u > words
Writing the count to stdout in the shell script means that cron will e-mail that to the owner of the crontab, i.e., me.
The total word count isn't really important, of course, so when I get fed up with this information, I'll probably simplify this part of the script to just:
./wordsep | sort -u > words
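The difference between the total and the unique word count, and the tee variant of the pipeline, can be tried on a small input. In the sketch below, tr is only a crude stand-in for the ./wordsep program, whose actual behaviour is not shown in this article; the function name count_words is likewise made up.

```shell
# count_words: read text on stdin, print "TOTAL UNIQUE".
count_words() {
    work=$(mktemp -d) || return 1
    tr -cs '[:alpha:]' '\n' |          # stand-in for ./wordsep: one word per line
        tee "$work/wordsun" |          # keep the unsorted list so it can be counted
        LC_ALL=C sort -u > "$work/words"
    total=$(wc -l < "$work/wordsun")
    unique=$(wc -l < "$work/words")
    echo "$((total)) $((unique))"      # arithmetic expansion strips wc's padding
    rm -rf "$work"
}

printf 'the cat and the hat and the bat\n' | count_words    # prints "8 5"
```

The sample sentence has eight words, of which five are different, which is the same distinction as between the two totals mentioned above.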
The old index files are removed, because in the new situation, the names may be different (although they are usually the same). Next, the resulting index files are moved to where the web site visitors can see and use them.
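Why the old files have to go first can be seen in a small experiment. The function below is a sketch of the same two steps with the directories passed as arguments; publish_index and the file names in the demonstration are hypothetical, and rm -f is used instead of plain rm so that a missing old file is not reported as an error.

```shell
# publish_index SRC DEST
# Remove the old per-letter index pages from DEST, then move the new ones in.
publish_index() {
    rm -f "$2"/index-?.htm        # ? matches exactly one character
    mv "$1"/index*.htm "$2"/
}

# Demonstration with made-up file names.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$dest/index-a.htm" "$dest/index-z.htm"    # previous run
touch "$src/index-a.htm" "$src/index-b.htm"      # current run
publish_index "$src" "$dest"
ls "$dest"
rm -rf "$src" "$dest"
```

After the call, the stale index-z.htm is gone; without the rm step it would have stayed visible to visitors even though the current run no longer produces it.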
This last step is unnecessary, but fun enough to keep in. It compares the file of unique words created in the previous run, and kept under a different name (‘wordswas’), with the same file created in the current run.
The output of the diff command (if any) is mailed to me as the owner of the crontab. So when I put a new web page (or several) online, a few hours later I receive an e-mail showing me the words used in it that I had never used before anywhere on the site.
Any words no longer used anywhere are also indicated in that e-mail.
In addition to being interesting and fascinating (in my view, anyway), it has already helped me catch typing errors a few times, errors that I had missed when doing the usual (but sometimes inadvertently skipped) spelling check.
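For readers unfamiliar with diff output: lines starting with ">" exist only in the second file (new words), and lines starting with "<" exist only in the first (words no longer used). The sketch below turns that into a shorter summary; the function word_changes and the sample word lists are made up, and the script above simply runs plain diff.

```shell
# word_changes OLD NEW
# Summarise diff output as "+word" (new) and "-word" (no longer used).
word_changes() {
    diff "$1" "$2" | sed -n -e 's/^> /+/p' -e 's/^< /-/p'
}

# Demonstration with two tiny sorted word lists.
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/wordswas"
printf 'apple\ncherry\ndamson\n' > "$tmp/words"
word_changes "$tmp/wordswas" "$tmp/words"
rm -rf "$tmp"
```

For these lists the summary is "-banana" followed by "+damson". A misspelled word would show up as a "+" line on the first run after the page went online.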
See also this update.