Automatic site index (script – update)

23 November 2012

Meanwhile

The script I described here (over a year ago! time flies when you’re having fun) has changed a bit in the meantime. Also, I now run it only once every 24 hours, at about a quarter to three at night.

Here’s the new version:

Script

cd /usr/local/www/apache24/htdocs/tools/index/

mv words wordswas

LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE

find ../.. -type f -a \
    \( -name '*.htm' -o -name '*.html' -o -name '*.stm' -o -name '*.inc' \) |
# Note 1
    grep -v -f stoppath |
    ./wordsep |
    grep -v -x -f stoplist |
    sort > wordsraw

# Note 2
# Addition 26 October 2012: for frequency list
# Not necessary to do this every day!
#uniq -c wordsraw | sort -nr > wordfreq

# Note 3
uniq wordsraw > words
rm wordsraw
./genindex words

rm ../../index/index-?.htm
mv index*.htm ../../index

diff wordswas words

Notes

The former file ‘stoplist’ has been renamed to ‘stoppath’, because two lines further on, I now use a different file called ‘stoplist’. This new file contains 77 very common words (largely from Dutch and English, the languages I write most of my web articles in).
Ignoring such common words is of course quite a usual technique, but I had not applied it until recently (26 October 2012). The measure reduced the total number of words from the site (unsorted and including all duplicates) from 637710 to 386740!
The option -x for grep is necessary to prevent words such as ‘deal’ and ‘deactivate’ from being filtered out because of the common Dutch and Portuguese word ‘de’. Only whole-line matches (both files have one word per line) should count for filtering out (option -v), and that is what -x does. (At least it does under FreeBSD 8.2 and 8.3; I don’t know about other Unix flavours.)
I first tried -w (‘match whole words only’), but that wasn’t good enough: things like Al-Cercthe then no longer appeared in the index, because of the common Dutch word ‘al’ (meaning: yet, already).
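To make the difference between the three variants concrete, here is a small sketch. The word list and the two-word stoplist are made up for the illustration (the real stoplist has 77 entries), and ‘al-fresco’ merely stands in for any hyphenated word that starts with a stop word.

```shell
#!/bin/sh
# Sketch with made-up sample data; file names mirror the script above.
dir=$(mktemp -d) || exit 1

printf '%s\n' de al > "$dir/stoplist"
printf '%s\n' de deal deactivate al al-fresco script > "$dir/wordlist"

# No extra option: drops every line that merely contains a stop word.
plain=$(grep -v -f "$dir/stoplist" "$dir/wordlist")

# -w: drops lines in which a stop word occurs as a whole word,
# so the hyphenated 'al-fresco' is lost as well.
whole_word=$(grep -v -w -f "$dir/stoplist" "$dir/wordlist")

# -x: drops only lines that are exactly equal to a stop word.
whole_line=$(grep -v -x -f "$dir/stoplist" "$dir/wordlist")

rm -rf "$dir"

printf 'plain:\n%s\n\n-w:\n%s\n\n-x:\n%s\n' "$plain" "$whole_word" "$whole_line"
```

Only the -x variant keeps ‘deal’, ‘deactivate’ and the hyphenated word while still removing the bare stop words.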
I obtained the above-mentioned stoplist, with frequently occurring function words, by generating a frequency list from the file (wordsraw) that contains all the words. I later commented out that step, because I don’t need a fresh frequency list every day.
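As an illustration of that frequency step, the sketch below runs the same uniq -c | sort -nr combination on a tiny, made-up stand-in for wordsraw. The final head and awk step, which lists the top three words as stop-word candidates, is an assumption added for the example; the actual stoplist may well have been picked by hand from the frequency list.

```shell
#!/bin/sh
# Sketch with made-up data: a tiny, already sorted stand-in for wordsraw.
wordsraw=$(printf '%s\n' de de de het het index script the the the the)

# Count each distinct word (uniq -c needs sorted input), then sort
# numerically with the most frequent words first.
wordfreq=$(printf '%s\n' "$wordsraw" | uniq -c | sort -nr)

# Hypothetical extra step: keep only the words from the top 3 lines.
candidates=$(printf '%s\n' "$wordfreq" | head -n 3 | awk '{ print $2 }')

printf '%s\n' "$candidates"
```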
I could have optimised the script further by doing that uniq step earlier, inside the series of piped commands, without saving wordsraw to disk. Pipes are normally implemented using a small buffer in memory (its size depends on the system), so they are much more efficient than intermediate files.
The combined script line (in the existing series of piped commands), instead of:
(sort > wordsraw; uniq wordsraw > words)
would then be:
sort | uniq > words
or better still:
sort -u > words
But because the script runs only once every 24 hours, I left it as it is.
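For anyone who wants to check that the two shorter forms really give the same list, here is a minimal sketch. The input words are made up and stand in for the output of wordsep.

```shell
#!/bin/sh
# Sketch with made-up input standing in for the output of wordsep.
input=$(printf '%s\n' script index de script index script)

# Two steps through a pipe: sort, then collapse adjacent duplicates.
piped=$(printf '%s\n' "$input" | sort | uniq)

# One step: let sort discard the duplicates itself.
unique=$(printf '%s\n' "$input" | sort -u)

if [ "$piped" = "$unique" ]; then
    echo "same result"
fi
```

One thing to keep in mind: with either shorter form the file wordsraw is never written, so the commented-out frequency step, which reads that file, would need its own source of input.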