Automatic site index (script – update)

23 November 2012

Meanwhile

The script I described here (over a year ago! time flies when you’re having fun) has meanwhile changed a bit. Also, I now run it only once every 24 hours, at about a quarter to three at night.
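
Such a nightly run is a job for cron. A crontab entry along the following lines would do it (the script name mkindex.sh here is just an example, not the actual name):

    # run the site index script every night at a quarter to three (02:45)
    45 2 * * * /usr/local/www/apache24/htdocs/tools/index/mkindex.sh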

Here’s the new version:

Script


cd /usr/local/www/apache24/htdocs/tools/index/
mv words wordswas
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
find ../.. -type f -a \
   \( -name '*.htm' -o -name '*.html' -o -name '*.stm' -o -name '*.inc' \) |  # Note 1
   grep -v -f stoppath |
   ./wordsep |
   grep -v -x -f stoplist |
   sort > wordsraw                                                            # Note 2
# Addition 26 October 2012: for frequency list
# Not necessary to do this every day!
#uniq -c wordsraw | sort -nr > wordfreq                                       # Note 3
uniq wordsraw > words
rm wordsraw
./genindex words
rm ../../index/index-?.htm
mv index*.htm ../../index
diff wordswas words

Notes

  1. The former file ‘stoplist’ has been renamed to ‘stoppath’, because two lines further on, I now use a different file called ‘stoplist’. This new file contains 77 very common words (largely from Dutch and English, the languages I write most of my web articles in).

    Ignoring such common words is of course quite a usual technique, but I hadn’t done it until recently (26 October 2012). The measure reduced the total number of words from the site (unsorted and including all duplicates) from 637710 to 386740!

    The option -x for grep is necessary to prevent words like ‘deal’ and ‘deactivate’ from being filtered out because of the common Dutch and Portuguese word ‘de’. Only whole-line matches (between the one-word-per-line files) should be honoured for filtering out (option -v), and that is what -x does. (At least it does under FreeBSD 8.2 and 8.3; I don’t know about other Unix flavours.)

    I first tried -w (‘match whole words only’), but that wasn’t good enough: then things like Al-Cercthe no longer appeared in the index, because of the common Dutch word ‘al’ (meaning: yet, already). A small demonstration of the difference follows after these notes.

  2. The above-mentioned stoplist, with frequently occurring function words, I obtained by generating a frequency list from the file (wordsraw) that has all the words in it. I later commented that step out, because I don’t need a fresh frequency list every day. A possible way to derive stoplist candidates from such a frequency list is sketched after these notes.

  3. I could have optimised the script further by doing that uniq here, in an earlier step, without saving wordsraw to disk at all. Pipes are normally implemented with a small in-core buffer of a few kilobytes (the exact size depends on the Unix flavour), so they are much more efficient than intermediate files on disk.

    The combined script line (in the existing series of piped commands), instead of:
    (sort > wordsraw; uniq wordsraw > words)
    would then be:
    sort | uniq > words
    or better still:
    sort -u > words

    But because the script runs only once every 24 hours, I left it as it is.
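
Demonstrations

To make the difference between -w and -x from note 1 concrete, here is a small demonstration to try at the command line. The file name demostop and the lower-case words are made up for the example; only grep’s behaviour matters:

    printf 'de\nal\n' > demostop

    printf 'de\ndeal\nal\nal-cercthe\n' | grep -v -f demostop
    # prints nothing: without -w or -x the patterns match substrings,
    # so 'de' also removes 'deal', and 'al' removes 'al-cercthe'

    printf 'de\ndeal\nal\nal-cercthe\n' | grep -v -w -f demostop
    # prints only 'deal': -w still removes 'al-cercthe', because
    # 'al' matches as a whole word in front of the hyphen

    printf 'de\ndeal\nal\nal-cercthe\n' | grep -v -x -f demostop
    # prints 'deal' and 'al-cercthe': only exact whole-line matches are removed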
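
As to note 2: the script shows how the frequency list is made, but not how the 77 stoplist words were then selected from it. A mechanical way to get candidates would be the line below; the number 77 comes from note 1, and the file name stoplist.candidates is just an example:

    # wordfreq is already sorted by descending frequency;
    # take the top 77 lines and strip the counts that uniq -c prepended
    head -77 wordfreq | awk '{ print $2 }' > stoplist.candidates

The candidates would still need a check by hand, because a frequent word can nevertheless be a meaningful index term.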

