The script I described here (over a year ago! time flies when you’re having fun) has meanwhile changed a bit. Also, I now run it only once every 24 hours, at about a quarter to three at night.
Here’s the new version:
cd /usr/local/www/apache24/htdocs/tools/index/
mv words wordswas
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
find ../.. -type f -a \
  \( -name '*.htm' -o -name '*.html' -o -name '*.stm' -o -name '*.inc' \) | # Note 1
grep -v -f stoppath |
./wordsep |
grep -v -x -f stoplist | # Note 2
sort > wordsraw
# Addition 26 October 2012: for frequency list
# Not necessary to do this every day!
#uniq -c wordsraw | sort -nr > wordfreq # Note 3
uniq wordsraw > words
rm wordsraw
./genindex words
rm ../../index/index-?.htm
mv index*.htm ../../index
diff wordswas words
The former file ‘stoplist’ has been renamed to ‘stoppath’, because two lines further on, I now use a different file called ‘stoplist’. This new file contains 77 very common words (largely from Dutch and English, the languages I write most of my web articles in).
Ignoring such common words is of course quite a usual technique, but I didn’t do it until recently (26 October 2012). The measure reduced the total number of words from the site (unsorted and including all duplicates) from 637710 to 386740!
The -x option to grep is necessary to avoid that words like ‘deal’ and ‘deactivate’, for example, would be filtered out because of the common Dutch and Portuguese word ‘de’. Only whole-line matches (between the one-word-per-line files) should be honoured for filtering out, and that is what -x does. (At least it does under FreeBSD 8.2 and 8.3; I don’t know about other Unix flavours.)
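The difference is easy to demonstrate. Here is a minimal sketch, with a hypothetical one-entry stoplist containing only ‘de’ (the file name /tmp/stoplist-demo is of course made up for the example):

```shell
# Hypothetical one-entry stoplist:
printf 'de\n' > /tmp/stoplist-demo

# Without -x, 'de' matches as a substring, so 'deal' and
# 'deactivate' are wrongly removed and nothing is printed:
printf 'de\ndeal\ndeactivate\n' | grep -v -f /tmp/stoplist-demo

# With -x, only the whole line 'de' is removed:
printf 'de\ndeal\ndeactivate\n' | grep -v -x -f /tmp/stoplist-demo
# deal
# deactivate
```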
I first tried -w (‘match whole words only’), but that wasn’t good enough, because then some words still no longer appeared in the index, because of the common Dutch word ‘al’ (meaning: yet, already).
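The reason -w falls short is that it treats any non-word character as a word boundary, even inside a line. A sketch, using ‘al-dente’ as a made-up example of a word containing ‘al’ next to a hyphen:

```shell
printf 'al\n' > /tmp/stoplist-demo

# -w sees the '-' as a word boundary, so 'al-dente' is wrongly
# removed along with 'al'; only 'also' survives:
printf 'al\nal-dente\nalso\n' | grep -v -w -f /tmp/stoplist-demo
# also

# -x removes only the exact line 'al':
printf 'al\nal-dente\nalso\n' | grep -v -x -f /tmp/stoplist-demo
# al-dente
# also
```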
I compiled the new file stoplist, with frequently occurring function words, by generating a frequency list from the file (wordsraw) that has all the words in it. I commented out that step later, because I don’t need a fresh frequency list every day.
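That commented-out step is just uniq -c followed by a reverse numeric sort. A sketch on a tiny made-up sample instead of the real wordsraw file (the input must already be sorted, because uniq -c only counts adjacent duplicates):

```shell
# Count each word and list the most frequent first;
# prints each word preceded by its count:
printf 'al\nde\nde\nde\nword\nword\n' | uniq -c | sort -nr
```

On this sample the first output line is ‘de’ with a count of 3; the words with the highest counts are the candidates for the stoplist.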
I could have further optimised the
script, by doing that
uniq wordsraw > words
here, in an earlier step, without saving
wordsraw to disk. Pipes are implemented
using a small in-kernel buffer (a few kilobytes up to 64 kilobytes,
depending on the system), so they are much more efficient
than intermediate files.
The combined script line (in the existing series of
piped commands), instead of:
(sort > wordsraw; uniq wordsraw > words)
would then be:
sort | uniq > words
or better still:
sort -u > words
But because the script runs only once every 24 hours, I left it as it is.
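For completeness, the two spellings really are interchangeable; a quick check on a small made-up sample:

```shell
# Both pipelines produce the same sorted, de-duplicated output:
printf 'b\na\nb\na\nc\n' | sort | uniq
printf 'b\na\nb\na\nc\n' | sort -u
# both print:
# a
# b
# c
```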