The script I described here (over a year ago! time flies when you’re having fun) has meanwhile changed a bit. Also, I now run it only once every 24 hours, at about a quarter to three at night.
Here’s the new version:
cd /usr/local/www/apache24/htdocs/tools/index/
mv words wordswas
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
find ../.. -type f -a \
  \( -name '*.htm' -o -name '*.html' -o -name '*.stm' -o -name '*.inc' \) | # Note 1
  grep -v -f stoppath |
  ./wordsep |
  grep -v -x -f stoplist |
  sort > wordsraw # Note 2
# Addition 26 October 2012: for frequency list
# Not necessary to do this everyday!
#uniq -c wordsraw | sort -nr > wordfreq # Note 3
uniq wordsraw > words
rm wordsraw
./genindex words
rm ../../index/index-?.htm
mv index*.htm ../../index
diff wordswas words
The former file ‘stoplist’ has been renamed to ‘stoppath’, because two lines further on, I now use a different file called ‘stoplist’. This new file contains 77 very common words (largely from Dutch and English, the languages I write most of my web articles in).
Ignoring such common words is of course quite a usual technique, but I hadn’t done it until recently (26 October 2012). The measure reduced the total number of words from the site (unsorted and including all duplicates) from 637710 to 386740!
The option -x for grep is necessary to prevent words like ‘deal’ and ‘deactivate’ from being filtered out because of the common Dutch and Portuguese word ‘de’. Only whole-line matches (between the one-word-per-line files) should be honoured for filtering out (option -v), and that is what -x does. (At least it does under FreeBSD 8.2 and 8.3; I don’t know about other Unix flavours.)
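To make the difference concrete, here is a small sketch with a made-up two-word stop list and a made-up word list (neither is the real file from my site). It only assumes a grep that supports -v, -x and -f as used in the script above.

```shell
# Hypothetical miniature files, one word per line (not the real site data).
tmp=$(mktemp -d)
printf '%s\n' de the > "$tmp/stoplist"
printf '%s\n' de deal deactivate the theory script > "$tmp/words"

# Without -x: every line that merely CONTAINS a stop word is dropped.
without_x=$(grep -v -f "$tmp/stoplist" "$tmp/words" | paste -s -d ' ' -)

# With -x: only lines that ARE a stop word, in full, are dropped.
with_x=$(grep -v -x -f "$tmp/stoplist" "$tmp/words" | paste -s -d ' ' -)

echo "without -x: $without_x"
echo "with -x:    $with_x"
rm -r "$tmp"
```

Without -x only ‘script’ survives, because ‘deal’, ‘deactivate’ and ‘theory’ all contain a stop word somewhere; with -x they are all kept.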
I first tried -w (‘match whole words only’), but that wasn’t good enough: -w treats the hyphen as a word boundary, so things like Al-Cercthe no longer appeared in the index, because of the common Dutch word ‘al’ (meaning: yet, already).
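The same kind of sketch shows why -w falls short. The word list is again made up, and I assume here that the words reach grep in lower case (the script above does not use -i, so a capitalised ‘Al’ would not match the stop word ‘al’ at all).

```shell
# Hypothetical sample data; lower-case input is an assumption.
tmp=$(mktemp -d)
printf '%s\n' al > "$tmp/stoplist"
printf '%s\n' al al-cercthe almanac > "$tmp/words"

# -w: 'al' followed by a hyphen counts as a whole word, so the line is dropped.
with_w=$(grep -v -w -f "$tmp/stoplist" "$tmp/words" | paste -s -d ' ' -)

# -x: the whole line must equal 'al', so 'al-cercthe' is kept.
with_x=$(grep -v -x -f "$tmp/stoplist" "$tmp/words" | paste -s -d ' ' -)

echo "with -w: $with_w"
echo "with -x: $with_x"
rm -r "$tmp"
```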
I obtained the above-mentioned stoplist, with frequently occurring function words, by generating a frequency list from the file (wordsraw) that has all the words in it. I commented out that step later, because I don’t need a fresh frequency list every day.
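For illustration, this is roughly what the commented-out frequency step does, run on a tiny made-up word list instead of the real wordsraw. The file names mirror those in the script; the contents are invented.

```shell
# Hypothetical sample: a handful of words standing in for the real wordsraw.
tmp=$(mktemp -d)
printf '%s\n' de script de the de the | sort > "$tmp/wordsraw"

# Count each distinct word, then list the most frequent ones first.
uniq -c "$tmp/wordsraw" | sort -nr > "$tmp/wordfreq"

# The top line is the best candidate for the stop list.
top_count=$(head -n 1 "$tmp/wordfreq" | awk '{print $1}')
top_word=$(head -n 1 "$tmp/wordfreq" | awk '{print $2}')

echo "most frequent: $top_word ($top_count times)"
rm -r "$tmp"
```

Note that uniq -c only counts adjacent duplicates, which is why the input has to be sorted first, as wordsraw is in the script.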
I could have further optimised the script by doing the uniq step (uniq wordsraw > words) earlier, inside the pipeline, without saving wordsraw to disk. Pipes are normally implemented using a small buffer in core (a few kilobytes, depending on the system), so they are much more efficient than intermediate files.
The combined script line (in the existing series of piped commands), instead of:
(sort > wordsraw; uniq wordsraw > words)
would then be:
sort | uniq > words
or better still:
sort -u > words
But because the script runs only once every 24 hours, I left it as it is.
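As a quick sanity check, the three variants can be compared on a small made-up input. This is only a sketch with invented words; it assumes plain lower-case ASCII input, where the collation order used by sort and the comparison done by uniq agree.

```shell
# Hypothetical sample input, unsorted and with duplicates.
tmp=$(mktemp -d)
printf '%s\n' script deal script theory deal > "$tmp/input"

# Variant 1: as in the script, via an intermediate file on disk.
sort "$tmp/input" > "$tmp/wordsraw"; uniq "$tmp/wordsraw" > "$tmp/words_a"

# Variant 2: the same two commands joined by a pipe.
sort "$tmp/input" | uniq > "$tmp/words_b"

# Variant 3: let sort remove the duplicates itself.
sort -u "$tmp/input" > "$tmp/words_c"

# cmp -s is silent and only reports through its exit status.
if cmp -s "$tmp/words_a" "$tmp/words_b" && cmp -s "$tmp/words_b" "$tmp/words_c"
then same=yes
else same=no
fi
result=$(paste -s -d ' ' - < "$tmp/words_c")

echo "identical output: $same ($result)"
rm -r "$tmp"
```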