Automatic site index

Up-to-date

From now on the site index will be generated automatically several times a day (3 times, for now). I no longer need to start it by hand.

The index is now generated directly on the web server instead of on my local copy of the content files. Of course I use crontab for that.
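
As an illustration only, a user crontab entry for three runs a day could look like the fragment below. The hours and the script path are assumptions made up for this sketch, not the values actually used on this site.

```
# min  hour     day  month  weekday  command
# Hypothetical schedule: run at 06:00, 14:00 and 22:00 every day.
0      6,14,22  *    *      *        /home/me/bin/make-index.sh
```

Such an entry would be installed with `crontab -e`.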

This improvement means that new words that occur in updated or new articles will appear in the index in a matter of hours.

ISO-8859

The words are now correctly sorted according to the collating order for ISO-8859-1, which is the predominant character set on my site. Previously, the accented letters (ë, é, ê, á, ã, õ, ç, Ç, ö, ü, Ö, Ü, etc.) appeared after the z, because the list was sorted in plain ASCII (byte value) order. The only workaround was to move an intermediate working file from my local copy of the site to the Unix (FreeBSD) server, sort it properly there, and move it back to continue the generation process.

That workaround did the job because Unix supports locales and can sort properly, taking the collating sequence of the character set into account. Windows can probably do that too, but I don’t know how and frankly, I do not want to know. For text processing, good old Unix is better suited. To my taste anyway.

For example, the English word exit, Portuguese êxito (= success) and Catalan èxit – which for some strange reason is mentioned somewhere on my site – now appear near each other in the index list. Words beginning with a German ü are sorted together with words starting with u, in German or other languages.

Now that everything is done on the server, which runs FreeBSD, no manual intervention is required any more to get it right.

Further reading

Details about setting the collating sequence for sorting text are given here. What I do in my Bourne shell script is:
LC_COLLATE="en_US.ISO8859-1" ; export LC_COLLATE
before running
sort -u
on a list of words that I generated from my website’s content.
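
To see the difference between the two orderings, a minimal sketch along these lines may help. It is not the script from this site: the function name `sort_words` and the file name `words.txt` are made up for the example, and it assumes the locale `en_US.ISO8859-1` is installed (`locale -a` lists the names available on a given system) and that the word list is encoded in ISO-8859-1.

```sh
#!/bin/sh
# Hypothetical helper, for illustration only.
# $1 = collation locale, $2 = file with one word per line.
sort_words() {
    (
        # LC_ALL, when set, overrides LC_COLLATE, so clear it first.
        unset LC_ALL
        LC_COLLATE="$1"
        export LC_COLLATE
        # -u drops duplicate lines while sorting.
        sort -u "$2"
    )
}

# Example calls (words.txt is an assumed file name):
#   sort_words C words.txt                 # byte order: accented letters after z
#   sort_words en_US.ISO8859-1 words.txt   # locale-aware collating order
```

With `C` as the locale, words are compared byte by byte, so a word starting with ê (byte value 234 in ISO-8859-1) lands after every word starting with z. The subshell keeps the changed locale settings from leaking into the rest of the script.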

Source files

See script.htm. There you’ll find the shell script and two C sources, along with some explanations.