6 and 7 March 2013, my own translation of the Dutch original
The idea is to arouse interest. Probably to no avail. But I have enjoyed it, the programming, marking the passages and seeing them appear. Let me do my thing; as long as it doesn’t cause any inconvenience to others, it’s all right.
The function has three separate steps:
I mark the passages by hand. Later on in older articles, or right away, or later as well, in newer ones. In the HTML, before the text fragment to be marked, I insert a comment (meaning, obviously, it is between <!-- and -->) containing a recognisable unique string, and a slightly different one after the text fragment.
A program written in C, run once every 24 hours through crontab, or on demand if I deem that amusing or necessary, searches all the HTML files of the whole web site – path names are provided by find followed by some filtering. The program notes the language code (example: from <html lang=en> in case the web page is in English), the title (between <h1> and </h1>, although I could also have used <title>), and any text between the special marking comments.
Those text fragments, each with a prepared hyperlink underneath, are collected into HTML files, one per language. These are not directly accessible for web site visitors, but only through the selection program. More on that in the next section.
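What the collection step extracts from each page – language code, title, and the marked fragments – can be sketched in shell with sed. This is only an illustration, not the real C program, and the marker comment strings here are invented; the real ones differ:

```shell
# Build a miniature page like the ones the collection step reads.
# SHOWCASE-BEGIN / SHOWCASE-END are hypothetical marker names.
cat > page.html <<'EOF'
<html lang=en>
<h1>Example title</h1>
<p>Some ordinary text.</p>
<!-- SHOWCASE-BEGIN -->A marked passage.<!-- SHOWCASE-END -->
EOF
sed -n 's/.*<html lang=\([a-z]*\)>.*/\1/p' page.html    # language code
sed -n 's/.*<h1>\(.*\)<\/h1>.*/\1/p' page.html          # title
sed -n 's/.*<!-- SHOWCASE-BEGIN -->\(.*\)<!-- SHOWCASE-END -->.*/\1/p' page.html
```

The three sed commands print, in turn, the language code, the title, and the marked text fragment.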
The collection step is not intended to be super efficient, because it doesn’t have to run very often. The step does the hard work (or so I imagined, when I devised the multi-step approach), so the next step can be efficient and fast. That is required, because that selection step can be activated at will by visitors, beyond my control. In theory that could result in a heavy load for the web server.
Strangely though, the collection step takes only about a sixth of a second, despite using a very humble virtual server, on which only a portion of the capacity of one of the physical processor cores is available.
In that extremely short period of about 160 milliseconds (elapsed time! actual CPU time is only about 60 ms), the program ploughed through approximately 8 megabytes of HTML code in 1099 files. I wonder how it is possible that modern computers sometimes need several minutes to complete some task, such as starting up a Windows system. How can that happen if this seemingly heavy sequential search is so extremely fast?
The selection program essentially does with a file prepared by the previous step, what these and these programs do with the site map: use a randomiser to select different entries from it in each call.
Care has been taken to guarantee that the same entry can never appear more than once for each call: this uses a simple table to remember entries that were already selected earlier. There must be an emergency brake, of course: if fewer entries were prepared than are requested, the randomised selection process must finally stop.
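The same idea – random selection without repeats, stopping when the prepared entries run out – can be sketched with GNU shuf (part of coreutils; it may not be present on a stock FreeBSD). The real selection program is written in C, so this is only an analogy:

```shell
# Random selection without repeats, sketched with shuf.
printf 'one\ntwo\nthree\n' > entries.txt
shuf -n 2 entries.txt   # two distinct entries, in random order
shuf -n 9 entries.txt   # more requested than exist: the "emergency brake" -
                        # shuf simply emits all three lines, each once, and stops
```

shuf never selects the same input line twice per call, which is exactly the guarantee described above.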
The selection function, like the collection function, is very fast and creates hardly any load for the web server. C, CGI, Apache and FreeBSD make a swift team, with myself being the team captain. But don’t forget the hardware and the virtualisation implementation by Tilaa.com.
I was curious about which files already had showcase markings, and which ones hadn’t, so I could consider adding them there. So I botched together a little shell script:
fgrep -rl "Destaque Rudhar.com" $HOMEBASE |
grep -v -f ~/bin/nodestaq-stop | sort
I recursively (-r) search all the files, starting at the root directory of files of the web site – ‘recursive’ meaning all files in any subdirectories are also searched. The option -l means I don’t want to see the text strings found by grep, but rather, and only, the file names (path names) of the files in which something was found.
After that I use a pipe (‘|’) to filter out lots of patterns, paths and files, of which I already know I won’t be wanting any markings in them. Examples are JPG, WAV and MP3 files.
This makes the whole thing pretty inefficient: a lot of data is searched even though it is known in advance that nothing sensible will ever come out of it, followed by a lot of file name filtering (especially when doing the reverse search, as described later).
Grep, as far as I know, cannot combine the -r option with file names or file name patterns by which the range of the search could be restricted. The arguments for grep in this case are directories, serving as a starting point for recursively searching the hierarchical file tree.
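For what it’s worth, newer GNU greps do offer --include and --exclude globs that restrict a recursive search by file name; whether that was available on the FreeBSD version in use here, I haven’t checked. A small self-contained demonstration:

```shell
# Restricting a recursive grep to HTML files with --include (GNU grep).
mkdir -p demo
printf 'Destaque Rudhar.com marker here\n' > demo/page.html
printf 'Destaque Rudhar.com marker here\n' > demo/notes.txt
grep -rl --include='*.html' 'Destaque Rudhar.com' demo
# reports only demo/page.html; demo/notes.txt is never even opened
```

With such globs the expensive name filtering after the fact would largely disappear.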
I could also have turned the script back to front: first filter find’s output, then use grep to search only the relevant files. That would look like this:
find $HOMEBASE -type f | grep -v -f ~/bin/nodestaq-stop |
xargs fgrep -l "Destaque Rudhar.com" | sort
But that too is inefficient, because my stop list nodestaq-stop is rather long, containing 130 lines. For each input line to grep, all those 130 regular expressions must be tested to see if there is a reason for skipping the next step. That takes time. (Once a reason has been found, it is no longer necessary to look for more; but still.)
Measured times (both based on the reversed search I’ll be discussing shortly): 1.15 seconds of actual CPU time for the second method, versus 1.28 for the first. Not worth the optimisation effort, especially considering that the script will run very infrequently.
The numbers tell the tale.
So far I was looking for files that did have the showcase text markings. Works very well. But what I really wanted to know was where they did NOT yet occur, so it might make sense to add them.
That’s simple, I thought, I just reverse the grep by means of the well-known reversal option -v, the one I also used above to exclude some file names from the search.
But it didn’t work. HTML files that I knew had markings appeared in the original list (correctly), but ALSO in the resulting list after adding -v. How could that be? I decided to test it with a simple example and to read the manual carefully.
This is the point: the option -v does do a reversal: “Invert the sense of matching, to select non-matching lines.” It says “lines”, not “files”. Instead of getting lines with markings (had I not asked for only file names by specifying -l), I get lines without markings.
So in both cases there would be output. And the option -l means: “Suppress normal output; instead print the name of each input file from which output would normally have been printed.”
In my example there is output both with and without -v, so -l produces output in both cases. The only case in which that would be different is if an HTML file contained only lines with markings and no other lines at all.
-v does reverse things, but not in the way that I need it. Fortunately, there is also an option that does do what I want. I don’t remember it from my earliest encounter with Unix, between 1985 and 1990. Perhaps it didn’t exist yet, or else I never noticed it? I don’t know and I won’t look it up.
The description of the option -L differs from that of -l in just one word: “Suppress normal output; instead print the name of each input file from which no output would normally have been printed.” This is indeed the solution: in the script that finds files with markings, don’t add -v, but instead replace -l by -L.
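The whole confusion fits in a three-line experiment (the file name and marker string here are made up):

```shell
# A typical HTML file has marked lines AND unmarked lines:
printf 'a line with MARK\nan ordinary line\n' > mixed.html
grep -l  'MARK' mixed.html   # prints mixed.html: some line matches
grep -vl 'MARK' mixed.html   # ALSO prints mixed.html: some line does NOT match
grep -L  'MARK' mixed.html   # prints nothing: the file does contain a match
```

Only -L asks the question actually intended: which files contain no match at all?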
Unix is case sensitive: the difference between uppercase and lowercase letters is important and changes meaning. Therefore purists always write command names like grep in lowercase, even in running text when the name appears at the start of a sentence.
I thought this was also true of man pages, but it isn’t. Or rather, sometimes it isn’t, like here on the web; but often it is, like at the command line of FreeBSD 8.3:
grep searches the named input FILEs (or standard input if no files are named, or the file name - is given) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.