Grep -v? No, -L!

6 and 7 March 2013, my own translation of the Dutch original

Promotion

My latest concoction to promote my website is a showcase function. It displays prominent passages from existing articles, with the clickable article title underneath.

The idea is to arouse interest. Probably to no avail. But I have enjoyed it, the programming, marking the passages and seeing them appear. Let me do my thing; as long as it doesn’t cause any inconvenience to others, it’s all right.

Implementation

The function has three separate steps:

Marking

I mark the passages by hand. Later on in older articles, or right away, or later as well, in newer ones. In the HTML, before the text fragment to be marked, I insert a comment (meaning, obviously, it is between ) containing a recognisable unique string, and a slightly different one after the text fragment.

Collecting

Algorithm

A program written in C, run once every 24 hours through crontab, or on demand if I deem that amusing or necessary, searches all the HTML files of the whole of the website – path names are provided by find followed by some grep filtering. The program notes the language code (example: from <html lang=en> in case the web page is in English), the title (between <h1> and </h1>, although I could also have used <title>), and any text between the special marking comments.

Those text fragments, each with a prepared hyperlink underneath, are collected into HTML files, one per language. These are not directly accessible for website visitors, but only through the selection program. More on that in the next section.

Performance

The collection step is not intended to be super efficient, because it doesn’t have to run very often. The step does the hard work (or so I imagined, when I devised the multi-step approach), so the next step can be efficient and fast. That is required, because that selection step can be activated at will by visitors, beyond my control. In theory that could result in a heavy load for the web server.

Strangely though, the collection step takes only about a sixth of a second, despite using a very humble virtual server, on which only a portion of the capacity of one of the physical processor cores is available.

In that extremely short period of about 160 milliseconds (elapsed time! actual CPU time is only about 60 ms), the program ploughed through approximately 8 megabytes of HTML code in 1099 files. I wonder how it is possible that modern computers sometimes need several minutes to complete some task, such as starting up a Windows system. How can that happen if this seemingly heavy sequential search is so extremely fast?

Selecting

The selection program essentially does with a file prepared by the previous step, what these and these programs do with the site map: use a randomiser to select different entries from it in each call.

Care has been taken to guarantee that the same entry can never appear more than once for each call: this uses a simple table to remember entries that were already selected earlier. There must be an emergency brake, of course: if fewer entries were prepared than are requested, the randomised selection process must finally stop.

The selection function, like the collection function, is very fast and creates hardly any load for the web server. C, CGI, Apache and FreeBSD make a swift team, with myself being the team captain. But don’t forget the hardware and the virtualisation implementation by Tilaa.com.

Which ones not yet, so yet to be?

Heading towards an algorithm

I was curious about what files already had showcase markings, and which ones hadn’t so I could think of adding them. So I botched together a little shell script:

fgrep -rl "Destaque Rudhar.com" $HOMEBASE | grep -v -f ~/bin/nodestaq-stop | sort

I recursively (-r) search all the files, starting at the root directory of files of the website – ‘recursive’ meaning all files in any subdirectories are also searched. The option -l means I don’t want to see the text strings found by grep, but rather, and only, the file names (path names) of the files in which something was found.

Performance

After that I use a pipe (‘|’) to filter out lots of patterns, paths and files, of which I already know I won’t be wanting any markings in them. Examples are JPG, WAV and MP3 files. This makes the whole thing pretty inefficient: a lot of data is searched even though it is known in advance that nothing sensible will ever come out of it, followed by a lot of file name filtering (especially when doing the reverse search, as described later).

Grep as far as I know cannot combine the -r option with file names or file name patterns, so the range of the search could be restricted. The arguments for grep in this case are directories, serving as a starting point for recursively searching the hierarchical file tree.

I could also have turned the script front to back: first filter find’s output, then use xargs to make grep search only relevant files. That would look like this:

find $HOMEBASE -type f | grep -v -f ~/bin/nodestaq-stop | xargs fgrep -l "Destaque Rudhar.com" | sort

But that too is inefficient, because my stoplist is rather long, containing 130 lines. For each input line to grep, all those 130 regular expressions must be tested to see if there is a reason for skipping the next step. That takes time. (Once a reason has been found, it is no longer necessary to look for more; but still.)

Measured times (both based on the reversed search I’ll be discussing shortly): 1,15 seconds of actual CPU time for the second method, versus 1,28 for the first. Not worth the optimisation effort, especially considering that the script will run very infrequently.

The numbers tell the tale.

Reversed search

So far I was looking for files that did have the showcase text markings. Works very well. But what I really wanted to know was where they did NOT yet occur, so it might make sense to add them.

That’s simple, I thought, I just reverse the grep by means of the well-known reversal option -v, the one I also used above to exclude some file names from the search.

But it didn’t work. HTML files that I know had markings, appeared in the original list (correctly), but ALSO in the resulting list after adding -v. How could that happen?

I decided to test it with a simple example and accurately read the manual. Search man grep.

This is the point: the option -v does do a reversal: “Invert the sense of matching, to select non-matching lines.”

It says “lines”, not “files”. Instead of getting lines with markings (if I hadn’t asked for only file names by specifying -l), I get lines without markings instead. So in both cases there would be output. And the option -l means:

“Suppress normal output; instead print the name of each input file from which output would normally have been printed.”

In my example, both with and without -v there is output, so -l always produces output. The only case in which that would be different is if an HTML file only contains lines with markings and no other lines at all.

The option -v does reverse things, but not in the way that I need it. Fortunately, there is also an option that does do what I want. I don’t remember it from my earliest encounter with Unix, between 1985 and 1990. Perhaps it didn’t exist yet, or else I never noticed it? I don’t know and I won’t look it up.

The description of the option -L is different from that of -l in just one word: “no”.

“Suppress normal output; instead print the name of each input file from which no output would normally have been printed.”

This is the solution indeed: in the script that finds files with markings, don’t add -v, but instead replace -l with -L.

Case

Unix is case sensitive: the difference between uppercase and lowercase letter is important and changes meaning. Therefore purists always write command names like grep in lowercase, even in running text when the name appears at the start of a sentence. I thought this is also true of man pages, but it isn’t. Or rather, sometimes it isn’t, like here on the web; but often it is, like in the command line of FreeBSD 8.3:

“grep searches the named input FILEs (or standard input if no files are named, or the file name - is given) for lines containing a match to the given PATTERN. By default, grep prints the matching lines.”