Idea 2 July, text 18, 26 and . Continued from the previous.
The algorithm follows from what information is requested, and the design criteria. What do we want to know? This: what words occur in the website, and where?
For this we need a list containing all the words. Of each word, it should be known in which of the HTML files it occurs, and at what byte position. Steps to create such a list are:
Create a list of HTML files to be examined. This can
be done using
find, possibly fine-tuned with a
grep -v -f stoplist to exclude some files that are not
of interest, or would cause clutter.
Open all the files in that list, and peruse them looking for words. A word is a sequence of alphabetic UTF-8 characters. Dashes (-) and apostrophes (' or ’) may occur except, at the start and the end, for English words like ‘don’t’ and ‘isn’t’. Skip anything that is in a HTML tag, between < and >. Do include entities (later to be removed), so that the presence of optional hyphenation points indicated by the entity ­ does not interrupt an alphabetic sequence that would otherwise be a word.
Write all such words to a file, one word a line, followed by an indication of where that word can be found: the line number (zero-based) in the list of files from point 1, and a byte offset in the HTML file.
Sort the list obtained from the previous step. Although the list contains words in UTF-8, for reasons explained in a next episode, [@? Add link, once known @?] it is sorted using the C locale, that is, by byte value of single bytes of the UTF-8 representation.
If a word occurs in multiple files, and/or several times in any one file, there will be multiple occurrences (i.e., lines) in the list so far obtained. To save disk space, and for easier use of the information in a word search function, those lines are now combined, so that each unique word occurring in all of the HTML files together, occupies just one line in the combined file. That word is followed by a list of pairs, each pair containing the line number of the HTML file in the list obtained in step 1.
The program that does the combining also creates a file containing fixed-length byte offsets, one per line, of the start of each line in the other file. This comes handy for the binary search in the next step.
This step 4 also suppresses excessively long lists of occurrences of trivial words like English ‘in’, ‘is’, ‘and’, ‘the’ etc., that would only take space, and aren’t interesting to look for in a website. Other languages also have such words. They are suppressed by imposing a hard limit on the number of occurrences.
A separate step, to be performed after step 1, or right now at the latest, also creates a file with byte offsets for the list of HTML files, to simplify and speed up search facilities.
Search facilities can be implemented which employ the four files thus far created. The algorithm is:
egrepto find one or more matching lines based on a regular expression.
Optionally, the words can be extracted from the list of words and locations, and compared against the words from the previous run, to see which words now occur but didn’t before, and vice versa. In other words, which words were added and deleted.
Although the algorithm was designed with HTML in mind, it probably also works with XHTML, XML in general, and even (and this case I tested) with MBX mailboxes such as those used by e-mail program Eudora.
C sources and standard tools used to implement each step are:
Now let’s have a look at the performance of this algorithm in a real life situation.