Idea 2 July, text 18, 26 and 19 September 2021. Continued from the previous episode.
The algorithm follows from what information is requested, and from the design criteria. What do we want to know? This: what words occur on the website, and where?
For this we need a list containing all the words. For each word, it should be known in which of the HTML files it occurs, and at what byte position. The steps to create such a list are given below; after the numbered list, some sketched example commands illustrate the steps.
1. Create a list of the HTML files to be examined. This can be done using find, possibly fine-tuned with a grep -v -f stoplist to exclude some files that are not of interest, or would cause clutter.
2. Open all the files in that list, and peruse them looking for words. A word is a sequence of alphabetic UTF-8 characters. Dashes (-) and apostrophes (' or ’) may occur, except at the start and the end, to allow for English words like ‘don’t’ and ‘isn’t’. Skip anything that is in an HTML tag, between < and >. Do include entities (later to be removed), so that the presence of optional hyphenation points, indicated by the entity &shy;, does not interrupt an alphabetic sequence that would otherwise be a word. Write all such words to a file, one word per line, each followed by an indication of where that word can be found: the line number (zero-based) of the file in the list from step 1, and a byte offset in the HTML file.
3. Sort the list obtained from the previous step. Although the list contains words in UTF-8, for reasons explained in a next episode it is sorted using the C locale, that is, by the byte values of the individual bytes of the UTF-8 representation.
4. If a word occurs in multiple files, and/or several times in any one file, there will be multiple occurrences (i.e., lines) in the list so far obtained. To save disk space, and for easier use of the information in a word search function, those lines are now combined, so that each unique word from all of the HTML files together occupies just one line in the combined file. That word is followed by a list of pairs, each pair containing the line number of the HTML file in the list obtained in step 1, and the byte offset of the word in that file. The program that does the combining also creates a file containing fixed-length byte offsets, one per line, of the start of each line in the combined word list. This comes in handy for the binary search in a later step. This step 4 also suppresses excessively long lists of occurrences of trivial words like English ‘in’, ‘is’, ‘and’, ‘the’, etc., that would only take up space, and aren’t interesting to look for in a website. Other languages also have such words. They are suppressed by imposing a hard limit on the number of occurrences.
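For step 1, a minimal sketch could look as follows. The file names stoplist and htmlfiles.list are assumptions for illustration only, not names taken from the siworin sources.

    # Step 1, sketched: collect the HTML files, minus those matching
    # any pattern in 'stoplist'.
    find . -name '*.htm' -o -name '*.html' |
    grep -v -f stoplist > htmlfiles.list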
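Step 2 is done by a C program in reality. As a very rough shell approximation, which neither skips tags nor handles entities, GNU grep can already demonstrate the word pattern and the byte offsets for a single file:

    # Step 2, roughly approximated: -o prints each match on its own
    # line, -b prefixes it with its byte offset. A UTF-8 locale is
    # needed for non-ASCII letters. The real scanner additionally
    # skips HTML tags, handles entities, and writes lines of the
    # form 'word filenumber offset' (here imagined as words.raw).
    grep -obE "[[:alpha:]]([-'’[:alpha:]]*[[:alpha:]])?" somefile.html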
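Step 3 is then a matter of forcing the C locale, so that sort compares raw bytes. Again, the file names are assumed:

    # Step 3: sort by the byte values of the UTF-8 representation.
    LC_ALL=C sort words.raw > words.sorted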
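Step 4 is also a C program in reality. The following awk sketch only shows the idea, assuming input lines of the form ‘word filenumber offset’, and an arbitrary hard limit of 100 occurrences (the real limit may differ; the sketch simply drops over-frequent words, one possible reading of the hard limit):

    # Step 4, sketched in awk: in sorted input identical words are
    # adjacent, so they can be merged into one line per unique word,
    # followed by its filenumber,offset pairs.
    awk 'BEGIN { limit = 100 }                   # assumed limit
    $1 != word { flush(); word = $1; pairs = ""; n = 0 }
    { pairs = pairs " " $2 "," $3; n++ }
    END { flush() }
    function flush() {
        if (word != "" && n <= limit)
            print word pairs
    }' words.sorted > wordlist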
A separate step, to be performed after step 1, or right now at the latest, also creates a file with byte offsets for the list of HTML files, to simplify and speed up search facilities.
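Such an offsets file, here made for the list of HTML files from step 1, could be sketched like this; under the C locale, awk’s length() counts bytes, and the fixed width of 9 digits is an assumption:

    # Write the byte offset of the start of each input line,
    # fixed-width, so that offset number n is itself found at
    # byte n * 10 (9 digits plus a newline).
    LC_ALL=C awk '{ printf "%09d\n", off; off += length($0) + 1 }' \
        htmlfiles.list > htmlfiles.offsets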
Search facilities can be implemented which employ the four files thus far created. The algorithm is: use egrep to find one or more matching keyword lines, based on a regular expression.
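For example, assuming the combined word list of step 4 is called wordlist, a search for two keywords at once could be sketched as:

    # Every line starts with the word itself, followed by a space,
    # so anchoring the regular expression finds exact keyword lines.
    # The offsets file enables the binary search (bsearch) mentioned
    # earlier.
    LC_ALL=C egrep '^(summer|winter) ' wordlist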
Optionally, the words can be extracted from the list of words and locations, and compared against the words from the previous run, to see which words now occur but didn’t before, and vice versa. In other words, which words were added and deleted.
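That comparison can be sketched with sed stripping everything from the first space onwards, and diff comparing against a saved word list from the previous run (file names assumed):

    # '<' lines in the diff output disappeared since the previous
    # run, '>' lines are newly added words.
    sed 's/ .*//' wordlist > words.new
    diff words.old words.new
    mv words.new words.old        # keep for the next comparison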
Although the algorithm was designed with HTML in mind, it probably also works with XHTML, with XML in general, and even (this case I did test) with MBX mailboxes such as those used by the e-mail program Eudora.
C sources and standard tools used to implement the steps are: for step 1, find and grep; for step 3, sort; for the search facilities, bsearch, egrep and the C sources siworin-displ.c, siworin-files.c and siworin-config.c; for the optional comparison, sed and diff.
Now let’s have a look at the performance of this algorithm in a real-life situation.
Copyright © 2021 by R. Harmsen, all rights reserved.