Simple word indexer (10)

13 and 14 August, 19 September 2021. Continued from the previous.

Sorting

Accented letters

Now let’s consider this output of the word separator, more or less fabricated for the sake of testing:

eleven!000014!0000027d
eleven!0001fb!000020eb
élève!00022d!0000661c
élèves!00022d!000186a7
élève!00022d!0005f310
élever!00022d!000e7e92
élevé!00022d!0034cce7
élèves!00022d!00362d14
élèves!00022d!00382f16
élève!000673!00001483
élevé!000673!00001496
eleven!000732!00000257

We see here, in the order in which they appeared in HTML files, several occurrences of the English word for 11, the French word for ‘pupil’ or ‘raises’ (élève), its plural, the French verb for ‘to raise’ (élever), and its past participle (élevé).

The idea of the Simple word indexer is to sort all equal words together, so that the next stage siworin-makelst.c can combine the above entries for élève to:

élève!22d!661c!22d!5f310!673!1483

But there’s a problem here. Just running the sort command means it uses the default locale of the system, which is en_US.UTF-8 in my case, American English in Unicode UTF-8 encoding. And that locale, like the one for French, fr_FR.UTF-8, I find treats all the letters ‘e’ as equal, whether or not there is any accent on them. So after sorting with this locale we get:

élève!00022d!0000661c
élève!00022d!0005f310
élevé!00022d!0034cce7
élève!000673!00001483
élevé!000673!00001496
eleven!000014!0000027d
eleven!0001fb!000020eb
eleven!000732!00000257
élever!00022d!000e7e92
élèves!00022d!000186a7
élèves!00022d!00362d14
élèves!00022d!00382f16

After the first two pupil words, French élève, there’s another one in line 4, because the locale-sensitive comparison function used in this sorting doesn’t recognise that élevé in line 3 is different. That may be quite OK as a sorting order for a French dictionary, but it breaks my algorithm in siworin-makelst.c, which assumes that all equal words appear together, and only then follows a different word.

So we need a different sorting order here, and that’s easy: every proper Unix system has a locale called C, also known as POSIX, which is based on the old C behaviour from the times when supporting different languages than English was a luxury nobody could afford. Or it was deemed unnecessary. This old-fashioned sorting just considers the binary value of each byte, without interpreting it in any way, and without supporting multibyte characters.

So instead of just running sort, I prepend LC_ALL=C to the command. That works in all Bourne-like shells, like sh, dash, and bash. See the makefile for an example. The result is this:

eleven!000014!0000027d
eleven!0001fb!000020eb
eleven!000732!00000257
élever!00022d!000e7e92
élevé!00022d!0034cce7
élevé!000673!00001496
élève!00022d!0000661c
élève!00022d!0005f310
élève!000673!00001483
élèves!00022d!000186a7
élèves!00022d!00362d14
élèves!00022d!00382f16

which siworin-makelst.c can easily transform to:

eleven!14!27d!1fb!20eb!732!257
élever!22d!e7e92
élevé!22d!34cce7!673!1496
élève!22d!661c!22d!5f310!673!1483
élèves!22d!186a7!22d!362d14!22d!382f16

Five unique different words, with data on where to find them. An extra advantage of this sorting order, which puts accented letters after the simple letters a to z, is that in the binary search in siworin-srchind.c (look for library function bsearch and comparison function siworin_compar) I can use a simple strcmp to determine what is greater, less or equal, without having to use special functions for wide characters. I quite prefer to reserve those for when they are really necessary.

Using different locales, that is, different collating sequences, for sorting and comparing would break the search algorithm.

It may seem strange that I treat the same data as wide characters with en_US.UTF-8 as the locale in siworin-wordsep, but as a byte encoding with locale C for sorting. In siworin-wordsep, I even use both approaches in the same source file – though not for the same stream. (On second thought, 19 August 2021, I prefer not to. See episode 13.)

If done carefully, this isn’t problematic.

Ascending locations

It’s easiest to sort on the whole line of text, without distinguishing any fields. That isn’t inefficient: comparison will end as soon as the word is found to be different, and the rest on the line will be ignored. An added advantage of this is, that for identical words, the encoding of their location will also be taken into account for the sort. So when eventually displaying search results, they will be in ascending order of words, then files, then locations within the file.

If ever somebody (maybe me?) might want to implement a proximity search – looking for words that are close to each other, but not necessarily adjacent – the sorted addresses will probably make life easier.

With such full line searching in mind, I decided to put leading zeroes in front of the addresses (see the #define SIWORIN_FORMAT1 in siworin.h), so the addresses have a fixed length up until 16.777.216 files (I have only 1875) and a maximum of 4 billion bytes per file (which is an absurd size for an HTML file).

Of course that makes the unsorted and sorted words files larger than necessary, and they are the largest files around, because they have an entry for every detected words. But they are temporary, and when connecting processing steps using pipes, they don’t even have to land on disk: everything can be done in internal memory. Temporary files for the sorting however will probably be disk files, but in the presence of ample memory, as is the case with modern computers, they will be cached, and probably already be deleted before there is time to flush them to disk.

The combining step implemented in source file siworin-makelst.c (this is step 4 of the algorithm) removes the leading zeroes again (see “sscanf(seppos,”, twice), so the file siworin.words, which will stay around to be consulted for the searches, will not be excessively large. Also, too frequent words are excluded from it (see occrlist_len <= word_max_occur in siworin-makelst.c), which further reduces file size.

Instead of using hexadecimal addresses, encoding 4 bits per character position, I could have taken advantage of an encoding scheme like base64, to have 6 bits in each. And I could have stripped 4 last bits off the addresses, so the exact location of a word would not be known, only that of a 16-byte block in which it appears, somewhere. But I deemed both to be unnecessary complications.

Choice of field separator

I used to use the pipe symbol ‘|’ as the field separator for the addresses. Found it more beautiful. Of course beauty is totally irrelevant in a file meant for computer processing, not human consumption and appreciation. But I’m a bit crazy about such things.

However, the pipe symbol caused sorting problems. As said, I sort whole lines, each containing a word including its location data. Normally, when sorting, shorter words come before longer words. First ‘create’, then ‘created’ and ‘creates’. This is because in any locale, the null character is less than any other valid character.

Now when using a separator character with a relatively high ASCII value, this ordering is disturbed. The effect could also be seen in the above French example. To make it clearer to English speakers, and to others who are less familiar with French, I’ll now present an all-English example. With ‘|’ as the separator, after sorting it would be like this:

created|00000e|000037d2
created|000048|0000045e
creates|000069|00003b5c
create|00002b|00000460
creating|000048|000010cd
creating|000049|00000deb
creations|000051|00000f0c
creation|000048|00000f0c
creative|0001c8|00000248
creative|0001c8|000004d1
creator|00022d|00166e5d
creator|00022d|00166eab

We see that ‘create’ comes after ‘created’ and ‘creates’, and the plural ‘creations’ is before ‘creation’. This wrecks the binary search algorithm, in which comparisons must be made on just the word, because no address data is known in advance. The wrong sort order is caused by ‘|’ having the ASCII value of hex 7c, higher than that of the letter z, which has 7a.

The pipe symbol by the way may also interfere with its use to separate alternatives in extended regular expressions.

When using the exclamation mark ‘!’ instead, with a ASCII value of 21, well below that of ‘a’ (61) or even ‘A’ (41; but I make all words lowercase), the sorting order becomes correct, as follows:

create!00002b!00000460
created!00000e!000037d2
created!000048!0000045e
creates!000069!00003b5c
creating!000048!000010cd
creating!000049!00000deb
creation!000048!00000f0c
creations!000051!00000f0c
creative!0001c8!00000248
creative!0001c8!000004d1
creator!00022d!00166e5d
creator!00022d!00166eab

Alternative

An alternative approach, allowing for greater freedom in the choice of the separator character, would be to not sort on whole lines, but on fields. The UNIX utility sort (1) has an option -t or --field-separator for that purpose. Then the word and the addresses could be treated separately, thus eliminating the problem that a longer word might be sorted to a position before an otherwise identical shorter word.

The problem with that however is that as far as I can see, sort (1) can treat a field as numeric data, that is as decimal data, but not as hexadecimal values. And I want my addresses to be in hex, not decimal. Or perhaps the -n option also covers hexadecimal? The man page isn’t clear about that. And I didn’t test it.

Yes, me. By 19 September 2021, the current status is that searching is possible on multiple words, such that if three or more of them occur in an HTML page, close enough together, there is a match. This minimum of three words is configurable, as is the distance between words (in bytes including invisible HTML tags).

Of course, if the minimum is three and only two search words were given, those two also suffice for a match, so the minimum is lowered in the presence of fewer words. Words not found in the index (including those that were suppressed because they are too frequent) do not count.

This makes the former only possibility, namely searching the site on a single word, a special case of a more general algorithm.

The implementation is in C source file siworin-match.c.

In addition to searching for the combination of multiple search words, you can have all occurrences of all words listed, uncombined. A special of that is when the keywords are the result of regular expression searching. Then only one pattern is allowed, and the combined method is disabled.