13 and , 19 September 2021. Continued from the previous.
Now let’s consider this output of the word separator, more or less fabricated for the sake of testing:
eleven!000014!0000027d eleven!0001fb!000020eb élève!00022d!0000661c élèves!00022d!000186a7 élève!00022d!0005f310 élever!00022d!000e7e92 élevé!00022d!0034cce7 élèves!00022d!00362d14 élèves!00022d!00382f16 élève!000673!00001483 élevé!000673!00001496 eleven!000732!00000257
We see here, in the order in which they appeared in HTML files, several occurrences of the English word for 11, the French word for ‘pupil’ or ‘raises’ (élève), its plural, the French verb for ‘to raise’ (élever), and its past particle (élevé).
The idea of the Simple word indexer is to sort all equal words together,
so that the next stage siworin-makelst.c
can combine
the above entries for élève to:
élève!22d!661c!22d!5f310!673!1483
But there’s a problem here. Just running the sort
command
means it uses the default locale
of the system, which
is en_US.UTF-8
in my case, American English in Unicode
UTF-8 encoding.
And that locale, like the one for French, fr_FR.UTF-8
,
I find treats all the letters ‘e’ as equal, whether or not there
is any accent on them. So after sorting with this locale we get:
élève!00022d!0000661c élève!00022d!0005f310 élevé!00022d!0034cce7 élève!000673!00001483 élevé!000673!00001496 eleven!000014!0000027d eleven!0001fb!000020eb eleven!000732!00000257 élever!00022d!000e7e92 élèves!00022d!000186a7 élèves!00022d!00362d14 élèves!00022d!00382f16
After the first two pupil words, French élève, there’s another
one in line 4, because the locale-sensitive comparison function used
in this sorting doesn’t recognise that élevé in line 3 is
different. That may be quite OK as a sorting order for a French
dictionary, but it breaks my algorithm in siworin-makelst.c
,
which assumes that all equal words appear together, and only then
follows a different word.
So we need a different sorting order here, and that’s easy: every
proper Unix system has a locale called C
, also known as
POSIX
, which is based on the old C behaviour from the
times when supporting different languages than English was a luxury
nobody could afford. Or it was deemed unnecessary. This old-fashioned
sorting just considers
the binary value of each byte, without interpreting it in any way,
and without supporting multibyte characters.
So instead of just running sort
,
I prepend LC_ALL=C to the command. That works in all Bourne-like shells,
like sh
, dash
, and bash
. See the
makefile for an example.
The result is this:
eleven!000014!0000027d eleven!0001fb!000020eb eleven!000732!00000257 élever!00022d!000e7e92 élevé!00022d!0034cce7 élevé!000673!00001496 élève!00022d!0000661c élève!00022d!0005f310 élève!000673!00001483 élèves!00022d!000186a7 élèves!00022d!00362d14 élèves!00022d!00382f16which
siworin-makelst.c
can easily transform to:
eleven!14!27d!1fb!20eb!732!257 élever!22d!e7e92 élevé!22d!34cce7!673!1496 élève!22d!661c!22d!5f310!673!1483 élèves!22d!186a7!22d!362d14!22d!382f16
Five unique different words, with data on where to find them.
An extra advantage of this sorting order, which puts accented
letters after the simple letters a to z, is that in the binary
search in siworin-srchind.c
(look for
library function bsearch
and comparison function
siworin_compar
) I can use a simple
strcmp
to determine what is greater, less or equal,
without having to use special functions for wide characters.
I quite prefer to reserve those for when they are really necessary.
Using different locales, that is, different collating sequences, for sorting and comparing would break the search algorithm.
It may seem strange that I treat the same data as wide characters
with en_US.UTF-8
as the locale in
siworin-wordsep
, but as a byte encoding with locale
C
for sorting. In siworin-wordsep
, I even
use both approaches in the same source file – though not for
the same stream.
(On second thought, 19 August 2021, I prefer not to. See
episode 13.)
If done carefully, this isn’t problematic.
It’s easiest to sort on the whole line of text, without distinguishing any fields. That isn’t inefficient: comparison will end as soon as the word is found to be different, and the rest on the line will be ignored. An added advantage of this is, that for identical words, the encoding of their location will also be taken into account for the sort. So when eventually displaying search results, they will be in ascending order of words, then files, then locations within the file.
If ever somebody (maybe me?) might want to implement a proximity search – looking for words that are close to each other, but not necessarily adjacent – the sorted addresses will probably make life easier.
With such full line searching in mind, I decided to put leading
zeroes in front of the addresses (see the #define
SIWORIN_FORMAT1
in siworin.h
), so the
addresses have a fixed length up until 16.777.216 files (I have
only 1875) and
a maximum of 4 billion bytes per file (which is an absurd size
for an HTML file).
Of course that makes the unsorted and sorted words files larger than necessary, and they are the largest files around, because they have an entry for every detected words. But they are temporary, and when connecting processing steps using pipes, they don’t even have to land on disk: everything can be done in internal memory. Temporary files for the sorting however will probably be disk files, but in the presence of ample memory, as is the case with modern computers, they will be cached, and probably already be deleted before there is time to flush them to disk.
The combining step implemented in source file
siworin-makelst.c
(this is step 4
of the algorithm) removes the leading zeroes again (see
“sscanf(seppos,
”, twice), so the file
siworin.words
, which will stay around to be consulted
for the searches, will not be excessively large. Also, too frequent
words are excluded from it (see
occrlist_len <= word_max_occur
in
siworin-makelst.c
), which further reduces file size.
Instead of using hexadecimal addresses, encoding 4 bits per character position, I could have taken advantage of an encoding scheme like base64, to have 6 bits in each. And I could have stripped 4 last bits off the addresses, so the exact location of a word would not be known, only that of a 16-byte block in which it appears, somewhere. But I deemed both to be unnecessary complications.
I used to use the pipe symbol ‘|’ as the field separator for the addresses. Found it more beautiful. Of course beauty is totally irrelevant in a file meant for computer processing, not human consumption and appreciation. But I’m a bit crazy about such things.
However, the pipe symbol caused sorting problems. As said, I sort whole lines, each containing a word including its location data. Normally, when sorting, shorter words come before longer words. First ‘create’, then ‘created’ and ‘creates’. This is because in any locale, the null character is less than any other valid character.
Now when using a separator character with a relatively high ASCII value, this ordering is disturbed. The effect could also be seen in the above French example. To make it clearer to English speakers, and to others who are less familiar with French, I’ll now present an all-English example. With ‘|’ as the separator, after sorting it would be like this:
created|00000e|000037d2 created|000048|0000045e creates|000069|00003b5c create|00002b|00000460 creating|000048|000010cd creating|000049|00000deb creations|000051|00000f0c creation|000048|00000f0c creative|0001c8|00000248 creative|0001c8|000004d1 creator|00022d|00166e5d creator|00022d|00166eab
We see that ‘create’ comes after ‘created’ and ‘creates’, and the plural ‘creations’ is before ‘creation’. This wrecks the binary search algorithm, in which comparisons must be made on just the word, because no address data is known in advance. The wrong sort order is caused by ‘|’ having the ASCII value of hex 7c, higher than that of the letter z, which has 7a.
The pipe symbol by the way may also interfere with its use to separate alternatives in extended regular expressions.
When using the exclamation mark ‘!’ instead, with a ASCII value of 21, well below that of ‘a’ (61) or even ‘A’ (41; but I make all words lowercase), the sorting order becomes correct, as follows:
create!00002b!00000460 created!00000e!000037d2 created!000048!0000045e creates!000069!00003b5c creating!000048!000010cd creating!000049!00000deb creation!000048!00000f0c creations!000051!00000f0c creative!0001c8!00000248 creative!0001c8!000004d1 creator!00022d!00166e5d creator!00022d!00166eab
An alternative approach, allowing for greater freedom in
the choice of the separator character, would be to not sort
on whole lines, but on fields. The UNIX utility
sort (1)
has an option -t
or
--field-separator
for that purpose. Then
the word and the addresses could be treated separately,
thus eliminating the problem that a longer word might be
sorted to a position before an otherwise identical shorter
word.
The problem with that however is that as far as I can see,
sort (1)
can treat a field as numeric data,
that is as decimal data, but not as hexadecimal
values. And I want my addresses to be in hex, not decimal.
Or perhaps the -n
option also covers hexadecimal?
The man page isn’t clear about that. And I didn’t test it.
Yes, me. By 19 September 2021, the current status is that searching is possible on multiple words, such that if three or more of them occur in an HTML page, close enough together, there is a match. This minimum of three words is configurable, as is the distance between words (in bytes including invisible HTML tags).
Of course, if the minimum is three and only two search words were given, those two also suffice for a match, so the minimum is lowered in the presence of fewer words. Words not found in the index (including those that were suppressed because they are too frequent) do not count.
This makes the former only possibility, namely searching the site on a single word, a special case of a more general algorithm.
The implementation is in C source file siworin-match.c
.
In addition to searching for the combination of multiple search words, you can have all occurrences of all words listed, uncombined. A special of that is when the keywords are the result of regular expression searching. Then only one pattern is allowed, and the combined method is disabled.
Copyright © 2021 by R. Harmsen, all rights reserved.