Simple word indexer (8)

1, 10 and 19 September 2021. Continued from the previous.

Word separation (2)


My word separator or extractor siworin-wordsep.c does not fully parse HTML, XML, XHTML or whatever, but only takes the crude measure of ignoring anything that is between < and >. Most of the time, that works well. However, some text will be missed even though it contains words that should have been found. Examples:

An improvement would be to do real parsing, e.g. using Google’s Gumbo library (which I do already use elsewhere), and handle all the text elements that it finds. However, I have no plans to make such a change.

Addition 27 January 2022: Later on I made this behaviour switchable by config file, see
SIWORIN_DONT_REMOVE_TAGS_CONTEXT in siworin-config.h, and
dont_remove_tags_context in the example config file siworin.conf.
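
As an illustration of the crude measure described above, here is a minimal sketch that drops everything between < and >. It is not the code from siworin-wordsep.c: the function name skip_tags and the buffer handling are assumptions made for this example only.

```c
#include <stddef.h>

/* Hypothetical helper, NOT taken from siworin-wordsep.c.
 * Copies src to dst, dropping everything from '<' up to and
 * including the next '>'. dst needs room for strlen(src) + 1 bytes. */
static void skip_tags(const char *src, char *dst)
{
    int in_tag = 0;

    for (; *src != '\0'; src++) {
        if (*src == '<')
            in_tag = 1;             /* start ignoring */
        else if (*src == '>' && in_tag)
            in_tag = 0;             /* stop ignoring */
        else if (!in_tag)
            *dst++ = *src;          /* ordinary text is kept */
    }
    *dst = '\0';
}
```

With this sketch, "<p>Hello <b>world</b></p>" becomes "Hello world". Text that lives inside a tag, such as an attribute value, disappears along with the tag, which is how words get missed.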



An HTML entity is a sequence that starts with an ampersand (&) and ends with a semicolon (;), with simple ASCII in between. It can be used to encode more complicated characters. This can be done symbolically – &Kcy; is a Cyrillic K, К; &Kappa; is the Greek K, Κ – or by Unicode scalar number: &#x2011; or &#8209; encodes a non-breaking hyphen.
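
To make the numeric form concrete, the following sketch parses a numeric character reference and shows that the hexadecimal and decimal spellings yield the same code point. The function numeric_entity is hypothetical and written only for this illustration; Siworin itself does not decode entities.

```c
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Parses a numeric character reference such as "&#x2011;" or "&#8209;".
 * Returns the code point, or -1 if s is not a well-formed numeric
 * reference (symbolic entities like "&shy;" are rejected too). */
static long numeric_entity(const char *s)
{
    long cp = 0;
    int base = 10, ndigits = 0;

    if (s[0] != '&' || s[1] != '#')
        return -1;
    s += 2;
    if (*s == 'x' || *s == 'X') {   /* hexadecimal form */
        base = 16;
        s++;
    }
    for (; *s != ';'; s++, ndigits++) {
        int d;

        if (*s >= '0' && *s <= '9')
            d = *s - '0';
        else if (base == 16 && *s >= 'a' && *s <= 'f')
            d = *s - 'a' + 10;
        else if (base == 16 && *s >= 'A' && *s <= 'F')
            d = *s - 'A' + 10;
        else
            return -1;              /* bad digit or missing ';' */
        cp = cp * base + d;
        if (cp > 0x10FFFF)
            return -1;              /* beyond the Unicode range */
    }
    return ndigits > 0 ? cp : -1;
}
```

Both numeric_entity("&#x2011;") and numeric_entity("&#8209;") return 0x2011, the non-breaking hyphen.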

The only entity I still use within a word is &shy;, for an optional hyphen, a suggested place to hyphenate a word if the need arises. I know a controversy exists as to the true nature of &shy;, also known as the ISO 8859-1 and Unicode character 0xAD. Some say it is a soft hyphen, used by layout software to mark visible hyphens that result from the hyphenation algorithm, so hyphens that were not put there by the author of the text. In my opinion though, that makes no sense in HTML. I use &shy; as an optional hyphen.

I normally have my web texts aligned both left and right, with any automatic hyphenation turned off. If the columns have a reasonable width, that works well even for languages that write compound words as one word, like Dutch and German, and therefore tend to have some rather long words. However, there are also a lot of short words to make up for that.

Occasionally, but surprisingly rarely, this approach creates text lines in which the words have excessive amounts of white space between them. Then, and only then, I place &shy; hyphenation suggestions, preferably on morphological boundaries, and such that the resulting parts do not differ in length too much. Regardless of the true meaning of &shy;, so far all the browsers that I have used understood my intention and hyphenated accordingly, clearing the lines of overlong stretches of white space.

Now although I consider the string &shy; part of the word, I do not want it to appear in the word as extracted from the text. For example, when in various places on my website I have the words:
(Dutch for ‘combustion engines’) I do not want all five of those to appear as index words, to be found only when entered exactly like they are written in the text. Instead, I want just the single word ‘verbrandingsmotoren’ to be included, which finds me all the other occurrences as well.

This can be achieved by accepting the entity or entities as part of the word, but removing them before writing the word, and its location in HTML, to the file of words. Of course this requires special care when displaying search results with context: the word in the index and the word in the actual HTML may have a different length.
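
The removal step can be sketched as follows. This is an illustration under assumptions, not the actual code of siworin-wordsep.c: the function name strip_shy is made up, and the hyphenation points in the usage example are invented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT taken from siworin-wordsep.c.
 * Removes every "&shy;" from word, in place, and returns the
 * length of the resulting string. */
static size_t strip_shy(char *word)
{
    static const char shy[] = "&shy;";
    const size_t shylen = sizeof shy - 1;
    char *src = word, *dst = word;

    while (*src != '\0') {
        if (strncmp(src, shy, shylen) == 0)
            src += shylen;          /* skip the entity */
        else
            *dst++ = *src++;        /* keep everything else */
    }
    *dst = '\0';
    return (size_t)(dst - word);
}
```

Applied to a buffer holding "ver&shy;brandings&shy;motoren", the buffer afterwards holds "verbrandingsmotoren" and the function returns 19. The length of the original HTML text (29 bytes in this case) has to be remembered or recomputed separately if the context display needs it.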

(See function CalcWrdLen, and its call in function HighlightWords, in source file siworin-displ.c, for a rather rudimentary solution.)
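
One possible way to recover the length in the HTML is sketched below. It is not the CalcWrdLen from siworin-displ.c, which may well work differently: the function html_word_len, its parameters and its return convention are assumptions for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT the CalcWrdLen from siworin-displ.c.
 * html points at the place in the HTML where the indexed word was
 * found; word is the word as stored in the index, without "&shy;".
 * Returns the number of bytes the word occupies in the HTML,
 * including any "&shy;" entities inside it, or 0 if the HTML at
 * this position does not match the word. */
static size_t html_word_len(const char *html, const char *word)
{
    const char *p = html;

    while (*word != '\0') {
        if (strncmp(p, "&shy;", 5) == 0) {
            p += 5;                 /* entity: present in HTML only */
        } else if (*p == *word) {
            p++;                    /* same character in both */
            word++;
        } else {
            return 0;               /* mismatch or end of HTML */
        }
    }
    return (size_t)(p - html);
}
```

For the HTML text "ver&shy;brandings&shy;motoren zijn" and the index word "verbrandingsmotoren", this returns 29: the 19 letters plus two entities of 5 bytes each. That is the number of bytes to highlight.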

&auml;, &atilde;, &ccedil;, etc.

In HTML, languages like German, French, Spanish, Portuguese and Italian can be written using just plain 7-bit ASCII, and still have all the correct accented letters. Words like überhaupt, Köln, großartig, français, élève, élevé, España, Ibáñez, coração, canções, and pietà would then be encoded as the cumbersome and ugly &uuml;berhaupt, K&ouml;ln, gro&szlig;artig, fran&ccedil;ais, &eacute;l&egrave;ve, &eacute;lev&eacute;, Espa&ntilde;a, Ib&aacute;&ntilde;ez, cora&ccedil;&atilde;o, can&ccedil;&otilde;es and piet&agrave;.

I largely skipped that stage. With the exception of some remnants that I now find using my own prototype Siworin search engine, early on I made it a habit to write in ISO 8859-1, an encoding that covers all the languages I would ever write in, and most that I might ever cite a word from. They are spoken in a cross on the map of Europe, from Iceland to Albania, and from Finland to the Azores. Or that’s what I always said, but I now notice that the Canary Islands farther south are better suited.

But this too is a thing of the past: Unicode and UTF-8 now rule.

In the aforementioned earlier version of the word separator, I removed most entities, but kept those for accented letters and their like intact. In the new program, I remove only &shy;, and leave everything else. As a result they appear in the word list, and they can be found, but only by searching for part of the code. For example, searching for atilde; finds Covilhã in my photo page, where I find it is still written as Covilh&atilde;. But any occurrences of Covilhã properly written in UTF-8 are not found that way. Incorrect, but intended behaviour, because entities are simply not supported.

Full support would mean converting the entities to put them in the word list as UTF-8, and perhaps add an unaccented version too – as does Hyperestraier, and it does only that. But I’m not gonna implement anything like that, sorry. Too much work, and it violates my simplicity design criterion.

&ndash;, &nbsp;, &#x2011;

I often encode en dashes and em dashes as entities, &ndash; and &mdash;. But they are rarely adjacent to alphabetic characters. If they are, siworin-wordsep.c will interpret them as part of the word.

The same is true of a non-breaking space, &nbsp;. Examples in Dutch: à&nbsp;charge, t&nbsp;kofschip.

A special case is the non-breaking hyphen, &#x2011;, which I sometimes use at the start of a suffix I mention, instead of a normal hyphen, which some unfortunate hyphenation algorithm might then put at the end of the line, all on its own. The non-breaking hyphen appears before the first letter (i.e. alphabetic character) of the suffix, so it won’t be included in the word list, because entity inclusion starts only after the first letter.
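
The rule that entity inclusion starts only after the first letter can be sketched like this. It is a simplified illustration and not the scanner in siworin-wordsep.c: the functions entity_len and next_word are hypothetical, only ASCII letters count as letters here (isalpha in the default C locale), and skipping a leading entity as a whole is an assumption of this sketch.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: if an entity ('&', then ASCII letters, digits
 * or '#', then ';') starts at s, return its length in bytes, else 0. */
static size_t entity_len(const char *s)
{
    size_t n = 1;

    if (s[0] != '&')
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '#')
        n++;
    return (n > 1 && s[n] == ';') ? n + 1 : 0;
}

/* Hypothetical scanner: finds the next word in text. A word starts
 * at a letter; after that, letters and entities both continue it.
 * Stores the start of the word in *start and returns its length in
 * bytes, or 0 if there is no word. */
static size_t next_word(const char *text, const char **start)
{
    const char *p = text;
    size_t n;

    /* Before the first letter, entities are skipped as a whole. */
    while (*p != '\0' && !isalpha((unsigned char)*p)) {
        n = entity_len(p);
        p += n > 0 ? n : 1;
    }
    *start = p;

    /* After the first letter, entities are accepted as word parts. */
    for (;;) {
        if (isalpha((unsigned char)*p))
            p++;
        else if ((n = entity_len(p)) > 0)
            p += n;
        else
            break;
    }
    return (size_t)(p - *start);
}
```

With this sketch, a suffix written as "&#x2011;achtig" (a made-up example) yields the 6-byte word "achtig", without the non-breaking hyphen, whereas "t&nbsp;kofschip" is taken as one 15-byte word with the entity inside it.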

More comments about the word extractor.