1, 10 and 19 September 2021. Continued from the previous.
My word separator or extractor
siworin-wordsep.c
does not fully parse
HTML, XML, XHTML or whatever, but only takes the crude measure of ignoring
anything that is between < and >. Most of the time, that works well.
However, some text is missed even though it contains words that should
have been found, namely text inside attribute values. Examples:
<meta name="description" content=
"Text of the short description, some 160 chars max.">
<meta name="keywords"
content="Here is a comma-separated list of keywords">
<img src="source_of_image_file.jpg"
alt="What is there to see in this picture?">
An improvement would be to do real parsing, e.g. using Google’s Gumbo library (which I do already use elsewhere), and handle all the text elements that it finds. However, I have no plans to make such a change.
Addition 27 January 2022: Later on I made this behaviour switchable by config file, see SIWORIN_DONT_REMOVE_TAGS_WORDSEP and SIWORIN_DONT_REMOVE_TAGS_CONTEXT in siworin-config.h, and dont_remove_tags_wordsep and dont_remove_tags_context in the example config file siworin.conf.
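The crude measure described above can be sketched as a two-state scanner. This is only an illustration, not the code in siworin-wordsep.c: the function name strip_tags and the parameter dont_remove_tags are hypothetical stand-ins, and the sketch copies the text to a buffer, whereas a real word separator would also have to keep track of positions in the HTML.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, leaving out everything between '<' and '>'.
 * If dont_remove_tags is nonzero, the text is copied unchanged.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the final '\0'. */
size_t strip_tags(const char *src, char *dst, int dont_remove_tags)
{
    size_t n = 0;
    int in_tag = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (!dont_remove_tags && *src == '<') {
            in_tag = 1;             /* start ignoring */
        } else if (in_tag && *src == '>') {
            in_tag = 0;             /* stop ignoring */
        } else if (!in_tag) {
            dst[n++] = *src;        /* ordinary text: keep it */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Note that an alt or content attribute disappears together with its tag, which is exactly the missed text in the examples above.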
An HTML entity is a sequence that starts with an ampersand (&) and ends with a semicolon (;), with simple ASCII in between. It can be used to encode more complicated characters. This can be done symbolically – &Kcy; is a Cyrillic K, К; &Kappa; is the Greek K, Κ – or by Unicode scalar number: &#8209; or &#x2011; encodes a non-breaking hyphen.
The only entity I still use within a word is &shy;, for an optional hyphen, a suggested place to hyphenate a word if the need arises. I know a controversy exists as to the true nature of &shy;, also known as the ISO 8859-1 and Unicode character 0xAD (U+00AD). Some say it is a soft hyphen, used by layout software to mark visible hyphens that result from the hyphenation algorithm, so they were not put there by the author of the text. In my opinion though, that makes no sense in HTML. I use &shy; as an optional hyphen.
I normally have my web texts aligned both left and right, with any automatic hyphenation turned off. If the columns have a reasonable width, that works well even for languages that write composite words together, like Dutch and German, so they tend to have some rather long words. However, there are also a lot of short words to make up for that.
Occasionally, but surprisingly rarely, this approach creates text lines in which the words have excessive amounts of white space between them. Then, and only then, I place &shy; hyphenation suggestions, preferably on morphological boundaries, and such that the resulting parts do not differ in length too much.
Regardless of the true meaning of &shy;, so far all the browsers that I have used understood my intention and hyphenated accordingly, clearing the lines of undesirably long stretches of white space.
Now although I consider the string &shy; part of the word, I do not want it to appear in the word as extracted from the text. For example, when in various places on my website I have the words:
verbrandingsmotoren
ver&shy;brandings&shy;motoren
verbrandings&shy;motoren
verbran&shy;dings&shy;motoren
verbran&shy;dingsmotoren
(Dutch for ‘combustion engines’) I do not want all five of those to appear as index words, to be found only when entered exactly like they are written in the text. Instead, I want just the single word ‘verbrandingsmotoren’ to be included, which finds me all the other occurrences as well.
This can be achieved by accepting the entity or entities as part of the word, but removing them before writing the word, and its location in HTML, to the file of words. Of course this requires special care when displaying search results with context: the word in the index and the word in the actual HTML may have a different length.
(See function CalcWrdLen, and its call in function HighlightWords, in source file siworin-displ.c, for a rather rudimentary solution.)
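To make the length difference concrete, here is a minimal sketch of both directions: removing &shy; before the word goes into the index, and recovering how many bytes the word occupies in the HTML when it has to be highlighted. This is not the author's CalcWrdLen; the names strip_shy and html_word_len are hypothetical, and the sketch assumes that &shy; is always written in lower case.

```c
#include <stddef.h>
#include <string.h>

#define SHY     "&shy;"
#define SHY_LEN 5

/* Copy the word at html[0 .. html_len-1] to out, leaving out every "&shy;".
 * out must have room for html_len + 1 bytes.
 * Returns the length of the cleaned word, as it would appear in the index. */
size_t strip_shy(const char *html, size_t html_len, char *out)
{
    size_t i = 0, n = 0;

    while (i < html_len) {
        if (html_len - i >= SHY_LEN && memcmp(html + i, SHY, SHY_LEN) == 0)
            i += SHY_LEN;               /* skip the entity */
        else
            out[n++] = html[i++];       /* keep everything else */
    }
    out[n] = '\0';
    return n;
}

/* html points at the place in the HTML where an indexed word starts;
 * index_len is the length of that word in the index (without "&shy;").
 * Returns the number of bytes the word occupies in the HTML. */
size_t html_word_len(const char *html, size_t index_len)
{
    size_t i = 0, seen = 0;

    while (seen < index_len && html[i] != '\0') {
        if (strncmp(html + i, SHY, SHY_LEN) == 0) {
            i += SHY_LEN;               /* counts in the HTML only */
        } else {
            i++;
            seen++;                     /* counts in both */
        }
    }
    return i;
}
```

For ver&shy;brandings&shy;motoren the index word has 19 characters, but the highlighted stretch in the HTML is 29 bytes long.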
In HTML, languages like German, French, Spanish, Portuguese and Italian can be written using just plain 7-bit ASCII, and still have all the correct accented letters. Words like überhaupt, Köln, großartig, français, élève, élevé, España, Ibáñez, coração, canções, and pietà would then be encoded as the cumbersome and ugly &uuml;berhaupt, K&ouml;ln, gro&szlig;artig, fran&ccedil;ais, &eacute;l&egrave;ve, &eacute;lev&eacute;, Espa&ntilde;a, Ib&aacute;&ntilde;ez, cora&ccedil;&atilde;o, can&ccedil;&otilde;es and piet&agrave;.
I largely skipped that stage. With the exception of some remnants that I now find using my own prototype Siworin search engine, early on I made it a habit to write in ISO 8859-1, an encoding that covers all the languages I would ever write in, and most that I might ever cite a word from. They are spoken in a cross on the map of Europe, from Iceland to Albania, and from Finland to the Azores. Or that’s what I always said, but I now notice that the Canary Islands, farther south, are better suited.
But this too is a thing of the past: Unicode and UTF-8 now rule.
In the aforementioned earlier version of the word separator, I removed most entities, but kept those for accented letters and the like intact. In the new program, I remove only &shy;, and leave everything else. As a result, the other entities appear in the word list, and they can be found, but only by searching for part of the code. For example, searching for atilde; finds Covilhã in my photo page, where I find it is still written as Covilh&atilde;. But any occurrences of Covilhã properly written in UTF-8 are not found that way. Incorrect, but intended behaviour, because entities are simply not supported.
Full support would mean converting the entities to put them in the word list as UTF-8, and perhaps add an unaccented version too – as does Hyperestraier, and it does only that. But I’m not gonna implement anything like that, sorry. Too much work, and it violates my simplicity design criterion.
I often encode n-dashes and m-dashes as entities, &ndash; and &mdash;. But they are rarely adjacent to alphabetic characters. If they are, siworin-wordsep.c will interpret them as part of the word.
The same is true of a non-breaking space, &nbsp;. Examples in Dutch: à&nbsp;charge, t&nbsp;kofschip.
A special case is the non-breaking hyphen, &#8209;, which I sometimes use at the start of a suffix I mention, instead of a normal hyphen, which some unfortunate hyphenation algorithm might then put at the end of the line, all on its own. The non-breaking hyphen appears before the first letter (i.e. alphabetic character) of the suffix, so it won’t be included in the word list, because entity inclusion starts only after the first letter.
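The rule that entities count as part of a word only after its first letter can be sketched as follows. This is an illustration under simplifying assumptions, not the logic of siworin-wordsep.c: the names entity_len and scan_word are hypothetical, and only ASCII letters are treated as alphabetic, so accented letters in ISO 8859-1 or UTF-8 would need extra handling.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the entity at s: '&', then ASCII letters, digits or '#',
 * then ';'. Returns 0 if s does not start with such a sequence. */
static size_t entity_len(const char *s)
{
    size_t i = 1;

    if (s[0] != '&')
        return 0;
    while (isalnum((unsigned char)s[i]) || s[i] == '#')
        i++;
    return (i > 1 && s[i] == ';') ? i + 1 : 0;
}

/* Length of the word that starts at s, or 0 if s does not start with
 * a letter. After the first letter, both letters and entities are
 * accepted as part of the word. */
size_t scan_word(const char *s)
{
    size_t i, e;

    if (!isalpha((unsigned char)s[0]))
        return 0;                       /* an entity cannot start a word */
    i = 1;
    for (;;) {
        if (isalpha((unsigned char)s[i]))
            i++;
        else if ((e = entity_len(s + i)) > 0)
            i += e;
        else
            break;
    }
    return i;
}
```

With this rule Covilh&atilde; is taken as one word of 14 bytes, a dash entity glued to a word becomes part of it, and a leading &#8209; before a suffix is left out.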
More comments about the word extractor.
Copyright © 2021 by R. Harmsen, all rights reserved.