Exotic entities

17–19 December 2014. Idea 16 December, problem first encountered 26 November

Meta descriptions

A meta description helps Google and other search engines present found pages with a sensible indication of their contents. I never included such descriptions in my own HTML, so Google tries to compose a description from headers, links and parts of the text. Results are not always optimal.

So I added another task to my to-do list: add meta description tags everywhere. Has to be a long-term project due to the sheer number of files.

Some of the first files I thus modified were these three articles in Dutch: 1, 2, 3. On 26 November 2014 I checked if Google had already picked them up. It hadn’t. I wondered if my syntax was right, so I checked it with the validator. There was indeed an error, so I corrected it. Meanwhile Google has picked up the descriptions I supplied for those pages.

Correct syntax:
<meta name="description" content= "Text of the short description, some 160 chars max.">

So far, so good.
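
Because so many files are involved, finding the pages that still lack a description could be automated. The following is only a minimal sketch in Python (standard library only); the names `MetaDescriptionFinder` and `meta_description` are hypothetical, and the 160-character check simply mirrors the rule of thumb in the syntax example above.

```python
from html.parser import HTMLParser


class MetaDescriptionFinder(HTMLParser):
    """Remembers the content of the last <meta name="description"> tag seen."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def meta_description(html_text):
    """Return the meta description of a page, or None if there is none."""
    finder = MetaDescriptionFinder()
    finder.feed(html_text)
    return finder.description


if __name__ == "__main__":
    page = '<head><meta name="description" content="Short text."></head>'
    text = meta_description(page)
    # 160 characters is the assumed upper limit, taken from the example above.
    print(text, len(text) <= 160)
```

Running such a check over all files would give a list of pages still to be done, without having to open each one.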

Serbo-Croatian

While running that syntax check, I also found another error, which is what this article is actually about. W3.org’s validator service told me you can’t officially write the Serbian name Mladić as Mladi&cacute;. I checked for more occurrences in my articles and found Drazić here.

Of course, things like &aacute; for á and &eacute; for é do work. So I thought that, by analogy, I could also put an acute accent on letters like c and s, to get the special characters needed for languages such as Serbian, Croatian and Polish.

And indeed, that works in many browsers: Firefox (tested with 34.0.5), Chrome (39.0.2171.95m), Opera (version 12.17, build 1863). It does not work in Explorer 9 (9.0.8112.16421). And there is no requirement that it should work: here is a complete list.

Character sets

Strangely, this list does have &Scaron; (Š) and &scaron; (š), but not &Ccaron; (Č) and &ccaron; (č), and not &Zcaron; (Ž) and &zcaron; (ž).
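
Which names are official can also be checked without a browser. This is a sketch, assuming that the list referred to above corresponds to the HTML 4 entity set: Python's standard library ships that set as `html.entities.name2codepoint`, and the larger HTML5 set as `html.entities.html5` (whose keys include the trailing semicolon). The function name `entity_status` is made up for this illustration.

```python
from html.entities import html5, name2codepoint


def entity_status(name):
    """Classify an entity name (without '&' and ';') by the table it is in."""
    if name in name2codepoint:      # HTML 4 entity table
        return "HTML 4"
    if name + ";" in html5:         # HTML5 table, keys end in ';'
        return "HTML5 only"
    return "unknown"


if __name__ == "__main__":
    for name in ("Scaron", "scaron", "Ccaron", "ccaron", "Zcaron", "zcaron"):
        print(name, entity_status(name))
```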

(A caron is more often called haček. The character Š occurs in the Czech name Šimek.)

Why don’t they all work? I first thought the explanation would lie in ISO-8859-1: that this character set would contain š (in upper and lower case) but not the other haček characters. Now (18 December) that I check, I find that this isn’t true: ISO-8859-1 does not contain any of the caron characters, and Microsoft’s variant encoding CP1252 has Š/š (which have entities) but also Ž/ž (which do not officially have entities).
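
The character-set membership described here is easy to verify. A minimal sketch in Python, relying on its built-in `iso-8859-1` and `cp1252` codecs; the helper name `encodable` is an assumption of this example.

```python
def encodable(ch, encoding):
    """Return True if the character exists in the given character set."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for ch in "ŠšŽžČčáé":
        print(ch, encodable(ch, "iso-8859-1"), encodable(ch, "cp1252"))
```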

Basic sets and editors

So although the details vary, there is still a tendency that special characters contained in ISO-8859-1/CP1252, such as á, é, ã, ç, à, ê do have entities, and those outside those character sets (now obsoleted by Unicode) have not.

Personally I prefer to write all my pages in ISO-8859-1. That supports all the languages I ever write in: Dutch, English, German, Portuguese, Interlingua, plus rarely French and Spanish. In fact, ISO-8859-1 supports many more languages than I’ll ever need: those of a large X across Europe: from Finnish in the northeast to Spanish and Portuguese in the southwest (including Madeira, Canary Islands and Azores); and from Icelandic in the northwest via Italian to Albanian in the southeast.

As a legacy of the colonial period, all of the Americas and large parts of Africa (when looking at official languages rather than local languages) are also covered.

Practically the only occasion I have to deviate from ISO-8859-1 is for demonstration purposes, and when mentioning examples in languages like Greek (here and here) or Arabic, while writing in a language I do know (not being those languages themselves).

So entities are ideal for me: I use them to escape occasionally from the normally acceptable limitations of ISO-8859-1. Therefore I don’t need &aacute; (á), &agrave; (à), &acirc; (â) and &atilde; (ã): those I type directly using the US International keyboard layout.

The entities I do need are for characters outside of ISO-8859-1. I prefer symbolic entities (e.g. &scaron; for š; &epsilon; for ε), which are clearer, over hexadecimal entities (&#x161;); and hexadecimal over decimal entities (&#353;). Hexadecimal codes are easier to look up in Unicode code tables.
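
The three notations for one character can be derived mechanically. A small sketch in Python, assuming the HTML 4 table in `html.entities.codepoint2name` as the source of symbolic names; the function name `entity_forms` is hypothetical.

```python
from html.entities import codepoint2name


def entity_forms(ch):
    """Return the symbolic, hexadecimal and decimal entity for a character."""
    code = ord(ch)
    name = codepoint2name.get(code)   # None if HTML 4 defines no name
    return {
        "symbolic": f"&{name};" if name else None,
        "hexadecimal": f"&#x{code:X};",
        "decimal": f"&#{code};",
    }


if __name__ == "__main__":
    for ch in "šεč":
        print(ch, entity_forms(ch))
```

For č the symbolic form comes out as None, which matches the observation that HTML 4 has no official name for it.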

Of course I could also consistently write web pages in UTF-8 to make any Unicode characters directly accessible. But for that I need editors that support Unicode UTF-8. Notepad in Windows Vista does. Conversely, mdiNotepad by Tom Kostiainen only supports displaying Unicode letters, which unfortunately don’t always survive editing and saving.

I used to use pico on Unix (FreeBSD), and I recently switched to the more powerful nano. I managed to set the correct locale so nano now supports ISO-8859-1. It can probably be made compatible with other character sets as well, but I didn’t try very hard to get that to work. ISO-8859-1 is enough for me, and will be until I die.

Addition 3 January 2023:

I’m still alive, and since July 2019 I have been on Linux Mint, using nano as my standard text editor. Occasionally also under Ubuntu or Alpine Linux. The nano editor fully supports Unicode and UTF-8, so I can use IPA for phonetics; Greek, Hebrew and Arabic for etymology and musical quotes; and Greek, Georgian (demo), and Javanese for script experiments with Interlingua, etc. I enjoy that a lot.

See also the uniformisation of fonts on the site.

Esperanto

Esperanto is a language that falls outside of the scope of ISO-8859-1, because of its u with breve and c, g, s, j and h with a circumflex. An old workaround is to write ‘ch’ for ‘ĉ’ etc., but this isn’t 100% unambiguous. For example, when converting a web page written with this convention, you must take care that the ‘ch’ in charset=ISO-8859-1 in the head isn’t also converted, to become ĉarset!

A newer convention is to use ‘cx’ instead of ‘ch’, ‘sx’ for ‘sh’, etc. But that is ugly. The real characters of course look best.
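
The ambiguity of the ‘ch’ convention, and the reason the ‘cx’ convention is safer to convert, can be shown with a naive search-and-replace. This is only an illustrative sketch in Python; the table names `H_SYSTEM` and `X_SYSTEM` and the function `convert` are assumptions of this example, and the naive replacement is deliberately not a recommended converter.

```python
# Digraph tables for the two ASCII conventions (lower case only;
# capitalised digraphs such as "Ch" are derived in convert()).
H_SYSTEM = {"ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}
X_SYSTEM = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}


def convert(text, table):
    """Naively replace every digraph from the table, wherever it occurs."""
    for digraph, letter in table.items():
        text = text.replace(digraph, letter)
        text = text.replace(digraph.capitalize(), letter.upper())
    return text


if __name__ == "__main__":
    head = "charset=ISO-8859-1"
    print(convert(head, H_SYSTEM))   # the 'ch' in charset is converted too
    print(convert(head, X_SYSTEM))   # no 'x' digraphs, so nothing changes
    print(convert("ehxosxangxo cxiujxauxde", X_SYSTEM))
```

A real converter would have to skip markup and the page head, or work from a word list, to avoid this kind of accident.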

I learnt Esperanto around 1974 or 1976 and didn’t do much with it later on, except for making an Esperanto version of my Vzajku manifest, which was only ever finished in English, in 1988 or 1989 (no more accurate dates are available).

In December last year I computerised part of that old text, and I plan on continuing soon. For me Esperanto is only of historic interest; for new texts in a constructed language I prefer Interlingua, which can be written in simple ASCII without any special requirements. Much better.

Last year I experimentally found that &ccirc; etc worked to make a ĉ, and that &ubreve; renders as ŭ. Now a year later, I am aware (as discovered on 16 December 2014) that this too is not official: it works in Firefox, Chrome and Opera, but not in Explorer 9. And officially it need not work in any browser.

9 March 2020: see the note below.

Compliance

My intention is to follow standards as much as possible and to support many platforms and browsers. The consequence is that I would have to replace all the more readable occurrences of &ccirc; for ĉ with the ugly and hard-to-understand &#x109; etc. Is that really necessary? Not if W3.org did what I would deem sensible: give official status to many more of those entities that make sense, so all browsers would have to support them, not just most.

But they probably won’t listen to me. (20 October 2016: They did, probably already before I wrote my article.)

Just this morning (19 December), a better solution occurred to me (i.e. I knew it already, but didn’t think of it recently): use Unicode’s combining diacritics, from the 0300 range. Those do also work in Internet Explorer!

This leads me to the following equivalence and conversion table:

Character name	Hexadecimal entity	Decimal entity	Unofficial mnemonic entity	With combining diacritics
Slavic languages, acute
C with acute	Ć &#x106;	Ć &#262;	Ć &Cacute;	Ć C&#x301;
c with acute	ć &#x107;	ć &#263;	ć &cacute;	ć c&#x301;
S with acute	Ś &#x15A;	Ś &#346;	Ś &Sacute;	Ś S&#x301;
s with acute	ś &#x15B;	ś &#347;	ś &sacute;	ś s&#x301;
Slavic languages, caron/haček
C with caron	Č &#x10C;	Č &#268;	Č &Ccaron;	Č C&#x30C;
c with caron	č &#x10D;	č &#269;	č &ccaron;	č c&#x30C;
S with caron	Š &#x160;	Š &#352;	Š &Scaron;	Š S&#x30C;
s with caron	š &#x161;	š &#353;	š &scaron;	š s&#x30C;
Z with caron	Ž &#x17D;	Ž &#381;	Ž &Zcaron;	Ž Z&#x30C;
z with caron	ž &#x17E;	ž &#382;	ž &zcaron;	ž z&#x30C;
Esperanto, breve
U with breve	Ŭ &#x16C;	Ŭ &#364;	Ŭ &Ubreve;	Ŭ U&#x306;
u with breve	ŭ &#x16D;	ŭ &#365;	ŭ &ubreve;	ŭ u&#x306;
Esperanto, circumflex
C with circumflex	Ĉ &#x108;	Ĉ &#264;	Ĉ &Ccirc;	Ĉ C&#x302;
c with circumflex	ĉ &#x109;	ĉ &#265;	ĉ &ccirc;	ĉ c&#x302;
G with circumflex	Ĝ &#x11C;	Ĝ &#284;	Ĝ &Gcirc;	Ĝ G&#x302;
g with circumflex	ĝ &#x11D;	ĝ &#285;	ĝ &gcirc;	ĝ g&#x302;
S with circumflex	Ŝ &#x15C;	Ŝ &#348;	Ŝ &Scirc;	Ŝ S&#x302;
s with circumflex	ŝ &#x15D;	ŝ &#349;	ŝ &scirc;	ŝ s&#x302;
J with circumflex	Ĵ &#x134;	Ĵ &#308;	Ĵ &Jcirc;	Ĵ J&#x302;
j with circumflex	ĵ &#x135;	ĵ &#309;	ĵ &jcirc;	ĵ j&#x302;
H with circumflex	Ĥ &#x124;	Ĥ &#292;	Ĥ &Hcirc;	Ĥ H&#x302;
h with circumflex	ĥ &#x125;	ĥ &#293;	ĥ &hcirc;	ĥ h&#x302;
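
The numeric columns and the combining-diacritics column of such a table can be generated rather than typed. The sketch below uses Python's `unicodedata` module and assumes that each character decomposes (Unicode form NFD) into exactly one base letter plus one combining mark, which holds for the characters listed here; the function name `table_row` is hypothetical.

```python
import unicodedata


def table_row(ch):
    """Return (hex entity, decimal entity, base letter + combining entity)."""
    code = ord(ch)
    # NFD splits a precomposed letter into base letter and combining mark.
    # Assumption: exactly two code points result, as for the letters above.
    base, mark = unicodedata.normalize("NFD", ch)
    return (f"&#x{code:X};", f"&#{code};", f"{base}&#x{ord(mark):X};")


if __name__ == "__main__":
    for ch in "ĆćŚśČčŠšŽžŬŭĈĉĜĝŜŝĴĵĤĥ":
        print(ch, *table_row(ch), sep="\t")
```

The mnemonic names are not derived here, because they come from an entity table and not from the Unicode data.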

Addition 20 October 2016

See also this reference table for HTML5.

Addition 9 March 2020

Search engine Hyper Estraier, which I introduced on this site in October last year, didn’t properly handle the Esperanto texts on my site. (There aren’t many, but some.) The way I encoded Esperanto’s special characters, using entities containing #x302; and #x306;, or circ; and breve;, as listed in the table above, caused Hyper Estraier to not recognise Esperanto word boundaries.

Yesterday evening it occurred to me that UTF-8 might solve this. I tested it, and yes, that works! So today I am converting all pages that contain the encodings, so they will be using Ĉ, ĉ, ĝ, ŝ, ĵ, ŭ, etc. directly in Unicode, encoded in UTF-8.
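
Such a conversion could be scripted along the following lines. This is a sketch under stated assumptions, not the procedure actually used for the site: it relies on Python's `html.unescape` (which knows the HTML5 names such as &ccirc; and &ubreve;), leaves markup-significant entities like &amp; and &lt; alone, and then applies Unicode normalisation form NFC so that a base letter plus combining diacritic becomes one precomposed character. The names `ENTITY`, `KEEP` and `to_utf8_text` are made up for this example.

```python
import html
import re
import unicodedata

# Matches decimal, hexadecimal and named entities, e.g. &#265; &#x109; &ccirc;
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

# Characters that must stay escaped (or are clearer as entities) in HTML source.
KEEP = set("<>&\"'\u00a0")


def to_utf8_text(source):
    """Replace entities by real characters and compose combining sequences."""

    def decode(match):
        entity = match.group(0)
        ch = html.unescape(entity)
        if ch == entity or ch in KEEP:   # unknown name, or needed for markup
            return entity
        return ch

    return unicodedata.normalize("NFC", ENTITY.sub(decode, source))


if __name__ == "__main__":
    old = "&ccirc;iu&jcirc;a&ubreve;de &amp; c&#x302;"
    print(to_utf8_text(old))
```

The result still has to be saved with UTF-8 as the file encoding, and the charset declared in the head of each page has to be changed to UTF-8 accordingly.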