In October 2001 I rejoiced at being able to do Google searches in other scripts than just the Latin one. While making the page in HTML though, I still used entities to spell out the Hebrew and Arabic examples. I had to painstakingly look up the Unicode scalars in the code pages. Did they already exist in 2001? I can’t remember. Maybe I used other sources. Fact is that ב was a Hebrew beth and ب was an Arabic baa’. Well, they still are, but now I can also copy&paste from examples in Wikipedia.
My favourite text editor is now nano
, a further development
of the very modest text editor pico
, which started as
a part of e-mail program pine
(later: alpine
).
Today nano
can do search and replace, including with
regular expressions, it can handle multiple files at the same time,
and it is very good with large text files, which it handles fast
and efficiently. And it supports direct entry (typed if I’d install
the right keyboard layout – and by copy&paste) of text in other scripts
than the Latin script. In fact anything that Unicode supports.
Editor nano
and all modern browsers are aware that
the Hebrew and Arabic script are right-to-left, not left-to-right
like most other scripts. So when encountering text in such scripts,
they correctly display it in the appropriate direction. Often this
even works in mixed-script text, like when you write in English
but occasionally quote a word or phrase in Yiddish, Hebrew,
Arabic, Farsi, Urdu, etc.
By using the Unicode Bidirectional Algorithm, defined by unicode.org, a browser can classify a space between two Hebrew words as right-to-left because of its position between a range of inherently right-to-left characters, even though the space itself is neutral as to directionality. The same is true of digits, full stops, commas, parentheses, exclamation marks and question marks. Modern Arabic has its own reversed question mark (؟), but Modern Hebrew and Yiddish use the Latin one (?). Modern Arabic also has a special comma (،), hex 60C, which is turned, not reversed (⹁), in comparison with the Latin comma (,).
This algorithm often works quite well, but not always. In the next two chapters, I present two examples I encountered in my own web writing, where it went wrong. I also show the simple solutions I used.
In late 2020 and early 2021, first in Interlingua, then in English,
I published an article in which
I quoted text from the Yiddish song
Tumbalalaika. One line was:
וואָס איז העכער פֿון אַ הויז?,
Vos iz hekher fun a hoyz? In English that means: What is
higher than a house?
I quoted it in a table, first coded as follows (here simplified to contain just that single line):
<table> <tr> <td>וואָס איז העכער פֿון אַ הויז?</td> <td>Vos iz hekher fun a hoyz?</td> </tr> </table>
The browser rendered it like this:
וואָס איז העכער פֿון אַ הויז? | Vos iz hekher fun a hoyz? |
Just to be sure you see the same as I do, here is a screenshot (click to enlarge). The browser displays the text incorrectly: the question mark should be to at the end of the question, that is, to the left of the Yiddish word הויז, just like it is to the right of the transliterated Yiddish word ‘hoyz’, and to the English word ‘house’.
But it isn’t. Why not? Because the bidirectional algorithm doesn’t know the directionality of the question mark. After it in the HTML code, it sees HTML tags, then Latin script letters. So it assumes the question mark belongs to the Latin script left-to-right text. Here that’s wrong, the question mark belongs to the Yiddish phrase.
But it could also have been right. If I write:
Do you easily understand the meaning of the
Yiddish word hoyz, הויז?
the question mark is part of the English sentence, and is
correctly placed to the right of the Hebrew letter h (ה),
not to the left of the Hebrew letter z (ז).
The right decision depends on semantics, not on syntax alone. So unless we would want to apply AI (Artificial Intelligence), an algorithm cannot know the right answer. And AI could easily get it wrong too.
After some research, I could easily solve it by adding a Unicode right-to-left mark after the question mark. Such a mark is itself invisible, but it does have a directionality, as the name implies: right to left. By adding it after the question mark, that question mark gets enclosed between two inherently right-to-left characters: the Hebrew letter z (ז) and the right-to-left mark. This tells the algorithm that the question mark can only be part of the Yiddish, Hebrew-script text, and so must placed after, that is to the left of the Yiddish word הויז.
I encoded this special sign as ‏, I could also have used ‏ (ugly and hard to memorise), or perhaps I could have copied&pasted it from some website. But I rather won’t, because then it would be invisible in my text editor too, and I always want to be able to see what I am doing.
New coding and the resulting display in the browser window:
<table> <tr> <td>וואָס איז העכער פֿון אַ הויז?‏</td> <td>Vos iz hekher fun a hoyz?</td> </tr> </table>
וואָס איז העכער פֿון אַ הויז? | Vos iz hekher fun a hoyz? |
Screenshot of what I get to see (click to enlarge). The question mark is now at the right side, that is, at the left side! 😀
This morning I had a similar problem, this time with commas and Arabic names. I could solve it this time by adding ‎ to code a left-to-right mark. It happened in this paragraph in this section on Arabic music, in an article otherwise about French and Italian music.
Here I show an example simplified for clarity, but the problem occurred with all three Arabic names I mentioned. Code:
Umm Kulthum (<span class=fonte-arabe>أم كلثوم</span>, 1898–1975), Farid al-Atrache
Rendering:
Umm Kulthum (أم كلثوم, 1898–1975),
Farid al-Atrache
The enclosure in
<span class=fonte-arabe> … </span>
is in fact unrelated to the topic of this article. I do it because
Arabic text often looks a bit small, and it requires extra line
height if voweled. Via the CSS in
textfont.css,
.fonte-arabe {font-size: 115%; line-height: 135%;}
I make a correction for that.
My intention here was to mention the Egyptian musicians’ names in a usual Latin form, then between parentheses the original Arabic, a comma, and the years of birth and death with an n-dash (–) in between. Seems simple enough. But it went wrong. The reason is the bidirectional algorithm assumes the comma, the digits of the years, and the n-dash all belong to the Arabic text. Therefore the whole of it is rendered right to left, and the year of birth is to the right of the year of decease.
The digits within the year are still the same. That’s because in Arabic, like in Dutch and German, the number 75 is expressed as five-and-seventy, not as seventy-five as in English. So if you read 75 from left to right, you correctly get the English order, but if you read it right to left, you get, also correctly, the Arabic order of the written and spoken language, خمسة وسبعون, xamsa wa sab`ayn!
To solve the problem, and get the rendering I intended to get, I added
a left-to-right mark (‎
) before the comma, so
the algorithm knows the Arabic right-to-left text ends there
(ignoring HTML tags), and the rest is to be interpreted as text in
Latin script.
Umm Kulthum (<span class=fonte-arabe>أم كلثوم</span>‎, 1898–1975), Farid al-Atrache
Rendering:
Umm Kulthum (أم كلثوم, 1898–1975),
Farid al-Atrache
The comma and the space are now to the right of the name in Arabic, then follow the years, in the Latin-script order, for English or in this case: for Interlingua, and the closing parenthesis. That was what I wanted. Adding the mark made that clear to the algorithm in a simple manner, with the desired effect.
There are different ways to solve problems of this type.
dir=rtl
.
lang=he
for Hebrew, lang=yi
for Yiddish,
lang=ar
for Arabic.
I didn’t test if those methods also work in my examples. They probably do. They are better because they more clearly indicate what the situation is. But I personally prefer to use ‏ or ‎ because of their brevity and astuteness. See also the Arabic letter mark.
Copyright © 2023 by R. Harmsen, all rights reserved.