I was using grep to find out more about encodings
of Yiddish (in Hebrew script, of course, written
right to left), and I got unreadable results.
I first thought that was just me, because my Yiddish
reading skills are very limited: I spell out words like
a six-year-old. But on closer inspection, I found that
there were really parts of sentences in the wrong order,
and also swapped letters within words.
To get a better understanding of what happened, I created a very simple example.
I made a text file with a single line that contains the single
Yiddish word azoy, in Hebrew script: אַזױ, spelled
alef, patah, zayin, ligature of vav and yod.
Then I grepped for occurrences of the oy
character, ױ.
By default, grep (I run it under Lubuntu 23.10)
colours its results. When I disabled that, everything
worked fine:
grep --color=never ױ filewithazoy
correctly found and displayed:
אַזױ
But when I did not disable result colouring, the oy
character was displayed correctly in red, but
in the wrong order:
ױאַז
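The test above can be reproduced with a few commands. This is a minimal sketch, assuming GNU grep and a terminal using UTF-8; the file name filewithazoy is the one used in the text, and --color=always is used only to force colouring regardless of how grep is configured by default.

```shell
# Create a one-line test file containing the Yiddish word azoy.
printf 'אַזױ\n' > filewithazoy

# Colouring disabled: the hit should be displayed in the right order.
grep --color=never ױ filewithazoy

# Colouring forced on: this is the output that some terminals display scrambled.
grep --color=always ױ filewithazoy
```

Whether the second command looks wrong depends on the terminal emulator, not on grep: the bytes grep writes are the same everywhere.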
I suppose this is caused by the escape sequences for rendering the colours (or colors, if you will, in American English). The ones ending in m are SGR, Select Graphic Rendition; the ones ending in K are EL, Erase in Line. They look as follows:
אַז<esc>[01;31m<esc>[Kױ<esc>[m<esc>[K
(I forced the Hebrew characters into left-to-right order for the occasion, by prepending a Unicode character 0x202D, left-to-right override. By <esc> I mean the ASCII escape character, hex 1B or octal 033.)
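To look at such escape sequences without the terminal interpreting them, the escape character can be replaced by visible text. The following is a sketch, not the method used for the listing above; show_esc is a hypothetical helper name, and the demonstration line uses Latin letters instead of Hebrew so that it reads the same in any terminal.

```shell
# show_esc (hypothetical helper): replace each ASCII escape character
# (octal 033) in the input by the visible text <esc>.
show_esc() {
    esc=$(printf '\033')
    sed "s/${esc}/<esc>/g"
}

# Demonstration on a hand-written sequence shaped like grep's colouring.
printf 'ab\033[01;31m\033[Kcd\033[m\033[K\n' | show_esc
# prints: ab<esc>[01;31m<esc>[Kcd<esc>[m<esc>[K

# Possible use on real grep output (file name as in the text):
#   grep --color=always ױ filewithazoy | show_esc
```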
Apparently in the terminal or in bash, Unicode’s
bidirectional algorithm is applied before
interpreting the escape sequences, so the presence of Latin
characters messes up the order of the Hebrew characters.
I think it should be the other way round: render the colours
from the escape sequences, and only then apply
the bidirectional algorithm on the Hebrew-only result.
But that’s probably easier said than done. However, browsers
do handle HTML in that manner.
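The first half of that idea can be illustrated from the shell: take the escape sequences out of the stream, and what remains is the Hebrew-only text that the bidirectional algorithm ought to see. This is only a sketch under the assumption that the stream contains nothing but sequences ending in m or K, as in grep's output shown above; strip_sgr is a hypothetical helper, and it does not apply any bidirectional reordering or repair a terminal.

```shell
# strip_sgr (hypothetical helper): delete escape sequences of the form
# <esc>[ ... m and <esc>[ ... K, leaving only the plain text.
strip_sgr() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;]*[mK]//g"
}

# Demonstration on a hand-written sequence shaped like grep's colouring.
printf 'ab\033[01;31m\033[Kcd\033[m\033[K\n' | strip_sgr
# prints: abcd
```

A terminal that worked in this order would remember which characters were red, reorder the plain text, and then paint the colours onto the reordered characters.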
What I tried, without success:
The he_IL locale, activated for qterminal and bash.
Not only LC_ALL, but also LANG and LANGUAGE set to Hebrew.
grep with --color=always, so that it sends colouring escape sequences into a pipe as well.
I wrote a little C program that adds Unicode characters 0x200F (right-to-left mark) before and after the line.
In GREP_COLORS, I added the parameter ne to the default string ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36, hoping to avoid what the grep manual page calls “Erase in Line (EL) to Right”. That indeed changed the escape sequence from what it was to the same without all occurrences of escape K:
אַז<esc>[01;31mױ<esc>[m
Nothing worked for me.
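Two of these attempts are easy to repeat from the command line. The sketch below assumes GNU grep; the C program is not shown in the text, so wrap_rlm is a hypothetical shell stand-in for it, and appending ne at the end of the GREP_COLORS string is an assumption about where it was added.

```shell
esc=$(printf '\033')            # ASCII escape, octal 033
rlm=$(printf '\342\200\217')    # U+200F right-to-left mark, UTF-8 bytes E2 80 8F

# The default GREP_COLORS string quoted above, with ne appended.
colors_ne='ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36:ne'

# wrap_rlm (hypothetical helper): add a right-to-left mark before and
# after every input line.
wrap_rlm() {
    sed "s/^/${rlm}/; s/\$/${rlm}/"
}

# Colouring forced into a pipe with ne set; escapes are made visible as <esc>.
printf 'אַזױ\n' | GREP_COLORS="$colors_ne" grep --color=always ױ | sed "s/${esc}/<esc>/g"

# The same coloured hit, wrapped in right-to-left marks, sent to the terminal.
printf 'אַזױ\n' | GREP_COLORS="$colors_ne" grep --color=always ױ | wrap_rlm
```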
I wonder how people in Israel do this? Or those working with Yiddish in New York etc.?
I also posted the question in the Facebook group Linux Commands, here, and in the forum Superuser.
No suggestions or solutions so far, 3 June 2024 at 14:15B.
Someone on Superuser suggested trying gnome-terminal. Even
though Lubuntu doesn’t use GNOME, it does have gnome-terminal
as an installable program. So I tested version 3.49.92, which uses
VTE 0.74.0, instead of qterminal 1.3.0, and then
grep’s highlighting is shown correctly!!!
Problem solved, thanks!
Addition 4 June: KDE’s konsole, apart from strange font
handling and cursor placement, also does the order in coloured
right-to-left grep hits correctly. So the problem
seems to be specific to LXQt’s qterminal.
Update 5 June 2024: retested with qterminal 1.4.0 under
Lubuntu 24.04: the bug persists.
Addition 10 June 2024: lxterminal 0.4.0, as included in
Bunsenlabs Linux version Boron, does not have the bug.
It is interesting to see how this works out with that other famous language
written right to left, Arabic. I tested with a file that
contained the name of Cairo in Arabic, القاهرة al-qaahira(t),
then grepped for hr, هر. Same result: the highlighting
is wrong in qterminal, and right in gnome-terminal.
Not surprising, but good to know.
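The Arabic test can be repeated without a file by piping the word into grep. A sketch, again assuming GNU grep and a UTF-8 terminal; the -o option is added here only to show that the two letters really are matched.

```shell
# Print only the matched part: the two letters h and r.
printf 'القاهرة\n' | grep --color=never -o هر

# The whole line with the hit coloured: compare this across terminal emulators.
printf 'القاهرة\n' | grep --color=always هر
```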
Copyright © 2024 by R. Harmsen, all rights reserved.