Grepping Unicode

30 May to 3 June 2024

Documenting my adventures

The Unix program grep, which of course is also a Linux program, is very useful for searching text files.

I make it a habit to document my computer adventures. When I have a problem with Windows, a web server or mail server, FreeBSD, Linux, hardware, etc., I routinely make notes of web pages I find that might contain a solution, of whether a suggested fix improved anything or didn’t, or perhaps didn’t apply to my situation, and of how the story ended: a full solution found, or maybe a remaining minor problem accepted, perhaps to be solved later.

I make those notes in the first language that occurs to me, sometimes English but also often my mother tongue Dutch, and the sentences and structure don’t have to be publication-ready, only written well enough that I can still understand them myself weeks, months or years later. Even so, this takes a lot of extra time and effort, and it slows down the process of finding a solution.

But it is worth it: several times I later ran into the same or a similar problem, and remembered that I had solved it before, but not how, and especially not the details. Then of course having those notes is very useful.

Searching text files

Originally I made those notes in MSWord, later in Libreoffice, with file name extensions rtf, doc, docx and odt. The disadvantage is that it is hard to find things back in such documents. I remember seeing a way, in MSWord, to search for text in multiple files, but I never managed to understand how it worked or what the idea was, and I never found anything with it.

At some point I stopped making such word processor documents, but used simple HTML files instead, viewable in a browser, written in a text editor. Same as what I do on the web. The big plus: HTML is text, and text is searchable by grep. Using Unix shells’ wildcards, that includes searching more than one file at a time, and with the -r command line option to grep, it even does recursive searches: it traverses directories and subdirectories, and searches all the files in them.

Of course that can be time-consuming: a plain recursive search is not limited to certain file name extensions like htm or html, so grep will even search binary files like jpg and mp3. A wasted effort. (GNU grep can avoid that, though; see the sketch below.)
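With GNU grep, the grep of the Linux systems mentioned below, the --include option restricts a recursive search to file names matching a pattern; I cannot say whether other grep implementations have it. For example (searchword is just a placeholder):
grep -r --include='*.htm' --include='*.html' searchword .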

However, a recursive search for a constant string through the files of my website (over 3000 of them, over 2200 of them HTML, more than 200 megabytes in total) takes only a third of a second on my not extremely fast laptop. Negligible. Unbelievable but true.

Linux supports Unicode

Modern Linux systems fully support Unicode, at least in the UTF-8 encoding. So instead of English, you can also use Portuguese, Greek, Arabic, Hebrew, Hindi, you name it, for your search words. Probably also Japanese and Chinese, but I never tried that.

This is true of at least Linux Mint (18.3 and up), Lubuntu (20.04, 22.04) and Alpine (currently 3.20, but many versions before that too).

To use that, you need to be able to type the search argument containing the special characters of the language; or you can copy it from somewhere, then paste it into the command line in the terminal program.

Special characters

But what if you don’t have the proper keyboard support, because the necessary layout isn’t there or you don’t want to install it? Or you want to investigate something that even a specialised keyboard layout cannot handle?

I’ll give an example, which is somewhat trivial because it was simplified from the real problem that made me look for a solution; I will describe that real problem in a later chapter.

The letter á, an a with an acute accent, can be written directly as a self-contained Unicode character. It can be found in the code chart for the Latin-1 Supplement: the code, or Unicode scalar, is 00E1 hexadecimal. But an á can also be made as a composite character, consisting of a normal letter ‘a’ (hex 61), followed by hex 301, “combining acute accent”, from the Combining Diacritical Marks block.

Unicode.org has rules and recommendations for what to use and how to handle variant forms, which are quite complicated: see Unicode Normalization Forms and Canonical Equivalence in Applications. I’m not sure I fully understand them, but I think the bottom line is that using á directly as a character is better than adding a separate diacritic.
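Converting whole files to the precomposed forms is what Unicode calls Normalization Form C (NFC). As a sketch of how that could be done, assuming ICU’s uconv tool is installed (on Ubuntu-family systems it is typically in the package icu-devtools; the file names are made up):
uconv -f utf-8 -t utf-8 -x any-nfc input.htm > output.htm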

So now suppose you want to make an HTML document or a whole site consistent, and see whether the variant with the separate combining diacritic occurs anywhere. How to grep for it?

Alt-X

MSWord has a nice facility for entering any Unicode character, and I find Libreoffice Writer now also has it (the version currently installed in my Lubuntu 22.04 is 7.6.7.2): You enter the hexadecimal digits of the Unicode scalar, followed by Alt-x. So typing this:
61Alt-x301Alt-x
should produce a composite á. Except that it doesn’t; I get this character: ꌁ. Is it a mouse? No, it’s a YI SYLLABLE NZURX, whatever that is. The trouble is that Libreoffice Writer first renders the letter a from 61Alt-x, and then combines that a with the 301 typed after it, to make the Unicode code point A301. Not my intention.

The better way to do it: first type 301Alt-x; this produces the combining acute accent, with a circle to indicate where the character to combine it with should appear. Move the cursor one position to the left and type the letter a. Et violá: á.

Now you can copy this, and paste it into the search command, to get this:
grep -r á .
In case that doesn’t work (and it may be hard with right-to-left scripts, as in my real situation, described below), there is another helpful trick, which I found here. Someone there recommended typing
printf '\xE2\x98\xA0'
in the command line to get a skull and crossbones symbol: ☠. (Those three bytes, E2 98 A0, are the UTF-8 encoding of the Unicode scalar 2620.)

Printf

You can also use that to compose a grep command containing special characters. But for that you need the actual bytes of the Unicode encoding, that is, the underlying UTF-8.

By the way, from my earliest encounters with Unix, starting in 1985, I do remember printf as part of the programming language C, or more correctly of its stdio library. In terms of manual pages, that is printf(3), invoked as man 3 printf. I do not remember a man 1 printf. Perhaps it is a later extension, or a built-in of bash. Or maybe it was already there back then and I have forgotten about it. Anyhow, it works in my bash 5.2.15 under Lubuntu 23.10.

So when you know the Unicode scalar, how do you find out what the UTF-8 for it is? You may want to try my utf8cntx or utfcntxt for that. In MSWord or Libreoffice Writer, save the text you obtained previously as a text file, encoded as UTF-8 (that is usually the default), and analyse that small text file with my tools. From that you can learn that the Unicode scalars 61-301 are represented in UTF-8 as 61-cc-81.
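Standard tools can reveal those bytes too. A minimal sketch using od from coreutils: paste the character obtained earlier between the quotes, and its bytes appear in hexadecimal:
printf '%s' 'á' | od -An -tx1
# prints 61 cc 81 if the pasted á was the composite variant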

That makes the search command you need:
grep -r `printf '\x61\xcc\x81'` .
It works! Enclosing the printf command in backquotes (or backticks) ` and ` means: execute what’s in between first, and use the result as a parameter for the other command. This is one of the things that in 1985 made me fall in love with Unix: so smart, so handy.
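A side note: bash’s built-in printf (since version 4.2) and GNU coreutils printf also understand \uHHHH escapes for Unicode scalars, so on such systems the UTF-8 bytes need not be worked out at all. Combined with the more modern $(...) notation for command substitution, the same search can be written as:
grep -r "$(printf 'a\u0301')" .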

My situation

The actual situation that led me to discover these tricks and made me write this article was a little bit more complicated. So now I’ll tell the real story.

I have cron periodically create a completely new index for my search engine siworin, which can optionally also produce a list of all the unique words used in my website. I compare that list against the one from the previous run, which yields the words that never occurred before, but now do. See chapter 5, the siworin algorithms, item 8. This is a great additional way to detect spelling errors: when I find an error, I correct it, so nearly all the words already in the list are correct. That means that when I make a fresh spelling mistake, it is likely to appear among the newly used words that cron e-mails to me. Not all new words are errors, but some or many are.
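Siworin’s actual implementation may differ, but the comparison step could be as simple as this sketch with comm from coreutils, assuming both word lists are sorted, one word per line (the file names are made up):
# print only the words that are in the current list but not in the previous one:
comm -13 previous-words.txt current-words.txt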

On 30 June, while I was working on an article, two seemingly identical Yiddish words (of Polish origin) appeared in the e-mail I automatically received from my website:
בלאנדזשען
בלאָנדזשען
In fact they are not identical; they couldn’t be, because then the word would appear in the list only once. Here in the web article, the difference is visible: an unmarked alef vs. alef qamats, א vs. אָ. But strangely, in nano versions 7.2 (Lubuntu 23.10) and 8.0 (Alpine Linux 3.20) they looked the same: the qamats diacritic was invisible.

Using my utf8cntx or utfcntxt I found that the sign was encoded as a regular Hebrew script alef, followed by a qamats diacritic. It is also possible to encode this character (used in Yiddish to write the letter and sound o) as a single Unicode sign, from the range of presentation forms: code FB2F (hex), Hebrew letter alef with qamats, אָ, אָ.
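Both encodings can be checked at the byte level with printf and od; a small sketch, assuming a printf that understands \uHHHH escapes (bash 4.2 or newer, or GNU coreutils printf):
printf '\u05d0\u05b8' | od -An -tx1   # alef plus qamats diacritic: d7 90 d6 b8
printf '\ufb2f' | od -An -tx1         # the single presentation form: ef ac af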

What is better? Which way is more usual? Whatever I do, I want it to be consistent site-wide. So I wanted to find what I had used in earlier articles, without having to remember where those are, because that is something computers are better at than humans.

Performing the steps described in the previous chapters, I found out what the Unicode scalars are, their UTF-8 encodings, and the commands to look up the encoded characters. In the process, I got this output from utfcntxt:

000000 0xd7-90-0a-d7 0x0005d0: .............
000002 0x0a-d7-90-d6 0x00000a: ............
000003 0xd7-90-d6-b8 0x0005d0: ...........
000005 0xd6-b8-0a-ef 0x0005b8: ..........
000007 0x0a-ef-ac-af 0x00000a: .........
000008 0xef-ac-af-0a 0x00fb2f: ........
000011 0x0a-0a-d7-90 0x00000a: .......
000012 0x0a-d7-90-d6 0x00000a: ......
000013 0xd7-90-d6-b7 0x0005d0: .....
000015 0xd6-b7-0a-ef 0x0005b7: ....
000017 0x0a-ef-ac-ae 0x00000a: ...
000018 0xef-ac-ae-0a 0x00fb2e: ..
000021 0x0a-..-..-.. 0x00000a: .
What is it?                       Scalars    UTF-8        Command
Plain alef                        05d0       d7-90        grep -r `printf '\xd7\x90'` .
Alef qamats with diacritic        05d0-05b8  d7-90 d6-b8  grep -r `printf '\xd7\x90\xd6\xb8'` .
Alef qamats as presentation form  fb2f       ef-ac-af     grep -r `printf '\xef\xac\xaf'` .
Alef patah with diacritic         05d0-05b7  d7-90 d6-b7  grep -r `printf '\xd7\x90\xd6\xb7'` .
Alef patah as presentation form   fb2e       ef-ac-ae     grep -r `printf '\xef\xac\xae'` .

Result of the searches: I have indeed been inconsistent, probably as a result of quoting, i.e. copying, Yiddish text from other websites. The method with the diacritic is the most frequent; the presentation forms are rare. How does the New York Yiddish newspaper Forward (פֿאָרווערטס, Forverts) do it? I’d prefer to follow them, as they use strict YIVO spelling, AFAIK.

From a sample article, I find that they do not use presentation forms, only diacritics. So I’ll correct the places on my site where I didn’t.

By the way, I also found that Forverts uses the ligatures 05F0, 05F1 and 05F2 inconsistently: they are often used, but the combinations of the separate letters that the ligatures connect also occur:
05F0 or 05D5-05D5
05F1 or 05D5-05D9
05F2 or 05D9-05D9
So be it.
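For completeness: because the ligatures and the separate letter pairs are different byte sequences, the same grep-with-printf approach can tell them apart. A sketch, again assuming \u support in printf:
grep -r "$(printf '\u05f2')" .         # the double yod ligature
grep -r "$(printf '\u05d9\u05d9')" .   # two separate yods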

Grep colouring

There was a problem with grep’s coloring (American spelling) of search results. I will describe that in a separate article.

Editing with sed

(This chapter added 3 June 2024.)

Editing such sequences with a text editor, or even a word processor, can be hard. I renamed a *.htm file to *.htm.txt to make Libreoffice treat it as a text file, without interpreting and ruining the HTML. Then I could enter exact Unicode scalars using the trick with Alt-x. However, grep afterwards still found the presentation forms, not the version with diacritics that I had entered. Would Libreoffice automatically change that? That would be bad. However, I’m not entirely sure what happened.

Then I had the wild idea of combining sed with printf, like this:
sed s/`printf '\xef\xac\xaf'`/`printf '\xd7\x90\xd6\xb8'`/g pres.htm > diac.htm

It looks weirdly complicated. But in fact it is just the basic and well-known global sed substitution, sed s///g, with two printf commands inserted between the slashes, each between backticks ` and `.

And it works! At least it does in my situation: Lubuntu 23.10, GNU bash version 5.2.15(1), GNU sed version 4.9; printf does not seem to have a version number of its own.
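A sketch of a possible extension, assuming GNU sed: to apply the same substitution to a whole directory tree in place, the -i option can be combined with find. Note that -i overwrites the files, so keep a backup:
find . -name '*.htm' -exec sed -i "s/$(printf '\xef\xac\xaf')/$(printf '\xd7\x90\xd6\xb8')/g" {} +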