30 May to 3 June 2024
The Unix program `grep`, which of course is also a Linux program, is very useful for searching text files.
I make it a habit to document my computer adventures. When I have a problem with Windows, a web server or mail server, FreeBSD, Linux, hardware, etc., I routinely make notes: web pages I find that might contain a solution, whether trying it improved anything or didn’t (maybe it didn’t fit my situation), and how the story ended: finding a full solution, or maybe accepting a remaining minor problem, perhaps to be solved later.
I make those notes in whatever language occurs to me first, sometimes English but often my mother tongue Dutch, and the sentences and structure don’t have to be publication-ready, only written well enough that I can still understand them weeks, months or years later. Even so, this takes a lot of extra time and effort, and it slows down the process of finding a solution.
But it is worth it: it has happened several times that I later ran into the same or a similar problem, and remembered I had solved it before, but not how, and especially not the details. Then of course having those notes is very useful.
Originally I made those notes in MSWord, later Libreoffice, with file name extensions `rtf`, `doc`, `docx`, `odt`. The disadvantage is that it is hard to find things in the documents.
I remember seeing a way, in MSWord, to search for a text in multiple files, but I never managed to understand how that worked or what the idea was, and I never found anything.
At some point I stopped making such word processor documents, and used simple HTML files instead, viewable in a browser, written in a text editor. The same as what I do on the web.

The big plus: HTML is text, and text is searchable by `grep`. Using Unix shells’ wildcards, that includes searching more than one file at a time, and with the `-r` command line option, `grep` even does recursive searches: it traverses directories and subdirectories, and searches all the files in them.
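For example, with a hypothetical search word:

```
# Shell wildcards: search all HTML files in the current directory
grep 'searchword' *.htm *.html

# The -r option: search every file under the current directory and below
grep -r 'searchword' .
```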
Of course that can be time-consuming: in a plain recursive search, `grep` will even include binary files like `jpg` and `mp3`. A wasted effort. GNU `grep` does have an `--include=` option, however, to limit a recursive search to certain file name extensions like `htm` or `html`.
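A sketch of that, again with a hypothetical search word:

```
# GNU grep: restrict the recursive search to HTML files only
grep -r --include='*.htm' --include='*.html' 'searchword' .
```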
However, a recursive search for a constant string over the files of my website (more than 3000 files, over 2200 of them HTML, over 200 megabytes in total) takes only a third of a second on my not extremely fast laptop. Negligible. Unbelievable but true.
Modern Linux systems fully support Unicode, at least in the UTF-8 encoding. So instead of English, you can also use Portuguese, Greek, Arabic, Hebrew, Hindi, you name it, for your search words. Probably also Japanese and Chinese, but I never tried that.
This is true of at least Linux Mint (18.3 and up), Lubuntu (20.04, 22.04) and Alpine (currently 3.10, but many before that too).
To use that, you need to be able to type the search argument, containing the special characters for the language, or you could copy it from somewhere, then paste it into the command line in the terminal program.
But what if you don’t have the proper keyboard support, because the necessary layout isn’t there or you don’t want to install it? Or you want to investigate something that even a specialised keyboard layout cannot handle?
I’ll give an example. It is somewhat trivial, because it is simplified from the real problem that made me look for a solution; that real problem I will describe in the next chapter.
The letter á, an a with an acute accent, can be written directly as a self-contained Unicode character. It can be found in the code chart for the Latin-1 Supplement: the code, or Unicode scalar, is 00E1 hexadecimal. But an á can also be made as a composite character, consisting of a normal letter ‘a’ (hex 61), followed by hex 301, “combining acute accent”, from the Combining Diacritical Marks.
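A quick way to see the two variants side by side, assuming a `bash` whose `printf` built-in understands `\u` escapes (bash 4.2 and later):

```
printf '\u00E1\n'    # precomposed: U+00E1, Latin small letter a with acute
printf 'a\u0301\n'   # decomposed: 'a' (U+0061) followed by U+0301 combining acute accent
```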
Unicode.org has rules and recommendations for what to use and how to handle variant forms, which are quite complicated: see Unicode Normalization Forms and Canonical Equivalence in Applications. I’m not sure I fully understand them, but I think the bottom line is that using á directly as a character is better than adding a separate diacritic.
So now suppose you want to make an HTML document, or a whole site, consistent, and check whether the variant with the separate combining diacritic occurs anywhere. How to grep for it?
MSWord has a nice facility for entering any Unicode character, and I find Libreoffice Writer now also has it (the version currently installed in my Lubuntu 22.04 is 7.6.7.2): you enter the hexadecimal digits of the Unicode scalar, followed by Alt-x. So typing this: 61 Alt-x 301 Alt-x should produce a composite á. Except that it doesn’t: I get this character: ꌁ. Is it a mouse? No, it’s a YI SYLLABLE NZURX, whatever that is.

The trouble is that Libreoffice Writer first renders the letter a from 61 Alt-x, and then combines it with the 301 typed after it, to make the Unicode code point A301. Not my intention.

The better way to do it: first type 301 Alt-x; this produces the combining acute accent, with a circle to indicate where the character to combine it with should appear. Move the cursor one position to the left and type the letter a. Et violá: á.
Now you can copy this, and paste it into the search command, to get this:

```
grep -r á .
```
In case that doesn’t work (and it may be hard with right-to-left scripts, as in my real situation, described below), there is another helpful trick, which I found here. Someone there recommended typing `printf '\xE2\x98\xA0'` on the command line to get a skull and crossbones symbol: ☠.
You can also use that to compose a grep command containing special characters. But for that you need the actual bytes of the Unicode encoding, that is, the underlying UTF-8.
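For instance, to search the current directory tree for that same symbol (a toy example):

```
# Search recursively for the skull and crossbones (U+2620, UTF-8 bytes e2-98-a0)
grep -r `printf '\xE2\x98\xA0'` .
```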
By the way, from my earliest encounters with Unix, starting in 1985, I do remember `printf` as part of the programming language C, or more correctly of its `stdio` library. In terms of manual pages, that is `printf(3)`, invoked as `man 3 printf`. I do not remember a `man 1 printf`. Perhaps it is a later extension, or a built-in of `bash`. Or maybe it was already there back then and I have forgotten about it. Anyhow, it works in my bash 5.2.15 under Lubuntu 23.10.
So when you know the Unicode scalar, how do you find out what the UTF-8 for that is? You may want to try my utf8cntx or utfcntxt for that.
In MSWord or Libreoffice Writer, save the text you obtained previously as a text file, encoded as UTF-8 (that is usually the default), and analyse that small text file with my tools. From that you can learn that the Unicode scalars 61-301 are represented in UTF-8 as 61-cc-81.
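If you don’t have my tools at hand, a standard utility like `od` can show the bytes as well; a minimal sketch:

```
# Show the UTF-8 bytes of a decomposed á ('a' plus combining acute accent)
printf 'a\u0301' | od -An -tx1
# prints: 61 cc 81
```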
That makes the search command you need:

```
grep -r `printf '\x61\xcc\x81'` .
```
It works! Enclosing the `printf` command in backquotes (or backticks) ` and ` means: execute what’s in between first, and use the result as a parameter for the other command. This is one of the things that in 1985 made me fall in love with Unix: so smart, so handy.
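As an aside, the same command can also be written with the newer `$( )` form of command substitution, which is easier to read and to nest:

```
grep -r "$(printf '\x61\xcc\x81')" .
```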
The actual situation that led me to discover these tricks, and made me write this article, was a little bit more complicated. So now I’ll tell the real story.
I have `cron` periodically create a completely new index for my search engine siworin, which can optionally create a list of all the unique words that are used on my website. I compare that list against the previous run, which results in words that never occurred before, but now do. See chapter 5, the siworin algorithms, item 8.
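A minimal sketch of such a comparison with a standard tool, assuming two sorted word lists with hypothetical names `words.prev` and `words.new`:

```
# comm -13 prints only the lines that occur in the second file alone,
# i.e. the words that are new since the previous run
comm -13 words.prev words.new
```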
This is a great additional way to detect spelling errors: when I find an error, I correct it, so nearly all the words in the list are correct. That means that when I make a fresh spelling mistake, it is likely to appear among the newly used words that I have `cron` e-mail to me. Not all new words are errors, but some or many are.
On 30 May, while working on an article, two seemingly identical Yiddish words (of Polish origin) appeared in the e-mail I automatically received from my website:

בלאנדזשען
בלאָנדזשען
In fact they are not identical; they couldn’t be, because then the word would appear in the list only once. Here in the web article, the difference is visible: an unmarked alef vs. alef qamats, א vs. אָ. But strangely, in nano versions 7.2 (Lubuntu 23.10) and 8.0 (Alpine Linux 3.20) they looked the same: the qamats diacritic was invisible.
Using my utf8cntx or utfcntxt I found that the sign was encoded as a regular Hebrew script alef, followed by a qamats diacritic. It is also possible to encode this character (used in Yiddish to write the letter and sound o) as a single Unicode sign, from the range of presentation forms: code FB2F (hex), Hebrew letter alef with qamats, אָ, אָ.
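Assuming again a `printf` that understands `\u` escapes, the two variants can be produced on the command line like this:

```
printf '\uFB2F\n'        # presentation form: U+FB2F, Hebrew letter alef with qamats
printf '\u05D0\u05B8\n'  # decomposed: U+05D0 alef followed by U+05B8 qamats
```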
What is better? Which way is more usual? Whatever I do, I want it to be consistent site-wide. So I wanted to find what I had used in earlier articles, without having to remember where those are, because that is something computers are better at than humans.
Performing the steps described in the previous chapter, I found out what the Unicode scalars are, their UTF-8 encodings, and the commands to look up the encoded characters. In the process, I got this output from utfcntxt:
```
000000 0xd7-90-0a-d7 0x0005d0: .............
000002 0x0a-d7-90-d6 0x00000a: ............
000003 0xd7-90-d6-b8 0x0005d0: ...........
000005 0xd6-b8-0a-ef 0x0005b8: ..........
000007 0x0a-ef-ac-af 0x00000a: .........
000008 0xef-ac-af-0a 0x00fb2f: ........
000011 0x0a-0a-d7-90 0x00000a: .......
000012 0x0a-d7-90-d6 0x00000a: ......
000013 0xd7-90-d6-b7 0x0005d0: .....
000015 0xd6-b7-0a-ef 0x0005b7: ....
000017 0x0a-ef-ac-ae 0x00000a: ...
000018 0xef-ac-ae-0a 0x00fb2e: ..
000021 0x0a-..-..-.. 0x00000a: .
```
| What is it? | Scalars | UTF-8 | Command |
|---|---|---|---|
| Plain alef | 05d0 | d7-90 | ``grep -r `printf '\xd7\x90'` .`` |
| Alef qamats with diacritic | 05d0-05b8 | d7-90 d6-b8 | ``grep -r `printf '\xd7\x90\xd6\xb8'` .`` |
| Alef qamats as presentation form | fb2f | ef-ac-af | ``grep -r `printf '\xef\xac\xaf'` .`` |
| Alef patah with diacritic | 05d0-05b7 | d7-90 d6-b7 | ``grep -r `printf '\xd7\x90\xd6\xb7'` .`` |
| Alef patah as presentation form | fb2e | ef-ac-ae | ``grep -r `printf '\xef\xac\xae'` .`` |
The result of the searches: I have indeed been inconsistent, probably as a result of quoting, i.e. copying, Yiddish text from other websites. The method with the diacritic is the most frequent; the presentation forms are rare. How does the New York Yiddish newspaper Forward (פֿאָרווערטס, Forverts) do it? I’d prefer to follow that, as they use strict YIVO spelling, AFAIK.
From a sample article, I find that they do not use presentation form, only diacritics. So I’ll correct the places on my site where I didn’t.
By the way, I also found that Forverts uses the ligatures 05F0, 05F1 and 05F2 inconsistently: the ligatures often occur, but so do the combinations of the separate letters that the ligatures connect:

05F0 or 05D5-05D5
05F1 or 05D5-05D9
05F2 or 05D9-05D9
So be it.
There was a problem with grep’s coloring (American spelling; colouring if you prefer) of search results. I will describe that in a separate article.
(This chapter added 3 June 2024.)
Editing such sequences with a text editor, or even a word processor, can be hard. I renamed a `*.htm` file to `*.htm.txt` to make Libreoffice treat it as a text file, without interpreting and ruining the HTML. Then I could enter exact Unicode scalars using the trick with Alt-x. However, `grep` still found the presentation forms after that, not the version with diacritics that I had entered. Would Libreoffice automatically change that? That would be bad. However, I’m not entirely sure what happened.
Then I had the wild idea of combining `sed` with `printf`, like this:

```
sed s/`printf '\xef\xac\xaf'`/`printf '\xd7\x90\xd6\xb8'`/g pres.htm > diac.htm
```
It looks weirdly complicated. But in fact it is just the basic and well-known global sed substitution, `sed s///g`, with two `printf` commands inserted between the slashes, each between backticks ` and `.
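The same substitution can be written with `$( )` command substitution, quoted so that the shell passes each expansion as a single word:

```
sed "s/$(printf '\xef\xac\xaf')/$(printf '\xd7\x90\xd6\xb8')/g" pres.htm > diac.htm
```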
And it works! At least it does in my situation: Lubuntu 23.10, GNU `bash` version 5.2.15(1), GNU `sed` version 4.9; `printf` does not seem to have a version number of its own.
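A quick way to check the result is to count the remaining occurrences of the presentation form in the converted file:

```
# Expect 0: the presentation form should be gone
grep -c `printf '\xef\xac\xaf'` diac.htm
```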
Copyright © 2024 by R. Harmsen, all rights reserved.