Unicode search in Google

13 October 2001

I found that in Google it is possible to use search words coded in Unicode, in non-Latin alphabets. With other search engines, such as AltaVista or Northern Light, this didn't work.

The simplest way of course would be to just type such a search word into Google's search word field, and then start the search. That probably does work, but I didn't test it, for lack of keyboard definitions that produce such texts in different alphabets.

What did succeed was to copy text from a browser or text processor, and then paste that into Google. Google actually finds texts in the language in question that indeed contain the sought word!
Whether this also works in Russian, Ukrainian etc., with sites encoded in the more usual KOI8, I don't know. Does somebody else know?
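
To illustrate why the page encoding might matter here, the following minimal Python sketch shows the same Russian word as bytes in KOI8-R and in UTF-8. It only demonstrates that the two byte sequences differ; it says nothing about how Google actually treats KOI8 pages.

```python
# The same word gives entirely different bytes in KOI8-R and in UTF-8,
# so a search engine has to know (or guess) the encoding of each page.
word = "\u0425\u0440\u0443\u0449\u0451\u0432"  # the Russian name from the table below

koi8_bytes = word.encode("koi8_r")   # one byte per Cyrillic letter
utf8_bytes = word.encode("utf-8")    # two bytes per Cyrillic letter

print(koi8_bytes.hex(" "))
print(utf8_bytes.hex(" "))

# Decoding with the matching codec recovers the original word.
assert koi8_bytes.decode("koi8_r") == word
assert utf8_bytes.decode("utf-8") == word
```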

Below are some examples to try it out with. Mark and copy a name from the left column, surf to Google, paste the name into the search word field, and start the search.

בראשית ברא אלהים את השמים ואת הארץ.	The first verse of the bible, in Hebrew: Bereshit bara elohim et hashamayim weët haäretz - In the beginning God created heaven and earth.
فصيح	An Arabic word, faṣīḥ in transcription, which means "in pure, good Arabic".
بِسـْمِ اٱلّٰهِ	bismillahi - In the name of God
٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ٢٠٠٦ ۲۰۰۶	Arabic Arabic digits, as opposed to Latin Arabic digits. Next: Eastern Indic-Arabic digits.
Хрущёв	Name of a former Soviet leader: Khrushchev.
Milošević	Name of a former Yugoslav and Serbian president: Milosevic, written in the Latin alphabet with diacritics, as used in Croatia, Bosnia and Serbia.
Милошевић	The same name in Serbian Cyrillic spelling. Note that this Cyrillic alphabet is very similar to the one used for Russian, but not equal in all details. For example the ћ is not used in Russian, but would be written ть.
Krstić, Крстић	Another name from that part of the world: Krstic, again in both alphabets that are used in ex-Yugoslavia.
Łódź	The Polish city Lodz (sounds something like Wootch), as the Poles themselves write it.
Kettő, KETTŐ	The Hungarian word for "two", and again in all capitals.
Bucureşti, Chişinău, Constanţa, Timişoara, Buzău, Nicolae Ceauşescu or better, with comma below instead of cedilla: București, Chișinău, Constanța, Timișoara, Nicolae Ceaușescu.	Some Romanian (Română) and Moldavian (city and other) names.
Ĉ Ĝ Ŝ Ĵ Ĥ Ŭ, ĉ ĝ ŝ ĵ ĥ ŭ.	Special characters for Esperanto.
Antonín Dvořák	The famous Czech composer
Ů, ů	Special characters for Czech: u with ring.
Inteŀligent, INTEĿLIGENT Intel·ligent, INTEL·LIGENT	This Catalan word (meaning, you guessed it, "intelligent") is written with a dot between the l's, to indicate that it has a long normal l sound, not a single palatal l, which (as in Castilian Spanish) is written ll. There are two ways to write this: with a special character "l plus dot", or with normal l's and a special dot in between.
Ĳsland in de ĳzertĳd IJsland in de ijzertijd	"IJsland in de ijzertijd". This is Dutch for "Iceland in the iron age". What's special about it is that the I and J are capitalised together. Some say it is a single letter, some say they're two. Unicode has special characters for them, hex 0132 Ĳ and 0133 ĳ. But they are hardly ever used, because as you see, the combination of i and j looks the same in most fonts, except that the kerning is different. Question (don't look at the html for the moment): where are the special characters, above or below?
Μίκης Θεοδωράκης	The famous Greek composer Mikis Theodorakis
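
Several rows above contain characters that look alike but have different codes (cedilla versus comma below, the Dutch IJ, the Catalan l with dot). One way to tell them apart without reading the HTML is to print the code points. This is a minimal sketch using Python's standard unicodedata module; the helper name describe is just an illustration.

```python
import unicodedata

def describe(text):
    """Return (character, 'U+XXXX', Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

# Romanian: s with cedilla (U+015F) versus s with comma below (U+0219)
for row in describe("\u015f\u0219"):
    print(row)

# Dutch: the single character IJ (U+0132) versus the two letters I and J
for row in describe("\u0132IJ"):
    print(row)

# Catalan: l with middle dot (U+0140) versus l followed by a middle dot (U+00B7)
for row in describe("\u0140l\u00b7"):
    print(row)
```

Pasting a word copied from the table into describe() shows which variant it actually contains.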

If in the left column you don't see the correct characters, but rather some squares or something similar, then your browser isn't suitable for this. Microsoft built Unicode support into MS Explorer 5.0 and 5.5 (and also Word 97) under Win98, even though Windows 98 itself has practically no support for it.
The Windows API, which is usable for all 32-bit Windows varieties, offers a simple way to have "wide character" versions of functions like SetDlgItemText used everywhere. But even then it still doesn't work under Windows 95 and 98. Whether this is technical inability, or yet another attempt to make life harder for competing browser manufacturers (Opera 5 and Netscape 3 don't support Unicode), I don't know.
It is fine that Word 97 and Explorer do support Unicode, but technically speaking it is a bad solution, because this is typically something that belongs in an operating system, not in every single application program.

For lack of applications and keyboard definitions for directly typing text in non-Latin alphabets - they do exist, but I don't have them, and I normally don't need them - I created the search words in the table above as follows:

Find the codes for the needed letters in the Unicode code charts. For instance, the c with acute accent at the end of many Yugoslav names has hexadecimal code 0107.
Using a text editor, put this code in an html file as follows: &#x0107;. The name Krstić then becomes Krsti&#x0107;.
See also ways to encode Unicode.
Open this file (directly from the local hard disk, or after transferring it to the internet) in Explorer 5. The text now appears with the correct characters.
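
The manual steps above can also be automated. The following is a minimal sketch in Python, not the method used for this page; the function name to_hex_references is made up for the illustration. It replaces every non-ASCII character by its hexadecimal code in the notation described above.

```python
def to_hex_references(text):
    """Replace each non-ASCII character by a hexadecimal character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )

name = "Krsti\u0107"              # Krstić; \u0107 is the c with acute accent
print(to_hex_references(name))    # Krsti&#x0107;
```

Python's built-in name.encode("ascii", "xmlcharrefreplace") does much the same, but produces the decimal form (&#263; for this letter).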

A code such as &#x0107; resembles the so-called entities, like &uuml; for an ü (u with umlaut or diaeresis), and decimal codes like &#369;, in which 369 is the decimal Unicode code (hexadecimal 0171) for the Hungarian ű (u with a "long" umlaut, or double acute accent), a letter that is absent from character set Windows 1252.
The characters & and ; enclose the encoding. If there is a number sign # it means that a number will follow, not a symbolic code like amp for ampersand or gt for "greater than". Such a code is decimal by default, but by putting an x before it you can specify that the number should be interpreted as hexadecimal.
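
The three notations can be compared with a few lines of Python. This sketch relies on html.unescape from the standard library, which decodes named, decimal and hexadecimal references; the variable names are only for illustration.

```python
import html

decimal_form = html.unescape("&#369;")     # decimal 369
hex_form = html.unescape("&#x171;")        # hexadecimal 171, the same number
named_form = html.unescape("&uuml;")       # symbolic name for u with umlaut
symbols = html.unescape("&amp; &gt;")      # symbolic names amp and gt

print(decimal_form, hex_form, named_form, symbols)

# Decimal 369 and hexadecimal 171 denote the same character.
assert decimal_form == hex_form == chr(369) == chr(0x171)
```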

It is also possible to write HTML pages in UTF-8 (with a proper header, of course), which eliminates the need for entities.
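
As a rough sketch of that last option, the Python snippet below writes a small test page as UTF-8. The file name utf8-test.html and the page content are invented for the example, and the meta element is one common way to declare the encoding, assumed here to be what "a proper header" amounts to.

```python
import tempfile
from pathlib import Path

# Example page with the characters typed directly, without entities.
page = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 test</title>
</head>
<body>
<p>Krstić, Łódź, Хрущёв</p>
</body>
</html>
"""

# The bytes on disk must really be UTF-8, matching the declared charset.
data = page.encode("utf-8")
path = Path(tempfile.gettempdir()) / "utf8-test.html"  # hypothetical file name
path.write_bytes(data)

print(path, len(page), "characters,", len(data), "bytes")
```

The byte count is larger than the character count because each non-ASCII letter takes more than one byte in UTF-8.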