This explanation in Dutch

Unicode search in Google

13 October 2001

I found that in Google it is possible to use search words coded in Unicode, in non-Latin alphabets. With other search engines, such as AltaVista or Northern Light, this did not work.

The simplest way of course would be to just type such a search word into Google's search word field, and then start the search. That probably does work, but I didn't test it, for lack of keyboard definitions that produce such texts in different alphabets.

What did succeed was to copy text from a browser or text processor, and then paste that into Google. Google actually finds texts in the language in question, that indeed contain the sought word!
Whether this also works for Russian, Ukrainian etc., with sites encoded in the more usual KOI8, I don't know. Does anybody else know?
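
The KOI8 question comes down to bytes: the same Cyrillic word is a different byte sequence in UTF-8 than in KOI8-R, so a page and a search engine have to agree on the encoding before a match is possible. Below is a minimal Python sketch of that difference, assuming the search word travels in a percent-encoded query string. The host name and the `q` parameter in the URL are assumptions for illustration only, not something tested here.

```python
from urllib.parse import quote, unquote, urlencode

word = "Хрущёв"

# The same six letters as raw bytes in two encodings.
utf8_bytes = word.encode("utf-8")    # two bytes per Cyrillic letter
koi8_bytes = word.encode("koi8_r")   # one byte per Cyrillic letter

# Percent-encoded forms, as they would appear inside a URL.
utf8_quoted = quote(word, encoding="utf-8")
koi8_quoted = quote(word, encoding="koi8_r")

# Hypothetical search URL; host and parameter name are assumptions.
url = "https://www.google.com/search?" + urlencode({"q": word}, encoding="utf-8")

print(len(utf8_bytes), len(koi8_bytes))
print(utf8_quoted)
print(koi8_quoted)
print(url)

# Decoding only gives the word back when the right encoding is assumed.
print(unquote(koi8_quoted, encoding="koi8_r") == word)
```

The two quoted strings share no bytes, which is why text stored in one encoding cannot be found by a byte-for-byte search in the other without conversion.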

Below are some examples to try it out with. Mark and copy a name from the left column, surf to Google, paste the name into the search word field, and start the search.

בראשית ברא אלהים את השמים ואת הארץ. The first verse of the bible, in Hebrew: Bereshit bara elohim et hashamayim wet haretz -
In the beginning God created heaven and earth.
فصيح  An Arabic word, faṣīḥ in transcription, which means "in pure, good Arabic".
بِسـْمِ اٱلّٰهِ bismillahi - In the name of God
٠١٢٣٤٥٦٧٨٩
۰۱۲۳۴۵۶۷۸۹
٢٠٠٦
۲۰۰۶
Arabic Arabic digits (the digits used with the Arabic script), as opposed to the Latin Arabic digits 0123456789. The second and fourth lines show the Eastern Arabic-Indic digits.
Хрущёв Name of a former Soviet leader: Chrushchov (Khrushchev).
Milošević Name of a former Yugoslav and Serbian president: Milosevic, written in the Latin alphabet with diacritics, as used in Croatia, Bosnia and Serbia.
Милошевић The same name in Serbian Cyrillic spelling.
Note that this Cyrillic alphabet is very similar to the one used for Russian, but not identical in every detail. For example, the ћ is not used in Russian, where it would be written ть.
Krstić, Крстић Another name from that part of the world: Krstic, again in both alphabets that are used in ex-Yugoslavia.
Łódź The Polish city Lodz (sounds something like Wootch), as the Poles themselves write it.
Kettő, KETTŐ The Hungarian word for "two", and again in all capitals.
Bucureşti, Chişinău, Constanţa, Timişoara, Buzău, Nicolae Ceauşescu or better, with comma below instead of cedilla: București, Chișinău, Constanța, Timișoara, Nicolae Ceaușescu. Some Romanian (Română) and Moldavian (city and other) names.
Ĉ Ĝ Ŝ Ĵ Ĥ Ŭ, ĉ ĝ ŝ ĵ ĥ ŭ. Special characters for Esperanto.
Antonín Dvořák The famous Czech composer
Ů, ů Special characters for Czech: u with ring.
Inteŀligent, INTEĿLIGENT
Intel·ligent, INTEL·LIGENT
This Catalan word (meaning, you guessed it, "intelligent") is written with a dot between the l's, to indicate that it has a long normal l sound, not a single palatal l, which (as in Castilian Spanish) is written ll. There are two ways to write this: with a special character "l plus dot", or with normal l's and a special dot in between.
IJsland in de ijzertijd
IJsland in de ijzertijd
"IJsland in de ijzertijd". This is Dutch for "Iceland in the iron age". What's special about it is that the I and J are capitalised together. Some say it is a single letter, some say they're two. Unicode has special characters for them, hex 0132 Ĳ and 0133 ĳ. But they are hardly ever used, because as you see, the combination of i and j looks the same in most fonts, except that the kerning is different. Question (don't look at the html for the moment): where are the special characters, above or below?
Μίκης Θεοδωράκης The famous Greek composer Mikis Theodorakis
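
Several rows above contain characters that look alike but are distinct code points: s with cedilla versus s with comma below, the Catalan "l plus dot" versus l followed by a separate dot, and the Dutch IJ character versus the letters I and J. The following is a minimal Python sketch, using only the standard `unicodedata` module, for checking which is which. The helper name `describe` is made up for this example, and the characters are written as escapes so that they cannot be confused in the source.

```python
import unicodedata

def describe(text):
    """Return (character, 'U+XXXX', Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

# Cedilla versus comma below: similar glyphs, different code points.
s_cedilla = describe("\u015F")[0]
s_comma = describe("\u0219")[0]

# NFKC normalization replaces compatibility characters by their plain
# equivalents, so both spellings of a word compare equal afterwards.
catalan_same = unicodedata.normalize("NFKC", "Inte\u0140ligent") == "Intel\u00B7ligent"
dutch_same = unicodedata.normalize("NFKC", "\u0132sland") == "IJsland"

# Both rows of digits from the table convert to ordinary integers.
year_arabic = int("\u0662\u0660\u0660\u0666")    # Arabic-Indic digits
year_eastern = int("\u06F2\u06F0\u06F0\u06F6")   # Eastern (extended) Arabic-Indic digits

print(s_cedilla)
print(s_comma)
print(catalan_same, dutch_same)
print(year_arabic, year_eastern)
```

Calling `describe` on text copied from either of the two Dutch lines would answer the question about where the special characters are, provided the copy keeps the original code points.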

If in the left column you don't see the correct characters, but rather some squares or the like, then your browser isn't suitable for this. Microsoft built Unicode support into Internet Explorer 5.0 and 5.5 (and also Word 97) under Win98, even though Windows 98 itself lacks practically all support for it.
The Windows API, which is usable for all 32-bit Windows varieties, offers a simple way to have "wide character" versions of functions like SetDlgItemText used everywhere. But even then it still doesn't work under Windows 95 and 98. Whether this is technical inability, or yet another attempt to make life harder for competing browser manufacturers - Opera 5 and Netscape 3 don't support Unicode - I don't know.
It is fine that Word 97 and Explorer do support Unicode, but technically speaking it is a bad solution, because this is typically something that belongs in an operating system, not in every single application program.

For lack of applications and keyboard definitions for directly typing text in non-Latin alphabets - they do exist, but I don't have them, and I normally don't need them - I created the search words in the table above as follows:

A code such as &#263; (which yields ć) resembles the so-called entities, like &uuml; for a ü (u with umlaut or diaeresis). Another example is &#369;, in which 369 is the decimal code for the Hungarian ű (u with a "long" umlaut, or double acute accent) in Unicode; this character does not exist in the Windows 1252 character set.
The characters & and ; enclose the encoding. If there is a number sign # it means that a number will follow, not a symbolic code like amp for ampersand or gt for "greater than". Such a code is decimal by default, but by putting an x before it you can specify that the number should be interpreted as hexadecimal.
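
The decimal and hexadecimal notations described above can be tried out with a short Python sketch. The helper names `to_decimal_refs` and `to_hex_refs` are made up for this illustration. They leave ASCII untouched, so a literal & in the input would still need escaping separately.

```python
import html

def to_decimal_refs(text):
    """Replace every non-ASCII character by a decimal code like &#263;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

def to_hex_refs(text):
    """Same, but hexadecimal: note the x after the number sign."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

name = "Krsti\u0107"                   # Krstić
decimal_form = to_decimal_refs(name)   # pure ASCII, safe in any HTML page
hex_form = to_hex_refs(name)

# html.unescape decodes symbolic, decimal and hexadecimal codes alike.
decoded = html.unescape("&uuml; &#369; &#x171; &amp; &gt;")
round_trip = html.unescape(decimal_form)

print(decimal_form)
print(hex_form)
print(decoded)
print(round_trip == name)
```

Since 369 decimal equals 171 hexadecimal, the second and third codes in the `decoded` example produce the same character.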

It is also possible to write HTML pages in UTF-8 (with a proper header, of course), which eliminates the need for entities.
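
As a minimal sketch of that alternative, the Python fragment below builds a small page whose header declares UTF-8 and then encodes it to bytes. The template, the function name `build_utf8_page` and the sample text are assumptions for illustration. A real server may additionally send the character set in its HTTP headers.

```python
PAGE_TEMPLATE = """<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>{title}</title>
</head>
<body>
<p>{body}</p>
</body>
</html>
"""

def build_utf8_page(title, body):
    """Return the page as UTF-8 bytes, with a header declaring the encoding."""
    return PAGE_TEMPLATE.format(title=title, body=body).encode("utf-8")

# The non-Latin text goes into the page directly, without any entities.
page_bytes = build_utf8_page("Unicode test", "Милошевић, Łódź, Kettő")

print(len(page_bytes))
print(page_bytes.decode("utf-8"))
```

Writing `page_bytes` to a file opened in binary mode gives a page that a Unicode-capable browser should display without any numeric codes in the source.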


See also A Unicode test page

