Pattern searching

7–17 March 2020, translated from Interlingua by the author.


What is this?

This is a user manual for a digital search facility for various dictionaries of Interlingua. It explains the possible ways to use the facility.

Originally, included dictionaries were Dutch-Interlingua (nl>ia) by Piet Cleij, Interlingua-Dutch (ia>nl) by the same author, and the Interlingua-English Dictionary (IED) by Alexander Gode (ia>en). Meanwhile, several others have been added.

Improvement 27 February 2020: The dictionary that a result came from is indicated at the start of the result line, unless you disable that by ticking the checkbox ‘Cela indication del dictionario’ (Hide indication of dictionary).

What can you do with it?

You can find words that are translated into the other language(s). Those are the dictionary lemmas. Because the dictionary materials are organised as simple text lines, you can also find words that occur in the translations, in examples and expressions, and in grammatical categories, special pronunciation indications, etc.

You can search on simple words or parts of words, but also on patterns, the so-called ‘regular expressions’. Those search patterns can vary from rather simple to extremely intricate. This will be clarified below, by many examples and not much text.

Where does the material come from?

The idea for the facility did not come from me, Ruud Harmsen, but from Paul Denisowski, who offers an interface similar to mine, for many languages including Interlingua, in combination with English.

The material of the dictionaries by Piet Cleij that I use, I derived from these Wikias, which contain the material that is also in the printed books.

More about the methods of extraction and derivation, and about copyrights is here.

For the IED I use a copy of the text files that is in various places on the internet.

The user can tick checkboxes to include one or more dictionaries in the search, in any desired combination. The order of the dictionaries is however fixed, and corresponds to that of the checkboxes on the screen. If nothing is selected, the IED is selected by default.
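The selection rule just described can be sketched as follows. This is only an illustration, not the actual server code: the function name is hypothetical, and the order of the list is an assumption (it follows the table of available dictionaries, since the on-screen checkbox order is not stated here).

```python
# Dictionary codes as listed in the table of available dictionaries.
# ASSUMPTION: the fixed on-screen order equals the order of that table.
DICTIONARY_ORDER = [
    "ia-en", "en-ia (G&B)", "en-ia (S&G)", "ia-nl", "nl-ia",
    "fr-ia", "de-ia", "es-ia", "eo-en",
]

def dictionaries_to_search(ticked):
    """Return the ticked dictionaries in the fixed order; default to the IED."""
    selected = [code for code in DICTIONARY_ORDER if code in ticked]
    return selected or ["ia-en"]

print(dictionaries_to_search({"nl-ia", "ia-nl"}))
print(dictionaries_to_search(set()))
```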

Clarification of available dictionaries, added 22 May 2019:

code | source language | target language | author, source
ia-en (IED) | Interlingua | English | Interlingua-English Dictionary, Alexander Gode
en-ia (G&B) | English | Interlingua | Appendix II of the Interlingua Grammar, Alexander Gode & Hugh Blair
en-ia (S&G) | English | Interlingua | Brian Sexton & Frank Peter Gopsill
ia-nl | Interlingua | Dutch | Piet Cleij
nl-ia | Dutch | Interlingua | Piet Cleij
fr-ia | French | Interlingua | Piet Cleij. Based on material from his legacy. See this explanation (in Interlingua only).
de-ia | German | Interlingua | Fandom / Wikia, André Schild & Helmut E. Ruhrig
es-ia | Spanish | Interlingua | Fandom / Wikia
eo-en | Esperanto | English | Paul Denisowski’s Esperanto page

What are patterns?

Search patterns, or ‘regular expressions’, are a very powerful method – that can be at times complicated and difficult to understand – for finding various texts without specifying a separate search argument for each variant you want to find.

Wikipedia has explanations of all the details of regular expressions, in all the source languages of Interlingua (it, es, pt, en, fr), but not in Interlingua itself. Who will volunteer to combine this large amount of existing material, and write a good article for Wikipedia in Interlingua?

See also regular-expressions.info.

Here I will not give theoretical explanations of regular expressions, but instead a lot of examples.

Examples

A simple word

With this search in Dutch, and this one in English, I found the synonyms (or near synonyms?) ‘cercar’, ‘recercar’ and ‘querer’, the last one of which I promptly used as the title of the Interlingua original of this manual, “Querer con patronos”, ‘To search with patterns’, or ‘Pattern searching’.

Lemmas only (Solo entratas)

Searching like this often isn’t ideal, because it finds too much! For example, the search with ‘zoeken’ mentioned above also finds ‘bezoeken’ (to visit) and ‘onderzoeken’ (to examine, to investigate), words that perhaps have a certain semantic and etymological relationship with ‘zoeken’, but nevertheless are very different.

So a method is needed to limit the search to only one word, like the Dutch word ‘zoeken’. We can achieve this by preceding the search argument with the symbol ‘^’, like this.

The ^ symbolises the start of the line (in the same way as how $ indicates the end of it), and because the lines of the dictionaries normally start with the lemma, i.e. the word that the dictionary translates or explains, by searching in this manner, we find only those.
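The effect of the ^ anchor can be tried out with a small sketch. It uses Python’s re module as a stand-in for egrep, and the three dictionary lines are invented for illustration; they are not copied from the real dictionary files.

```python
import re

# Invented sample lines; each starts with its lemma.
lines = [
    "zoeken cercar, querer",
    "bezoeken visitar",
    "onderzoeken examinar, investigar",
]

def search(pattern, lines):
    """Return the lines in which the pattern matches anywhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

unanchored = search("zoeken", lines)   # matches anywhere in the line
anchored = search("^zoeken", lines)    # matches only at the start of the line
print(unanchored)
print(anchored)
```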

Option ‘Lemmas only’ (solo entratas)

It isn’t comfortable if you have to type that circumflex accent (^) at the start of the search text by hand. For one thing, you’d have to know where the symbol is located on your keyboard – that can vary a lot with different lay-outs. (The ^ is above the 6 on mine, so I press the shift key (⇧) together with the digit 6.)

On some keyboard lay-outs, e.g. the ‘US International’ that is used a lot in the Netherlands, ^ is a dead key, used to create circumflexed letters like â, ê, î, ô and û. Therefore, to get a ^ by itself, you have to type ^ followed by a space.

To remedy this inconvenience the interface screen has a checkbox “Solo entratas” (Lemmas only). If this checkbox is checked, the circumflex accent is automatically put before the search text, in case it wasn’t already there. Click here for an example. Note that the ^ is added (if necessary) only after the button Cerca (Search) is pressed.
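What the checkbox does to the search text could look roughly like this. The function name is hypothetical and the real interface may well implement it differently; the sketch only mirrors the behaviour described above.

```python
def lemmas_only(search_text):
    """Put '^' before the search text, unless it is already there."""
    if search_text.startswith("^"):
        return search_text
    return "^" + search_text

print(lemmas_only("zoeken"))
print(lemmas_only("^zoeken"))
```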

Parametrised URI

As you probably have noticed by now, the clickable examples in this manual open an extra tab, i.e. an extra page in the same instance of the web browser. The idea is that in one tab you read this manual, while in the other you can test the examples.

Using parameters in the URI (Uniform Resource Identifier), the search text and the checkboxes will be filled in advance. I won’t explain the parameters in full detail, because those who might want to know about them, will have to be people who are interested and skilled in information technology, so they can easily derive the specs from the URIs and HTML themselves.

22 April 2017: There is now also a link to call the parametrised URI that corresponds to the most recent search action.

The purest way to obtain that URI, including all the percent-encoding necessary for quoting it in fora etc., is by doing a ‘Copy Link Location’ in your browser. Then for example a square bracket [ in the search pattern will appear as %5B, a parenthesis ( as %28, etc. If however you click the link, the browser might change some details in its display of the URI.
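To see what percent-encoding does to a search pattern, Python’s standard urllib.parse.quote can be used. The base address and the parameter name in the last line are purely hypothetical; the real parameter names are best read from the URIs themselves.

```python
from urllib.parse import quote

pattern = "(che|[cq]u[ei])"

# safe="" makes quote() encode every reserved character.
encoded = quote(pattern, safe="")
print(encoded)

# HYPOTHETICAL address and parameter name, for illustration only.
print("https://example.org/search?pattern=" + encoded)
```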

So many words

An option similar to ‘Lemmas only’ (Solo entratas), but different, is “Parolas integre” (Whole words only). With that option ticked, you search not only from the beginning of the line (where the lemmas are), but in all of the line, including in translations and examples.

What is special about this option, is that only complete words are found, not longer words that contain the search word.
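A whole-word search can be imitated with word boundaries. This is a sketch under assumptions: it uses Python’s \b, whereas the interface may rely on a different mechanism, and the sample lines are invented.

```python
import re

# Invented sample lines.
lines = [
    "zoeken cercar",
    "bezoeken visitar",
    "huis een ~ zoeken cercar un casa",
]

def whole_word_search(word, lines):
    """Return lines that contain the word as a complete word."""
    regex = re.compile(r"\b" + re.escape(word) + r"\b")
    return [line for line in lines if regex.search(line)]

found = whole_word_search("zoeken", lines)
print(found)  # 'bezoeken' is skipped, the other two lines are kept
```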

Examples:

Uppercase letters

By default the program ‘egrep’, which behind the scenes on the web server does all the hard searching work, distinguishes lowercase and uppercase. So searching ‘spanje’ does not produce ‘Spanje’, the Dutch name of the country Spain. However, you can change this behaviour by ticking the checkbox “Alsi majusculas” (Also uppercase).

There is also another way to achieve this: using a bracket expression, like this. [Ss] means that either of these two characters may appear in this position for the line to be selected.

A similar example: if you know that the names of the people, language and country in the centre of Europe are something containing ‘german’, but not how to write them correctly concerning upper and lowercase, you can use these queries: german (with “Alsi majusculas” checked) or [gG]erman (without that option). However, the two queries are not identical: the first would also find GERMANO, GeRmAn, germaN, etc., if these were in the dictionary.
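The difference between the two queries shows up in a quick test. Here Python’s re.IGNORECASE flag stands in for the “Alsi majusculas” option (an assumption about equivalent behaviour), and the word list is made up.

```python
import re

words = ["german", "Germania", "GERMANO", "germaN"]

# Case-insensitive search: every spelling is accepted.
ignore_case = [w for w in words if re.search("german", w, re.IGNORECASE)]

# Bracket expression: only the first letter may vary in case.
bracket = [w for w in words if re.search("[gG]erman", w)]

print(ignore_case)
print(bracket)
```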

Characters between [ ]

Indicating an acceptable choice of letters enclosed in square brackets ([ ]) isn’t limited to two letters as in the previous examples. You can put any sequence of characters there, possibly including ranges. For example, [aeiouáàéëêïy] represents all the vowels, some accented, [a-z] is all lowercase, [a-zA-Z] all lowercase and uppercase (alternative: [[:alpha:]]), and [a-zA-Z0-9] all the alphabetic letters and digits (alternative: [[:alnum:]]). More possibilities are in Wikipedia.
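Two of these bracket expressions can be tried in Python as a rough illustration. Note that Python’s re module does not support the POSIX class names such as [[:alpha:]], so only the explicit lists and ranges are shown; the test strings are arbitrary.

```python
import re

# Every vowel (some accented) found in a word.
vowels = re.findall("[aeiouáàéëêïy]", "circonflexe")
print(vowels)

# Runs of letters and digits; punctuation and spaces separate them.
alnum_runs = re.findall("[a-zA-Z0-9]+", "IED, 1951")
print(alnum_runs)
```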

Finding variants

This facility is useful for finding words without already knowing how exactly they are written in the dictionary.

What was the name of the symbol ‘^’? Accento circonflex, circumflexe, circomflexa? In what languages? I always forget. Let’s ask the dictionaries: circ[ou][nm]flex[eao].
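How such a pattern covers several spellings at once can be checked with a short sketch. The candidate spellings below are guesses made up for the test, not dictionary content.

```python
import re

pattern = re.compile("circ[ou][nm]flex[eao]")

# Made-up candidate spellings.
candidates = ["circumflexe", "circonflexe", "circumflexo",
              "circomflexa", "circunflex"]

found = [word for word in candidates if pattern.search(word)]
print(found)  # 'circunflex' lacks the final vowel and is not found
```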

Is what is found in modern chips, in Dutch written with a ‘c’ or a ‘k’? Answer: both.

Does Interlingua have words like the Italian cui, qui, que, and if yes, what is their meaning? Ask the dictionaries.

Alternatives with |

The symbol ‘|’ (in the upper right of my keyboard, over the backslash ‘\’) indicates a choice, a logical ‘or’ condition. With this we can, as an example, extend the previous example with the Italian word ‘che’ (which doesn’t exist in Interlingua, but let’s assume we didn’t know that in advance):
(che|[cq]u[ei]).

Yet another complicated example: find all occurrences of the Dutch verb ‘stappen’, and of the noun ‘stap’ with the diminutive suffix ‘-je’, preceded by one of the prefixes in, op, uit, over, af and ver: (in|op|uit|over|af|ver)stap(pen|je).
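The alternation can be tested against a few Dutch words. The word list is chosen for illustration only; whether each word occurs in the dictionaries is not checked here.

```python
import re

pattern = re.compile("(in|op|uit|over|af|ver)stap(pen|je)")

words = ["instappen", "overstappen", "opstapje", "stappen", "afstap"]

found = [word for word in words if pattern.search(word)]
print(found)  # 'stappen' has no prefix, 'afstap' has no 'pen' or 'je'
```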

Repetition operators

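Regular expressions have operators to indicate how many times a character may be repeated: ? (zero or one time), * (zero or more times), + (one or more times), {n} (exactly n times), {n,} (n or more times) and {n,m} (at least n and at most m times).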

The repetition specification can help us find the correct orthography where doubled letters are concerned: app?el{1,2}ar finds ‘appellar’ but would also find ‘appelar’, ‘apellar’ and ‘apelar’ if they were present.

An alternative search would be ap+el+ar, which however is not identical, because that finds the word ‘appellar’ written not only with one or two, but also with three, four, etc. letters ‘p’ or ‘l’.

The repetition not only refers to characters, but also to classes ([[::]]) and groups of characters ([]), and to sequences between (). Example: (an){2} finds where ‘an’ is followed by another ‘an’: in the Dutch and Interlingua word ‘ananas’, and in the English, Dutch and Interlingua word ‘banana’. And in ‘lontanantia’. I somehow like that word.
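The three patterns from this section behave as follows in a small experiment. The misspellings are deliberately invented to show what each pattern accepts.

```python
import re

spellings = ["appellar", "appelar", "apellar", "apelar", "apppellar"]

# One or two 'p', one or two 'l'.
bounded = [w for w in spellings if re.search("app?el{1,2}ar", w)]

# One or more 'p', one or more 'l': also accepts three 'p'.
unbounded = [w for w in spellings if re.search("ap+el+ar", w)]

# A group repeated exactly twice.
doubled = [w for w in ["ananas", "banana", "lontanantia", "anas"]
           if re.search("(an){2}", w)]

print(bounded)
print(unbounded)
print(doubled)
```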

Phrases, expressions and collocations in the IED

In the IED, example phrases, expressions and collocations (words that typically occur together) are given in separate lines, between the symbols ` (inverted single quote) and ' (single quote). The search instruction is: ^`.+. In this, the dot represents the notion ‘any character’.

With expressions in the IED, the interface option “Alsi linea previe” (Also previous line) serves to show more context, namely the previous line, which normally contains the lemma that belongs to the expression: like this. There is also the option “Alsi linea proxime”, which obviously displays the next line, the line after the search result.
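The idea of showing the previous line together with each result can be sketched like this. The three lines are invented in the style described above and are not quoted from the IED; the function name is hypothetical.

```python
import re

# Invented lines: a lemma, an expression line starting with `, another lemma.
ied_lines = [
    "casa n house",
    "`in casa' at home",
    "casar v to marry",
]

def matches_with_previous(pattern, lines):
    """Return (previous line, matching line) pairs."""
    regex = re.compile(pattern)
    pairs = []
    for index, line in enumerate(lines):
        if regex.search(line):
            previous = lines[index - 1] if index > 0 else None
            pairs.append((previous, line))
    return pairs

pairs = matches_with_previous("^`.+", ied_lines)
print(pairs)
```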

The variant without ^ finds references, often in connection with the so-called double-stem verbs. Alternative: searching {see}.

Phrases, expressions and collocations in nl>ia>nl

In expressions, the indication ‘~’ often takes the place of the lemma, which itself is at the start of the line. Therefore if you search for an expression with two words that occur in it, the order in the line may differ from the order in real sentences. Then the best way to find the expression is to try both orders of the words.

An example: searching aan.+niets and niets.+aan gives the best chances of finding Dutch expressions like ‘daar is niets aan te doen’ (nothing can be done about it, it is inevitable) and ‘daar is niets aan’ (that isn’t hard, or: that isn’t interesting).

Here the dot ‘.’ symbolises any character, and the plus sign ‘+’ indicates the repetition: one or many times.

Of course the two orders could also be combined in a single search instruction.

Addition 15 June 2016:
It is now no longer necessary to do this by hand, because now there is a proximity search operator. You can use APUD or NEAR (they have to be uppercase). Thus the search aan APUD niets is equivalent to, and is internally executed as, aan.+niets|niets.+aan. Much easier and more comfortable.
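The rewriting of a proximity search into an ordinary pattern might be done along these lines. It is a sketch that reproduces the expansion given above; the function name is hypothetical and the real implementation is not known to me in this form.

```python
import re

def expand_proximity(query):
    """Rewrite 'a APUD b' or 'a NEAR b' as 'a.+b|b.+a'; leave other queries alone."""
    match = re.fullmatch(r"(.+) (?:APUD|NEAR) (.+)", query)
    if not match:
        return query
    first, second = match.groups()
    return f"{first}.+{second}|{second}.+{first}"

expanded = expand_proximity("aan APUD niets")
print(expanded)

# The expanded pattern finds the expression in either word order.
print(bool(re.search(expanded, "daar is niets aan te doen")))
```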

Option Sin exemplos (without examples)

Addition 13 December 2015:
It is now also possible to suppress the usage examples of words. Of course the examples are clarifying, but their abundance can sometimes be confusing. Therefore checking the option “Sin exemplos” (Without examples) means they are no longer shown; an extra filter step removes all the lines that contain a ‘~’.
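Such an extra filter step is simple to picture. The sketch below assumes the results arrive as a list of text lines; the sample lines and the function name are invented.

```python
def without_examples(lines):
    """Drop every line that contains a '~'."""
    return [line for line in lines if "~" not in line]

results = [
    "zoeken cercar",
    "huis een ~ zoeken cercar un casa",
]
filtered = without_examples(results)
print(filtered)
```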

Only ASCII, and what else?

One of Interlingua’s fortes, especially in comparison with Esperanto, in my opinion is that Interlingua requires only the twenty-six letters of the Latin alphabet. In other words, ASCII is enough to write it, no ISO-8859-n or Unicode is necessary. Simple, clear, effective and elegant.

However there is a small number of words that are more correctly written with an accent. It is not strictly required, but usual. The words concerned are not very frequent.

Here I demonstrate how to find such words. There are 89. In the IED there are 46.

Deviant stress

IED

In the IED the acute accent is used to indicate that a word is stressed in a way different from usual.

nl>ia>nl

In the dictionaries by Piet Cleij, unusual stress is marked by underlining. In the interface we can find those via the underlying HTML codes <u> and </u> (where u means underline). Once you know that, making a search command isn’t hard: <u>..?</u>.

My extraction program, which creates the electronic material for the search, puts the words with underlined vowel between parentheses, preceded by the word without the underlining. So ‘capite’ becomes ‘capite (capite)’. The advantage is that all words can be found, even when the stress is previously unknown, without having to enter invisible codes <u> and </u>.
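The transformation described here could be approximated as follows. This is not the actual extraction program, only a guess at its effect on a single word; the function name is hypothetical, and the stressed first vowel of ‘capite’ is an assumption.

```python
import re

UNDERLINED = re.compile(r"<u>..?</u>")

def add_plain_form(word):
    """Put the plain spelling in front of a word with an underlined vowel."""
    if not UNDERLINED.search(word):
        return word
    plain = re.sub(r"</?u>", "", word)  # strip the <u> and </u> codes
    return f"{plain} ({word})"

print(add_plain_form("c<u>a</u>pite"))
print(add_plain_form("casa"))
```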

Deviant pronunciation in nl>ia>nl

In the dictionaries by Piet Cleij, pronunciations that deviate from the normal rules are indicated between { }. We can find them in this way. Because the curly brackets { and } have a special meaning, to find the characters themselves you must precede them by a backslash: \{ and \}.

In many cases the pronunciation indicated is that of the French digraph ch, imitated in Dutch by sj.

Square brackets in the IED

Some lemmas in the IED are in [ ]. They are findable in the electronic IED with this search command: \[.+\]. Because the square brackets ‘[’ and ‘]’ have a special meaning in regular expressions, to find the characters themselves you have to put a backslash before them: \[ and \].
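Why the backslashes matter becomes clear when the pattern is tried with and without them. The sample line is written in the style of a bracketed IED entry and is illustrative only.

```python
import re

line = "[ma] conj but"

# Escaped brackets: literal '[' and ']' with something in between.
bracketed = re.findall(r"\[.+\]", line)
print(bracketed)

# Unescaped: [.+] is a bracket expression matching a single '.' or '+'.
single_chars = re.findall("[.+]", "a.b+c")
print(single_chars)
```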

A more complicated search (which clearly shows that regular expressions aren’t always easy to understand at first sight), which however is more complete and reliable: (^\[[a-z].+\])|(\[[^ -\.]+\]). There are 236 results.

In June 2013 Stanley A. Mulaik told us that:
They were included in the last minute, with the help of Blair (according to a letter Dr. Gode sent to me). They were taken from other constructed languages on the condition that 'they do not seem too strange in the context of the rest of the vocabulary'.

And at the end of the Explanatory Notes in the final section of the Introduction to the IED (a part that is strangely missing in the translation into Interlingua!) there is this remark:
Bracketed Entries. – Bracketed entries are words used in one of the major traditional auxiliary languages. They are included in this Dictionary as being neither incompatible with its principles nor a necessary product of them.

Stan Mulaik also wrote:
Many Interlinguists use some of those, and some of the Latin particles. The choice is not uniform. This is subjective, because there aren’t any guidelines to common forms.

That is true: many of the IED words in [ ] today are in common use: an, ancora, anque, ci, desde, esque, ja, ma, nec, on, poc, poco, quam, ser, sera, serea, sia, sovente, tro, troppo, ulle, vamos. But not: atque, aut, donec, el, ella, esso, este, haver, homo, isse, isso, jo, magis, mi, trop, voi.

That the selection is subjective and unguided, I personally do not see as a problem, but rather as a strong point of Interlingua: it makes it richer, more flexible and varied, and thereby suitable for my purposes. And perhaps for those of others.


Only the basic vocabulary

12 December 2015: a new function: search only the 2500 words of the basic dictionary, selected from Piet Cleij’s Interlingua-Dutch dictionary.

Here is more information (in Interlingua and Dutch).


Postfilter

21 July 2019

Situation and reason for adding this: while translating a little article of mine, the question was: “how to translate to Interlingua the Dutch phrasal verb ‘eruithalen’, in the special sense of …”, and that I didn’t know how to say in Interlingua (nor easily in English).

How is that officially written in Dutch, er uithalen, eruit halen, eruithalen? And more importantly, how did the compiler of the dictionary think it should be written? Where do I look, under halen, uithalen, eruithalen, halen APUD uit?

The simplest way is to search for ‘halen’. But the dictionary of Piet Cleij is so extensive, that this produces a list of 150 lines! Is it really necessary to wade through all those results to see if the special sense has been treated?

This made me think: “if from this list I could only see the lines that also contain the character sequence ‘uit’, life would be so much easier!” And this I have done: let the computer work for me, so that after the first egrep to extract results from the dictionary file, it does a second egrep to show only the lines that contain ‘uit’! The reduction is from 510 to 78 lines!

Something with ‘tirar’ or ‘(ex)traher’ will be the translation sought.

Later addition (19 January 2020): instead of selecting lines from the results, it is now also possible to exclude them, by ticking the checkbox “exclude”.
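The two-step search, with its exclude variant, can be sketched as follows. Python’s re module again stands in for the two egrep runs; the function names and the sample lines are invented for illustration.

```python
import re

def search(pattern, lines):
    """First step: lines from the dictionary that match the search pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def postfilter(results, pattern, exclude=False):
    """Second step: keep (or, with exclude=True, drop) results matching the filter."""
    regex = re.compile(pattern)
    return [line for line in results
            if (regex.search(line) is not None) != exclude]

# Invented sample lines.
dictionary = [
    "halen prender, ir a cercar",
    "eruit halen extraher, tirar",
    "uithalen extraher",
    "ophalen colliger",
]

first = search("halen", dictionary)
kept = postfilter(first, "uit")
dropped = postfilter(first, "uit", exclude=True)
print(kept)
print(dropped)
```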
