Evaluation of DVX – Proposal: AutoJoin

25 December 2008

Summary

There are situations in which it is useful to reimport a file into Deja Vu X (DVX) although part of the translation has already been done. From the information in the MDB (Memory DataBase), it should be possible to recover most or all of the earlier translations as exact matches.

To get this closer to "all" than to "most", I propose the algorithm AutoJoin, for inclusion in DVX as an option.

This is an early draft only and needs reworking.

Note: AutoJoin could also be useful in combination with AutoPropagate!

Situations

Situations in which a re-import may be useful include the following:

  1. Word file: the client suddenly remembers that they want an uncleaned Trados file. (Bad example, because joining is not possible in Trados Workbench files imported into DVX. Should or could it be made possible in this case?)

  2. Large number of typos and spelling errors. Better to correct these in the source file first, to avoid MDB fouling.

  3. Hard returns caused by an incompetent writer or by conversion from PDF, directly or via OCR. They were removed before import, but in an annoyingly large number of places the removal was overlooked.

  4. RTF/Doc issue, see 2081221a (hyperlink)

Reasons for non-perfect recovery of earlier translations

  1. Autosend to MDB not switched on.

  2. Minimum length of segments to be saved in the MDB. (Does that setting still exist in DVX, or in DV3 only?)

  3. Numbers in segment: not counted for minimum length? (DV3, retest for DVX).

  4. Manual joins during the translation process, for example where the text was split at:

    • nr., nrs., no., nos.
    • bzw.
    • hard returns (PDF, OCR; already mentioned under Situations)
    • colons
    • semicolons

AutoJoin Algorithm

   for each source segment in the project:
      while ((current segment gives a fuzzy match from the MDB)
               and
             (current segment, disregarding codes, exactly matches
               the first part of the MDB match found,
               i.e. the MDB match is longer but otherwise the same)
            )
      {
         tentatively join the next segment in the project
            to the current segment
         if the result is an exact match (disregarding codes):
            make the join(s) permanent in the project file, and stop
         if the result is no longer a partial match of the same type
            (with a longer initial part that matches exactly):
            undo the tentative join(s), and stop
         otherwise: try again with the following segment
      }

Note: the above pseudocode is only meant to make the idea clear. It still needs to be worked out in the direction of a proper real implementation.

The fuzzy match on the longer MDB segment should take precedence over a possible shorter exact match, which may be present due to an earlier "send to MDB" before the manual join was done.
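
The idea can be tried out in a few lines of Python. The following is only a rough sketch under stated assumptions, not how DVX works internally: the MDB is represented as a plain list of source strings, joined segments are assumed to be separated by a single space, embedded codes are assumed to look like {1}, and the names normalize, has_longer_match and auto_join are made up for this illustration. A longer MDB entry takes precedence over a shorter exact match, as described above.

```python
import re
from typing import Iterable, List, Set

# Assumption: embedded codes look like "{1}", "{27}", ...; the real format may differ.
CODE_PATTERN = re.compile(r"\{\d+\}")


def normalize(text: str) -> str:
    """Strip the (assumed) embedded codes and collapse whitespace."""
    return " ".join(CODE_PATTERN.sub("", text).split())


def has_longer_match(candidate: str, mdb: Set[str]) -> bool:
    """True if some MDB source starts with the candidate but is longer."""
    key = normalize(candidate)
    return any(m.startswith(key) and len(m) > len(key) for m in mdb)


def auto_join(segments: List[str], mdb_sources: Iterable[str]) -> List[str]:
    """Join consecutive segments when the join exactly reproduces an MDB source."""
    mdb = {normalize(s) for s in mdb_sources}
    result: List[str] = []
    i = 0
    while i < len(segments):
        joined = segments[i]      # what is kept if no join succeeds
        end = i + 1               # index just past the last segment consumed
        candidate = segments[i]   # tentative join, grown one segment at a time
        j = i + 1
        # Keep joining while a longer MDB source still starts with the candidate.
        while j < len(segments) and has_longer_match(candidate, mdb):
            candidate = candidate + " " + segments[j]
            j += 1
            if normalize(candidate) in mdb:
                joined, end = candidate, j   # exact match: make the join permanent
                break
        result.append(joined)
        i = end
    return result


if __name__ == "__main__":
    project = [
        "What you could do is the following:",
        "e.g.",
        "you could continue and try again.",
    ]
    memory = [
        "What you could do is the following: e.g. you could continue and try again.",
        "What you could do is the following:",  # shorter entry sent before the join
    ]
    print(auto_join(project, memory))  # the three segments come back as one
```

If the tentative joins never reach an exact match, the sketch leaves the segments as they were, so a failed attempt changes nothing in the project.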

Real life example:

Well, real life, I made it up. That's why it sounds stupid and looks ugly. But it may still make sense. I hope.

Original source sentence:

What you could do is the following: e.g. you could continue and try again.

Segmented as:

What you could do is the following:
e.g.
you could continue and try again.

Partial translation to Dutch:

Then the translator decides to perform two joins (Ctrl-J), and translates the result as a single segment:

The MDB (as a result of AutoSend) now contains the following pairs:

Now, for some unrelated reason, the translator reimports the source document and pretranslates, hoping to recover the translations already done. Pretranslate finds the partial source segment:

Reverse case: two consecutive sentences (project ATS 20090121; why were they segmented like that?) that both occur exactly in the MDB, yet it does not even find a fuzzy match, and instead assembles a translation from all kinds of loose pieces.

Severity

Workarounds are available, so the problem is not severe.

History

The issue applies to DVX 7.5.303, and also applied to DV3 builds such as 3.0.38.