Last weekend I spent an hour looking through books in our
home library that I had not pulled out for a while. One of them, a work
on linguistics and Bible translation,[1] started with these two paragraphs:
Few of us even think of the extraordinary complexity of
the speech process. Most of us simply talk. True enough we talk with
varying degrees of fluency, but every normal child learns to talk, and
this is quite a remarkable achievement.
Consider the simple fact that there is an infinite
number of sentences potentially available in each language known to us.
The sentence which stands at the beginning of this paragraph has
probably never before been written down or spoken or even thought. Very
similar sentences have been produced before, and other similar
sentences will be produced again, but it is probable that until this
chapter was written no one had ever before written or said: "Consider
the simple fact that there is an infinite number of sentences
potentially available in each language known to us."
When I first read this I found myself marveling at the
incredible variety of human expression through language. But that
amazement came to a screeching halt when I realized what this statement
means for the translation technology that I've been talking about for
years and that many of us are using: translation memory.
To test the statement's validity, I rushed to my computer to perform a
Google search on the sentence in question. Not a single hit! Indeed,
chances are that the two occurrences of the sentence above were truly
the only times this particular sentence was ever written. (Of course,
now it has been mentioned in this article as well, so it will show up
in Google from now on.)
What does this mean for us? A translation memory in its
simplest form is, of course, nothing more than a collection of
translated segments, most typically in sentence form. We all
know that some kinds of texts have a fairly high degree of repetition,
including instructional text, legal boilerplate, and certain medical
phrasing, but in the majority of texts the repetition does not happen
on the sentence level.
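To make that "collection of translated segments" concrete, here is a minimal sketch of a translation memory with exact-match lookup. The segment pairs are invented for illustration; real tools store far richer metadata per unit:

```python
# A translation memory in its simplest form: translated segment pairs,
# keyed by the source sentence, with exact-match lookup.
translation_memory = {
    "Click OK to continue.": "Klicken Sie auf OK, um fortzufahren.",
    "Save your changes before exiting.": "Speichern Sie Ihre Änderungen vor dem Beenden.",
}

def exact_match(source_segment):
    """Return the stored translation, or None if this exact sentence was never seen."""
    return translation_memory.get(source_segment)

print(exact_match("Click OK to continue."))  # the exact sentence repeats: a hit
print(exact_match("Click OK to proceed."))   # a slightly different sentence: no hit
```

The second lookup fails even though only one word differs, which is precisely why exact matching alone leverages so little of a typical text.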
Until very recently, the translation memory component in
most translation environment tools was stuck at almost exactly the
same place it had occupied in the early 1990s. Perfect and fuzzy
matching on the segment level and manual concordance searches through
the translation memory were essentially the only ways to get to the
data, meaning that most of the data in our translation
memories was doomed to the life of Sleeping Beauties: lots of
data, all slumbering, most of it beautiful.
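The segment-level fuzzy matching of that old paradigm can be sketched in a few lines. Here difflib's similarity ratio stands in for the proprietary scoring a real tool would use, and the segments are again invented:

```python
# Segment-level fuzzy matching: score every stored source segment against
# the new sentence and return the best pair if it clears a threshold.
# difflib's ratio() is only a stand-in for a real tool's scoring formula.
import difflib

def fuzzy_match(source_segment, memory, threshold=0.5):
    best_score, best_pair = 0.0, None
    for stored_source, stored_target in memory.items():
        score = difflib.SequenceMatcher(None, source_segment, stored_source).ratio()
        if score > best_score:
            best_score, best_pair = score, (stored_source, stored_target)
    return (best_pair, best_score) if best_score >= threshold else (None, best_score)

memory = {"Click OK to continue.": "Klicken Sie auf OK, um fortzufahren."}
pair, score = fuzzy_match("Click OK to proceed.", memory)  # a near-miss becomes a fuzzy hit
```

Note that the whole comparison still happens sentence against sentence: a fuzzy match helps with near-identical segments, but it never surfaces the useful fragments buried inside otherwise dissimilar sentences.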
Only in the last year or two have things started to
change. Partly due to increased competition and partly due to
frustration on the side of us consumers about the restricted usefulness
of the old paradigms, tool developers have been forced to look at their
existing technology to try to find ways to extend its use.
Hold on to that thought while we investigate one area in
which the use of translation memories has traditionally not been fully
exploited. Let's look at our example sentence again:
- "Consider the simple fact that there is an infinite
number of sentences potentially available in each language known to us"
has 0 Google hits[2]
- "Consider the simple fact" has about 108,000 Google hits
- "an infinite number of sentences" has 64,800 Google hits
- "potentially available in each" has 41,300 Google hits
- "language known to us" has 35,600 Google hits
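The pattern in those hit counts is easy to reproduce: a single sentence contains many short word sequences, each a separate chance of finding a match. A quick sketch of extracting every three- to five-word subsegment from our example sentence:

```python
# Break a sentence into all word n-grams of 3 to 5 words. Each n-gram is a
# potential subsegment match, while the full sentence is only one string.
def subsegments(sentence, min_len=3, max_len=5):
    words = sentence.split()
    return [" ".join(words[i:i + n])
            for n in range(min_len, max_len + 1)
            for i in range(len(words) - n + 1)]

sentence = ("Consider the simple fact that there is an infinite number "
            "of sentences potentially available in each language known to us")
chunks = subsegments(sentence)
print(len(chunks))  # 51 candidate subsegments from one 20-word sentence
```

"Consider the simple fact" and "language known to us" are both among those 51 chunks, which is why the fragments find tens of thousands of hits while the full sentence finds none.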
I know that we're typically not worried about what is or
is not available through search engines. But as translators we should
be concerned about how we can reuse data that has already been
translated, either by us or by somebody else. And the numbers above
show us that there is an infinitely greater likelihood of finding
matches for segments within a sentence than for the complete sentence.
As it so happens, this is exactly the one area that
almost all tools developers have recently focused on: the extended and
(semi-) automated use of so-called subsegments, or segments within
whole sentences. Interestingly, though, the developers have approached
this goal from very different angles. Here are some examples:
- Trados Studio uses so-called AutoSuggest dictionaries to
distill data from translation memories and then offers suggestions
based on the source segment and the first few keystrokes.
- memoQ performs subsegment searches based on its Longest
Substring Concordance (LSC) technology to suggest subsegments in its
regular search pane.
- Déjà Vu X2 uses its DeepMiner technology to analyze matches of
subsegments between source and target based on number of occurrences
and uses those for a variety of things, including the correction of
fuzzy matches.
- Star Transit looks into the target part of its reference
materials (Transit's equivalent of a translation memory) if no
matches in the source are found and makes suggestions based on your
first few keystrokes.
There are two other tools—Lingotek and Multitrans—that
have used subsegment searches for a long time, but partly due to their
lack of market penetration they have not been able to make the splash
that the better-known tools are now creating.
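The general idea behind all of these features, though certainly not the actual algorithm of any tool named above, can be illustrated with a naive search for the longest run of words that a new sentence shares with the memory. The example segments are invented:

```python
# A rough illustration of subsegment concordance: find the longest word
# sequence from the new sentence that occurs anywhere in the stored
# source segments. Real tools use indexed, far faster approaches.
def longest_shared_subsegment(sentence, tm_sources):
    words = sentence.split()
    # Try the longest spans first and stop at the first hit.
    for length in range(len(words), 0, -1):
        for i in range(len(words) - length + 1):
            chunk = " ".join(words[i:i + length])
            if any(chunk in source for source in tm_sources):
                return chunk
    return None

tm_sources = ["Please save your changes before closing the window.",
              "Click OK to continue with the installation."]
print(longest_shared_subsegment(
    "You must save your changes before closing the file.", tm_sources))
```

Even though neither stored sentence matches the new one as a whole, the six-word fragment "save your changes before closing the" surfaces, and its stored translation can be offered to the translator.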
What does this mean for our practical work?
The first thing that comes to mind is that the quality
of the materials within a translation memory is more important than
ever. Whereas before we might not have had to worry about the existence
of poorly translated segments within our translation
memory—chances were that the prince would never show up to wake
those Sleeping Beauties anyway—now every translation unit or
segment pair within a translation memory is actually being used by the
subsegmenting abilities of our tools. If you used to roll your eyes
about "translation memory maintenance" as "one of those things that one
should do" but you never did anything about it, you now might actually
have to start considering taking action. A good starting place might be
to look at Olifant,
the most powerful (and free) translation memory maintenance tool that
is currently available.
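As a taste of what such maintenance involves, here is a toy check for one common problem: a source segment with conflicting translations. This is only an illustration of the category of checks, not how Olifant works internally, and the entries are invented:

```python
# Flag source segments that carry more than one distinct translation in
# the memory -- exactly the kind of inconsistency that subsegment
# suggestions will now surface in front of the translator.
from collections import defaultdict

def inconsistent_sources(tm_pairs):
    targets_by_source = defaultdict(set)
    for source, target in tm_pairs:
        targets_by_source[source].add(target)
    return {src: tgts for src, tgts in targets_by_source.items() if len(tgts) > 1}

tm_pairs = [
    ("Cancel", "Abbrechen"),
    ("Cancel", "Stornieren"),  # same source, conflicting translation
    ("Save", "Speichern"),
]
print(inconsistent_sources(tm_pairs))  # only "Cancel" is flagged
```

With the old paradigm an inconsistency like this might never have been noticed; with subsegment leverage, both variants will start showing up in suggestions, so cleaning them up pays off immediately.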
The second arena where the new subsegmenting feature
comes into play is in gaining a renewed understanding of the usefulness
of accessing data from external sources. While aligning previously
translated materials from the same client might not have always been a
cost-effective process with the old translation memory paradigm of
perfect and fuzzy matches, subsegmenting gives a much higher leverage
and dramatically increases the usefulness of a variety of data sources.
Aside from aligned data and well-maintained legacy translation
memories, data resources such as the ones offered by TDA
also prove to be more useful and productive.
So, yes, the authors of that long-forgotten book in my
library might have been right in regard to many complete sentences. But
our technology has just given us access to the smaller building blocks
of language, and it would be a shame not to use those.
1. Peter Cotterell & Max Turner: Linguistics & Biblical Interpretation.
InterVarsity, 1989.
2. These and all other Google statistics were generated on September 13,