Building Blocks

by Jost Zetzsche, Ph.D.

Last weekend I spent an hour looking through books in our home library that I had not pulled out for a while. One of them, a work on linguistics and Bible translation,¹ started with these two paragraphs:

Few of us even think of the extraordinary complexity of the speech process. Most of us simply talk. True enough we talk with varying degrees of fluency, but every normal child learns to talk, and this is quite a remarkable achievement.

Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us. The sentence which stands at the beginning of this paragraph has probably never before been written down or spoken or even thought. Very similar sentences have been produced before, and other similar sentences will be produced again, but it is probable that until this chapter was written no one had ever before written or said: "Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us."

When I first read this I found myself marveling at the incredible variety of human expression through language. But that amazement came to a screeching halt when I realized what this statement means for the translation technology that I've been talking about for years and that many of us are using: translation memory.

To test the statement's validity, I rushed to my computer to perform a Google search on the sentence in question. Not a single hit! Indeed, chances are that the two occurrences of the sentence above were truly the only times this particular sentence was ever written. (Of course, now it has been mentioned in this article as well, so it will show up in Google from now on.)

What does this mean for us? A translation memory in its simplest form, of course, is nothing more than a collection of translated segments, most typically in sentence form. We all know that some kinds of texts have a fairly high degree of repetition, including instructional text, legal boilerplate, and certain medical phrasing, but the repetition in the majority of texts does not happen on the sentence level.

Until very recently, the translation memory component in most translation environment tools was stuck at almost exactly the same place it had occupied in the early 1990s. Perfect and fuzzy matching on the segment level and manual concordance searches through the translation memory were essentially the only ways to get to the data, meaning that the largest amount of data in our translation memories was doomed to the life of Sleeping Beauties: lots of data, all slumbering, most of it beautiful.
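To make that old paradigm concrete, here is a minimal Python sketch of segment-level matching. It does not describe how any particular tool works: the two-entry memory, the `lookup` function, and the 75% threshold are all made up for illustration, and the similarity score comes from Python's standard `difflib` module rather than from any commercial matching algorithm.

```python
from difflib import SequenceMatcher

# Toy translation memory (hypothetical data): source segment -> target segment.
TM = {
    "Press the power button to turn on the device.":
        "Drücken Sie die Ein/Aus-Taste, um das Gerät einzuschalten.",
    "Do not expose the device to water.":
        "Setzen Sie das Gerät keinem Wasser aus.",
}


def lookup(segment, tm, threshold=0.75):
    """Return (score, source, target) for the best-scoring stored segment.

    A score of 1.0 is a perfect match; anything between the threshold and
    1.0 is a fuzzy match. Returns None when nothing reaches the threshold.
    """
    best = None
    for source, target in tm.items():
        # Character-level similarity between 0.0 and 1.0, ignoring case.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None
```

With this memory, "Press the power button to turn off the device." comes back as a fuzzy match for the first entry, while the sentence quoted from the linguistics book comes back as nothing at all. Whole-segment matching has nothing to offer for a sentence that has never been written before.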

Only in the last year or two have things started to change. Partly due to increased competition and partly due to frustration on the side of us consumers about the restricted usefulness of the old paradigms, tool developers have been forced to look at their existing technology to try to find ways to extend its use.

Hold on to that thought while we investigate one area in which the use of translation memories has traditionally not been fully exploited. Let's look at our example sentence again:

"Consider the simple fact that there is an infinite number of sentences potentially available in each language known to us" has 0 Google hits²
"Consider the simple fact" has about 108,000 Google hits
"an infinite number of sentences" has 64,800 Google hits
"potentially available in each" has 41,300 Google hits
"language known to us" has 35,600 Google hits

I know that we're typically not worried about what is or is not available through search engines. But as translators we should be concerned about how we can reuse data that has already been translated, either by us or by somebody else. And the numbers above show us that there is an infinitely greater likelihood of finding matches for segments within a sentence than for the complete sentence.

As it so happens, this is exactly the one area that almost all tool developers have recently focused on: the extended and (semi-)automated use of so-called subsegments, or segments within whole sentences. Interestingly, though, the developers have approached this goal from very different angles. Here are some examples:

Trados Studio

memoQ

Déjà Vu X2

Star Transit

There are two other tools, Lingotek and Multitrans, that have used subsegment searches for a long time, but partly due to their lack of market penetration they have not been able to make the splash that the better-known tools are now creating.
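The general idea behind subsegment lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by any of the tools named above: the two translation units, the function names, and the choice of three-word phrases are all hypothetical. The sketch indexes every three-word run in the source side of the memory and then reports which runs of a new sentence have been seen before.

```python
import re
from collections import defaultdict

# Toy translation memory (hypothetical data): a list of (source, target) units.
TM = [
    ("There is an infinite number of sentences in any language.",
     "Es gibt unendlich viele Sätze in jeder Sprache."),
    ("This is the oldest language known to us.",
     "Dies ist die älteste uns bekannte Sprache."),
]

SENTENCE = ("Consider the simple fact that there is an infinite number of "
            "sentences potentially available in each language known to us.")


def words(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(tm, n=3):
    """Map every n-word run in a source segment to the units containing it."""
    index = defaultdict(set)
    for unit_id, (source, _target) in enumerate(tm):
        for gram in ngrams(words(source), n):
            index[gram].add(unit_id)
    return index


def subsegment_hits(sentence, tm, index, n=3):
    """Return {shared phrase: [(source, target), ...]} for a new sentence."""
    hits = {}
    for gram in ngrams(words(sentence), n):
        if gram in index:
            hits[" ".join(gram)] = [tm[i] for i in sorted(index[gram])]
    return hits


index = build_index(TM)
hits = subsegment_hits(SENTENCE, TM, index)
for phrase, units in hits.items():
    print(phrase, "->", units[0][1])
```

Neither stored unit is anywhere near a whole-sentence match for the new sentence, yet phrases such as "an infinite number" and "known to us" are found. Note what the sketch leaves out: it returns the whole target segment in which a phrase was translated, much like an automated concordance search. Picking out the matching fragment on the target side requires some form of alignment, which is not attempted here.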

What does this mean for our practical work?

The first thing that comes to mind is that the quality of the materials within a translation memory is more important than ever. Before, we might not have had to worry about poorly translated segments within our translation memory (chances were that the prince would never show up to wake those Sleeping Beauties anyway), but now every translation unit or segment pair within a translation memory is actually being used by the subsegmenting abilities of our tools. If you used to roll your eyes at "translation memory maintenance" as "one of those things that one should do" but never did anything about it, you might now actually have to take action. A good starting place might be to look at Olifant, the most powerful (and free) translation memory maintenance tool that is currently available.

The second arena where the new subsegmenting feature comes into play is in gaining a renewed understanding of the usefulness of accessing data from external sources. While aligning previously translated materials from the same client might not always have been cost-effective under the old translation memory paradigm of perfect and fuzzy matches, subsegmenting gives much higher leverage and dramatically increases the usefulness of a variety of data sources. Aside from aligned data and well-maintained legacy translation memories, data resources such as the ones offered by TDA or MyMemory might also prove to be more useful and productive.

So, yes, the authors of that long-forgotten book in my library might have been right in regard to many complete sentences. But our technology has just given us access to the smaller building blocks of language, and it would be a shame not to use those.

¹ Peter Cotterell & Max Turner: Linguistics & Biblical Interpretation. InterVarsity, 1989.

² These and all other Google statistics were generated on September 13, 2011.