You can view earlier editions of the Tool Box Journal going all the way back to 2007
in the archives, to which you have access if you support my work on the Journal.


 A computer journal for translation professionals

Issue 19-2-297
(the two hundred ninety-seventh edition)


1. The End of Metadata

2. Humility is Good

3. MMT

The Last Word on the Tool Box

Mad Men

Reader Sarah Hudson sent me the following excerpt from the TV show Mad Men last week. Many of you will remember this scene. Sarah says: "I was thinking about your newsletter the other day when I was watching the TV series Mad Men. There was a particular scene that really resonated, and I think it sums up the way some translators feel about developments in technology. I thought you would appreciate it."

Here it is:

It is the 1960s, and we are looking in on the offices of a smart advertising agency on Madison Avenue in New York. The main character of the series, Don Draper, is having a conversation with an engineer from IBM who is installing a revolutionary new machine called a computer, which pretty much fills a whole office, as it is the size of a small car. The conversation goes as follows:

IBM engineer: I go into businesses every day and it has been my experience that these machines can be a metaphor for whatever is on people's minds.

Draper: Because they are afraid of computers?

IBM engineer: Yes, this machine is frightening to people, but it's made by people.

Draper: People aren't frightening!

IBM engineer: It's not that. It's more of a cosmic disturbance. This machine is intimidating because it contains infinite quantities of information, and that is threatening because human existence is finite. But isn't it God-like that we have mastered the infinite? The IBM 360 can count more stars in a day than we can count in a lifetime.

Draper: But what man lay on his back counting stars and thought about a number? He probably thought about going to the moon.

Clever, huh?


One more thing before we launch into this (I think) interesting Tool Box Journal. I have mentioned the TranslationTalk Twitter account before, and it continues to be a source of great joy for me and the other 3,000+ followers. Now there is also an archive where you can view past tweets as well as the curators for each. You can find it here.


Less than one week to register for Memsource's Bay Area Meetup!

Learn about leveraging innovative translation tech in global businesses and network with localization professionals in the area.  

Save your seat now


1. The End of Metadata 😢

TAUS, the Translation Automation User Society, just celebrated its 10th anniversary, and chances are you are aware of its existence. If not, the easiest way to describe it might be this: It's an association of very large translation buyers and some other organizations and companies that tries to advance translation technology, in particular machine translation. It does so by educating its members and other stakeholders, creating networking opportunities, providing a framework to test (machine-)translated data, and collecting and disseminating data among its members (its four "pillars," accordingly, are Connect, Discover, Measure, and Improve).

TAUS's passionate director is Jaap van der Meer, who truly is an industry veteran (if ever there was one). It's probably fair to say that his vision drives TAUS, though I'm sure it's ably co-piloted by its 15 employees and its members. The fate of the visionary, however, is that some ideas work and others don't. And so it has been with TAUS. Some early ideas of data sharing, for instance, did not work out as they were first envisioned, but other concepts like the DQF, the Data Quality Framework, are widely used today, and the terminology search of the TAUS Data cloud has a loyal following among those who have discovered it (as you should if you haven't yet done so) and integrated it into their workflow (this, by the way, is a very generous donation to the translator community, if you ask me).

One of the original core missions of TAUS was data exchange between members or, added at a later point, the ability to purchase data. A taxonomy was created that allowed you to select the data you needed with the help of defining metadata, or data about the data. You could select from 17 different industries as well as data owners and content type (the same kind of fields that are also available in the terminology search tool mentioned above). This seemed to be a reasonable approach if your goal was to find data to feed and train very data-hungry statistical machine translation engines. Just recently, though, a paradigm change was introduced that suggests that filtering by metadata actually might be harmful rather than helpful in some cases.

On one level this is easy to understand. When I talked with Jaap, he mentioned the example of a company that produced software for financial transactions. Should the data they submit to the system be submitted as belonging to the software industry or financial industry? And vice versa, what should they look for if they look for data? Unless you have very sophisticated, multi-layered filter mechanisms, you are guaranteed to miss out on some data and receive some that is less than helpful.

So recently, when one large member company, Oracle, was on the hunt for "colloquial data" -- non-specialized data for chatbots, etc. -- TAUS discovered three things:

  • There actually was a lot of that kind of data to be found -- even buried in the 35 billion words of an otherwise rather specialized TAUS corpus.
  • That data could not be found with the traditional metadata filters, but only by submitting a sample corpus on the basis of which the required data could be assembled.
  • This newly unearthed data might be better training data for neural machine translation anyway.
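To make the idea concrete, here is a minimal sketch -- my own, not TAUS's actual algorithm -- of selecting data by similarity to a sample corpus rather than by metadata labels, using character n-gram overlap as a crude stand-in for whatever similarity measure Matching Data really uses:

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams of a lowercased string (a crude content signature)."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))

def matching_data(sample_corpus, candidates, threshold=0.5, n=3):
    """Select candidate segments that resemble the sample corpus,
    ignoring whatever metadata the candidates may carry."""
    # Build one n-gram profile for the whole sample corpus.
    profile = Counter()
    for seg in sample_corpus:
        profile.update(ngrams(seg, n))
    selected = []
    for seg in candidates:
        grams = ngrams(seg, n)
        if not grams:
            continue
        # Fraction of the candidate's n-grams also found in the sample.
        shared = sum(min(c, profile[g]) for g, c in grams.items())
        if shared / sum(grams.values()) >= threshold:
            selected.append(seg)
    return selected
```

Fed a sample of colloquial chat, such a filter would pull conversational segments out of a mixed corpus even when their metadata says "software" or "finance" -- which is exactly why the metadata labels stop mattering.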

In response, a new system was launched and named Matching Data.

I'm not building my own machine translation engine, and I'm also not considering purchasing third-party data from TAUS, but I'm still really intrigued by these findings.

We (and I definitely include myself here) have always preached about the need for metadata. On the level of the translator, that certainly was the case for termbases and translation memories. I always said, "If you don't enter metadata along with your term pairs (in the termbase) or translation units (in the translation memory), these records will become useless over time because you won't be able to evaluate how applicable they are for later projects and you lose your ability to maintain your resources."

For termbases this is certainly still the case. Unless you are interested in having only per-project glossaries, you will have to spend more time entering data than just committing a term pair.
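As a minimal illustration of what such metadata makes possible -- the field names here are my own invention, not any particular tool's schema -- consider how a termbase record carrying client and domain fields can be filtered for later projects, while a bare term pair cannot:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TermEntry:
    source: str
    target: str
    # Metadata fields that keep the record maintainable over time
    # (illustrative names, not any tool's actual schema):
    client: str = ""
    domain: str = ""
    status: str = "unverified"   # e.g. "verified", "deprecated"
    added: date = field(default_factory=date.today)

def usable_terms(termbase, client, domain):
    """Without client/domain/status metadata, this kind of filtering --
    i.e., judging a record's later applicability -- would be impossible."""
    return [t for t in termbase
            if t.client == client and t.domain == domain
            and t.status != "deprecated"]
```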

But translation memories? I think this is an interesting question that will have to be looked into -- I can certainly see a number of upcoming master's theses dealing with it -- especially because we have been working with translation memory technology that is fundamentally different from that of 10 years ago: we translators (and most technology vendors) have realized that, aside from helpful perfect matches, the true value of TMs is not in long fuzzy matches but in subsegments, which, much like in TAUS's Matching Data product, really are matched on a granular rather than a metadata level.
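A toy sketch of that subsegment idea -- my own brute-force simplification, not any vendor's actual algorithm: instead of scoring whole segments as fuzzy matches, we look for word runs the new sentence shares with the TM:

```python
def subsegment_matches(tm_sources, query, min_len=3):
    """Collect word runs of at least min_len words that a query sentence
    shares with any TM source segment (a brute-force toy version)."""
    q = query.lower().split()
    hits = set()
    for seg in tm_sources:
        s = seg.lower().split()
        # Compare every start position in the query with every start
        # position in the TM segment and extend the match as far as it goes.
        for i in range(len(q)):
            for j in range(len(s)):
                k = 0
                while i + k < len(q) and j + k < len(s) and q[i + k] == s[j + k]:
                    k += 1
                if k >= min_len:
                    hits.add(" ".join(q[i:i + k]))
    return hits
```

Note that no metadata enters into this at all: the match is decided purely by the content of the segments, which is the granular level the paragraph above describes.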

Back to TAUS. The genesis of the Matching Data product offering actually goes back a little further than the Oracle story (which you can see described and presented right here). In the "DatAptor" project, TAUS had been collaborating with the University of Amsterdam to look into the very possibility of using data to find other data, so this idea is not as off-the-cuff as it might seem.

In practice, anyone can request data by uploading a sample corpus, for which three different sets of data will be produced in TMX format. The three sets increase in size and decrease in quality (relative to the sample data). You don't have to commit to actually purchasing or paying anything until you are presented with samples of the resulting corpora (you can see examples of that when you open the Oracle example linked to above). I'm not sure that for translation memory purposes the review of a sample will truly help in making a decision for or against purchasing data (after all, as an LSP, say, I'm primarily interested in how I can increase my match rate, and I won't be able to tell that from looking at a subset of the data), but for machine translation training purposes, the review process might be helpful.

Once the corpora are created, they are also offered as "libraries" to other potential purchasers.

Two additional interesting things came up when Jaap and I talked. One is a difference in IP concerns. Previously, when data was assembled and purchased, the originator of the data was not only visible, but its name was part of the metadata used to select the data. Now, with this new way of selecting data, all metadata is stripped, including the name of the originating company, which should make it seem less risky for companies to offer their data for purchase.

In addition, with this method you will likely find certain data that is particularly popular (if this model "takes off"). By definition, popular data is more valuable and should be compensated differently. The way to track that?

Hello, blockchain!

For those of you who are closely following developments in the world of translation, you know that the use of blockchain (the cryptography system that tracks each transaction of a record like it does for cryptocurrencies) has been discussed a lot for use in translation-related contexts. While most people (moi included) think it's mostly a bunch of hot air to grab attention, here it actually might make sense -- maybe. When I brought this up to Jaap, he said that they had already thought of it under the moniker of "Marketplace" (see under point 7 right here). And aren't you all proud of me that I didn't say "Hey, that's exactly what Donna Parrish and I named our company when we started TM Marketplace in 2005." Oh, well.



Terminotix announces SynchroTerm 2019 which offers the following new features:

  • In order to be compliant with the European Commission's language requirements, the following languages have been added: Gaelic, Maltese, Estonian, and Croatian.
  • Systran PNMT integration: 1) allows you to create Systran dictionary export formats; 2) allows you to create terminology pre-processor and/or post-processor normalization lists.
  • LogiTerm Web integration for sharing and collaborating on terminology extraction projects among a team of terminologists.
  • A terminology recall index for comparing machine translation output.
  • Greatly improved navigation in the terminology creation user interface.


2. Humility is Good

Two months ago I talked with Christian Weih-Sum, Chief Sales Officer at Across, and I intended to report on our conversation and the newly released Across Translator Edition 7 right afterward. But then all kinds of other things came up to fill the Tool Box Journals between our conversation and now. At this point, version 7 has actually been released, so I assume all the Across users among you have already had a chance to use it, but some of what I learned is still worth mentioning.

Many of you remember that access to the translator edition of Across used to be free. This changed at the end of 2015, when a paid membership to crossMarket was required to have a standalone version of the Across Translator Edition. (It was standalone in the sense that you could use it for your own projects as well as those of clients, including for several clients at a time.) The membership costs 19 euros/month, and crossMarket now has a total of more than 17,000 members. It's good to know that number since it gives us an idea of how widely used Across is -- relatively widely, that is, considering that its core market is Europe and in particular the DACH region. Christian was very candid about numbers and growth, including that crossMarket showed great growth the first year, then slowed somewhat due to some personnel changes, but is now again seen as a "growth sector" -- including for non-users of Across. Now the idea is to integrate crossMarket into Language Server -- the Across component that clients use -- so that clients can find providers right within their tool. I think it's kind of a no-brainer, and while it's something that others have tried and failed with, it might work really well for Across. From the very beginning I have said that Across might be the only example of a tool provider that also offers a reasonable marketplace/job board, and while the jury is still out, I think they're on a good path here (the other tool with some potential in that area is possibly Smartcat).

Speaking of stuff "you might remember" about Across, it was interesting to hear Christian talk about the past as something of a burden to Across -- especially among those who used Across in the past, formed an opinion, and have not used it since but are still happy to share their negative experience with, say, version 3.5. (And "version 3.5" is not really such a random example. I also wrote about it and it was not good...) It's not unlike Microsoft's experience after poor releases of Windows (think Windows Me or Windows 8). According to Christian, this is a hard impression to change, but it is changing, especially with young translators who have no problem with Across today.

One thing that is different about this version (and at least one or two before it) is that the user groups Across consults with to shape its products now include freelance translators (which may very well point to the fact that there is something to paying for a product, like "being taken more seriously").

You can read about some of the features aimed at greater usability right here. The ones I found most interesting (not from the perspective of other tools but from the perspective of the previous version of Across) are the inclusion of various new MT plugins (Google Translator, KantanMT, Systran, DeepL, Moses), the commitment to develop other plugins for free if there is a request and more than one customer who wants to use it, and the (limited) possibility of simultaneously opening several tasks.

One thing I have admired about Across for a while is the full feature parity between its desktop and browser-based interfaces for the translator. This is an area where they are ahead of any of their competitors. I asked Christian who uses which and to what degree, and he said that since the clients control which interface the translators use, the clients who have translators use the browser interface are those interested in "process authority," who don't want to send out any data and need an easier setup for projects.

Since process control has been an important selling point for the Across Language Server for years, it's not too surprising that many customers on the client side are still interested in it. The two items that they are more interested in than any other are, according to Christian, "data security" and "process automation." (Of course, this might have something to do with the fact that the majority of Across customers are in German-speaking countries....) It's not that these would not be relevant for users of other tools, but I would expect those to rate "cost" and "quality" just as high.

Many Across customers are using machine translation in some form today, and there's one interesting feature that I had not encountered in other tools (or maybe just overlooked): by default (though this can be changed if needed), a TM translation unit that has "only" been post-edited from machine translation is penalized since it is assumed to be of lesser quality than if translated from scratch. Smart move.
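The penalty mechanism can be sketched in a few lines -- the origin label and the default penalty value here are illustrative assumptions, not Across's actual settings:

```python
def effective_match(base_score, origin, mt_penalty=1):
    """Return a TM match score after penalizing units that originated as
    post-edited machine translation. The origin label and the default
    penalty of 1 percentage point are illustrative, not Across's settings."""
    return base_score - mt_penalty if origin == "mt-post-edited" else base_score
```

The effect is that a 100% match whose translation unit came from post-edited MT surfaces slightly below a 100% match translated from scratch, so the presumably higher-quality human unit wins.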

Here is my favorite thing that Christian said in our talk. He said, "We've become more humble." For instance, they don't feel the need to go to conferences in parts of the world where they don't have many customers. I like humble.



Free Online Translation Memory Training with SDL

As part of our 35 years of Trados celebrations, we're offering exclusive free translation memory (TM) training with expert SDL Trados Studio trainers.

Take advantage of these sessions designed for both new and experienced translators and project managers interested in learning best practices and how to get the most from a TM.

These sessions are available in 11 languages and take place exclusively in March.

Sign up for your free translation memory training >>


3. MMT

A bit more than two years ago (in edition 281), I said this about ModernMT:

"The other engine that has just gone live (within the LSP Translated's translation environment tool MateCat and the paid Pro edition of the MyMemory app for SDL Trados Studio) is ModernMT. ModernMT is a three-year EU-funded project with a number of partners, including Translated, much like MateCat itself was a few years ago. If you remember, MateCat's original purpose was to 'investigate the integration of MT into the CAT workflow [with] self-tuning MT that adapts MT to specific domains or translation projects and user-adaptive MT that quickly adapts from user corrections and feedback' (Source: Proceedings of the 17th Annual Conference of the European Association for Machine Translation). While the adaptive system worked reasonably well, that part was unceremoniously and frustratingly dropped from MateCat, and the EU agreed to confer another three-year contract. This time the adaptive MT is here to stay, according to Translated's Alessandro Cattelan, whom I spoke to for this report.

"The adaptive part of the technology is fundamentally different from other adaptive engines because there are actually no changes in the baseline engine happening at any time. Instead, the system uses a technology called 'instance-based adaptive NMT.' Similar to KantanMT but for a different purpose, this consists of the translation request first being sent to a TM layer (which can consist even of a relatively small TM as long as it's highly-tuned). With similar segments found in that TM layer, the NMT engine's 'hyperparameters' are adapted on-the-fly so that a more suitable suggestion is generated. This concept is based on this paper by the Fondazione Bruno Kessler, which is part of the consortium working on ModernMT [along with Translated, The University of Edinburgh, TAUS, and the European Commission].

"The benefit -- in theory -- is that you don't ever need to actually train a specific MT engine, but you can instead use a large generic engine whose suggestions are specialized by having the query parameters adapted as the translation is happening."
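The retrieval step of such instance-based adaptation might look like this sketch -- my own simplification using a generic fuzzy-match score; the actual on-the-fly biasing happens inside the NMT decoder and is not modeled here:

```python
from difflib import SequenceMatcher

def retrieve_context(tm, query, k=2):
    """Step 1 of instance-based adaptation: fetch the k TM pairs whose
    source side is most similar to the incoming sentence. (In a system
    like ModernMT, the engine would then be biased toward these pairs
    at query time; that part lives in the decoder and is not shown.)"""
    def score(pair):
        # pair is a (source, target) tuple; compare source to the query.
        return SequenceMatcher(None, pair[0], query).ratio()
    return sorted(tm, key=score, reverse=True)[:k]
```

Because only the retrieval changes per request, the baseline engine itself never has to be retrained for each client -- which is exactly the benefit described above.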

Again, that was two years ago. Now the research part of the project is finished, the tool has been separated into a commercial version and an open-source version (much like MateCat), and the commercial version is available for these language combinations: English <> ES, IT, PT-PT/BR, DE, FR, NL, RU, AR, ZH-CN/TW, JA, CA, and English > BS (Bosnian), HR (Croatian), SR (Serbian Latin), FI, EL, PL, CS, HU, DA, NB/NN (Norwegian Bokmål and Nynorsk), IS, SV, TR, ID, MS, KO, TH. (Forgive me for spelling some of the abbreviations out, but I'm assuming my own ignorance may also be shared by others -- and for BS/HR/SR, see this recent and interesting article.)

The reason why I'm revisiting MMT is twofold. First, they are now offering a product that is usable for individual translators, and second, reader Michael Beijer contacted me after testing it; he was impressed by the product but shocked by the price (which is presently approximately 4 euros per 1,000 words for companies and 89 euros/month for up to 100,000 translated words for translators).

So it's significantly more expensive than its competitors (including DeepL), but Davide Caroselli, the lead developer and product owner, justified it by the high quality of its results and by how the tool is set up (it's adaptive and, because of the way it works -- see above -- you don't need to have multiple engines for different clients; you just need to upload the relevant TMs and either choose at runtime which one to use or have ModernMT guess the best matching TMs).

I was particularly interested in the ModernMT concept of uploading TMs and the resulting issues surrounding their safeguarding. So I asked Davide for a little more detail:

I would like to understand how you safeguard the translation memories and the training data of the individual user. I assume that none of the data that I'm uploading or using is shared with anyone else. Is that correct?

"That's right. The content you upload (TMs and post-edits) is not shared with anyone but you and ModernMT. More in detail, no user can use your data to adapt their translations, and this is of course true the other way around: you will be able to adapt only the data you upload, so there is no risk of "polluted" data coming from other users. The only operation ModernMT can do with data you upload, besides adaptation, is to use that data together with multi-billion-word corpora to train the baseline models once in a while (typically 1-2 times per year) in order to boost the translation quality of the system."

I think this is problematic from the viewpoint of many users (translators, but especially their clients). Now, I understand that in the case of NMT it's virtually impossible to rebuild text from training data, so in reality there is no true privacy concern. On the other hand, this is not what most users (or their legal departments) believe, and that's why systems like Google Translate, Bing Translator, and DeepL have expressly committed to not using the data for training purposes if users use their API. Especially considering that your system is quite a bit more expensive, I think you will experience some backlash there. What do you think?

"You're right, we also found it difficult at first to explain the privacy issue to our customers. As you said, the most important aspect to make clear is that with an NMT model the data used for training is 100% secure and there is no way anybody can reverse-engineer the data by using (or even having!) the model. The other important aspect concerns translation quality. The aim of ModernMT is to provide the best MT possible, and that's why we continuously search for better models or better training techniques; more importantly, we already use the translators' data in order to provide the best translation experience possible, and it is done in a private way. Using data to train the baseline engine once in a while does exactly that: it improves translation quality without harming data privacy in any way.

"On the other hand, we know that there are still cases in which this is simply not feasible, mostly for legal reasons. That's why we also have the on-premise model, where we basically offer our technology hosted on the customer's servers for maximum privacy. This is usually an option for enterprises, of course, given the need to have a proprietary server installation with expensive hardware."

Hmm. I think this is an interesting -- and slightly odd -- choice. From my vantage point (which, like most others, is limited) there was a bit of a relaxation when it came to IP rights with the increasing use of cloud servers and services. This all seemed to change with the enactment of the European GDPR. Case in point: even Microsoft (and DeepL) changed their rules when it came to the professional use of their machine translation services. I think it would be fair to say that Translated and its CEO Marco Trombetti always liked to see themselves as trailblazers when it came to pushing against the norm and breaking down barriers (after all, data sharing is at the very heart of MyMemory). But I wonder whether this might be a major stumbling block at this particular point in time. I suspect it might be, especially given the higher than normal price point (and ModernMT seems to already be experiencing this problem). 



Speed up Your Translation Processes with Across v7

The new version of the Across Language Server and the Across Translator Edition is now available! We have addressed numerous subject areas in order to improve the user-friendliness, to reduce flow times, and to enable new working styles. 

Get your Across Translator Edition v7


The Last Word on the Tool Box Journal

If you would like to promote this journal by placing a link on your website, I will in turn mention your website in a future edition of the Tool Box Journal. Just paste the code you find here into the HTML code of your webpage, and the little icon that is displayed on that page with a link to my website will be displayed.

Here is a reader who added the link last month: 

If you are subscribed to this journal with more than one email address, it would be great if you could unsubscribe redundant addresses through the links Constant Contact offers below.

Should you be interested in reprinting one of the articles in this journal for promotional purposes, please contact me for information about pricing.

© 2019 International Writers' Group    


Home || Subscribe to the Tool Box Journal