The Multiple Birth of Adaptive Neural MT
Amid the rush to embrace neural machine translation (NMT), it has finally
arrived at our -- the translators' -- doorstep in the form of adaptive models.
And at this point, not just one has been released but three, two just
in these past couple of weeks. They really are three different kinds of
models because they each use a very different approach to adapting the
MT output -- so different, in fact, that I wonder whether they should
all be called by the same "adaptive" moniker.
Let's start with the first one, KantanMT's
neural system, which has been in use since March of this year.
As I talked with KantanMT's CEO Tony O'Dowd about this, it became
increasingly clear to me how disadvantaged an MT provider is who does
not also offer a full-fledged translation environment. To put it
differently: There are equally important fields of responsibility on
the side of the translation environment provider and machine
translation provider to make machine translation work smoothly and
productively in a translator's workflow.
Let's back up first, though. Here's how KantanMT deals with neural
machine translation. As with any neural machine translation system, KantanMT's
initial processing requirements to train neural engines are very high.
Tony mentioned that for the first training pass, the processing of a
100-million-word bilingual corpus takes three to four days. It's
overwhelming to imagine that this would have to be done again and again
for an adaptive system. The nice thing about neural machine
translation, however, is that the ongoing training for adapting the
translation model does not require a complete retraining (as it did in
many cases with statistical machine translation, with the notable
exception of Lilt -- see below).
KantanMT has set up the incremental training of the neural MT engines by
collecting data in a TM of sorts (which first is just kept in the
temporary cache of the server but is converted into a file once the
user logs out). At certain intervals (but not every time a segment is
finalized), this TM is fed into the neural engine for adaptive
purposes. In the meantime (and after the passing of the data), the TM
(in KantanMT-speak "Total Recall") is used as a translation memory that
acts very much like most TMs in combination with MT: perfect matches
and any match with a fuzzy match rate of 85% or higher are preferred to
MT suggestions and entered automatically to be confirmed or edited by
the translator. Otherwise, the neural machine translation engine
suggests translations. Since the engine in most cases is solely trained
on the data that the client has provided and is being adapted and
improved as the projects are being processed, there is a relatively
good chance that the suggestions are of reasonably good quality (at
least as far as terminology is concerned).
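The TM-first routing described above can be sketched roughly as follows. This is an illustrative Python sketch, not KantanMT's actual implementation: the 85% threshold mirrors the article, while the similarity function and all names are stand-ins for a real fuzzy-match algorithm.

```python
# Sketch of TM-before-MT routing: exact and high-fuzzy matches win
# over MT output; otherwise the MT engine supplies the suggestion.
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # matches at or above this rate bypass MT


def fuzzy_score(source, tm_source):
    """Crude similarity stand-in for a real fuzzy-match algorithm."""
    return SequenceMatcher(None, source, tm_source).ratio()


def suggest(source, tm, mt_translate):
    """Return (origin, suggestion): TM if a good match exists, else MT."""
    best_src, best = None, 0.0
    for tm_src in tm:
        score = fuzzy_score(source, tm_src)
        if score > best:
            best_src, best = tm_src, score
    if best_src is not None and best >= FUZZY_THRESHOLD:
        return ("TM", tm[best_src])
    return ("MT", mt_translate(source))
```

A perfect match returns the stored translation untouched; anything below the threshold falls through to the engine, which is exactly the behavior described for "Total Recall" plus the neural engine.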
You might have the same question I had: How does this process using a
front-end TM differ from any other translation environment tool? It
doesn't really, if not for the fact that both the MT and the TM are
cloud-based and therefore available to everyone in a real-time
translation team. And while this is also available in other
server-based workflows offered by the various translation environment
tools, more often than not, this is not set up for projects that you
and I work on.
And here is the hitch: While many translation environment tools support the
use of KantanMT, only a handful allow its use upstream (i.e.,
sending data back to the KantanMT cloud so it can then be
used across translators). The ones that do include Memsource, Trados
Studio, and (in a limited fashion) Across. Tony's
(unsurprising) prediction for 2018? "Next year adaptive MT will become
the big thing that will be seen in CAT tools," meaning that tools like KantanMT
will have more complete access.
The main clientele for KantanMT is clearly the translation buyer.
Early on there were plans to also offer products to other stakeholders,
but its easy access to large amounts of focused data made the
translation buyer the primary target. Still, additional features that
would be helpful for translators would be welcome, such as more
interactivity between the machine translation engine and the "Total
Recall" TM engine to "fix" TM matches (with MT) or MT suggestions (with the TM).
The other engine that has just gone live (within Translated's translation
environment tool MateCat
and the paid Pro edition of the MyMemory
app for SDL
Trados Studio) is ModernMT. ModernMT
is a three-year EU-funded project with a number of partners, including
Translated, much like MateCat itself was a few years ago. If
you remember, MateCat's original purpose was to "investigate
the integration of MT into the CAT workflow [with] self-tuning MT that
adapts MT to specific domains or translation projects and user-adaptive
MT that quickly adapts from user corrections and feedback" (Source: Proceedings
of the 17th Annual Conference of the European Association for Machine
Translation). While the adaptive system worked reasonably well,
that part was unceremoniously and frustratingly dropped from MateCat,
and the EU agreed to confer another three-year contract. This time the
adaptive MT is here to stay, according to Translated's Alessandro
Cattelan, whom I spoke to for this report.
You gotta feel for all these MT developers working so hard on the cutting
edge of technology when, just like that, a new and improved technology
enters the field and, alas, all their prior work is ready for the trash
heaps of technology history. I'm not completely sure this is entirely
how it went with ModernMT, but I would guess it's pretty close,
since only this summer they turned their attention to neural machine
translation after having spent more than two years on statistical,
adaptive MT. Amazingly enough, they were still able to present a result
just a few months later.
The adaptive part of the technology is fundamentally different from other
adaptive engines because there are actually no changes in the baseline
engine happening at any time. Instead, the system uses a technology
called "instance-based adaptive NMT." Similar to KantanMT but
for a different purpose, this consists of the translation request first
being sent to a TM layer (which can consist even of a relatively small
TM as long as it's highly tuned). With similar segments found in that
TM layer, the NMT engine's "hyperparameters" are adapted on-the-fly so
that a more suitable suggestion is generated. This concept is based on a paper by the Fondazione
Bruno Kessler, which is part of the consortium working on ModernMT.
The benefit -- in theory -- is that you don't ever need to actually train a
specific MT engine, but you can instead use a large generic engine
whose suggestions are specialized by having the query parameters
adapted as the translation is happening.
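The retrieve-then-bias idea can be sketched like this. To be clear, this is a hypothetical illustration of "instance-based" adaptation, not ModernMT's code: `retrieve`, `translate_adapted`, and the engine interface are all invented names, and the similarity measure is a placeholder.

```python
# Sketch of instance-based adaptation: the baseline engine is never
# retrained; instead, TM segments similar to the query are retrieved
# and used to bias a single translation request.
from difflib import SequenceMatcher


def retrieve(query, tm_pairs, k=3):
    """Rank (source, target) TM pairs by similarity to the query; keep top k."""
    return sorted(
        tm_pairs,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )[:k]


def translate_adapted(query, tm_pairs, generic_engine):
    """One request: bias the unchanged generic engine with retrieved instances."""
    instances = retrieve(query, tm_pairs)
    # The baseline model stays untouched; the bias exists only for this
    # single query -- the "on-the-fly" adaptation described above.
    return generic_engine(query, bias=instances)
```

The key design point this illustrates is that the adaptation lives entirely in the request, which is why even a relatively small but well-tuned TM layer can steer a large generic engine.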
I tried to work with the engine in MateCat after I talked with
Alessandro, but I wasn't able to get any suggestions other than the
ones from Google Translate (you can't actually choose the
engine because the system selects it for you).
I have to be honest and say that I don't completely understand the
concept of this technology, but if it is indeed able to produce better
results, then all power to it. I am pretty sure, though, that
"adaptive" is a bit of a misnomer. Not because the machine translation
suggestion is not being adapted (it is!), but because the process is
different from what is typically understood as adaptive MT. Maybe a
term like "responsive," "reactive," or even "ad-hoc-tuned" might be more fitting.
ModernMT is completely open-source (in fact, you can download everything from
its website), but that does not include the data, of course. This is
where Translated's hyper-TM MyMemory comes in
and where the company sees a possibility for itself to market this
solution -- to translators and LSPs as a paid service through MateCat,
Trados, and possible other translation environments
and to large translation buyers as an in-house solution. (According to
Alessandro, at least one large Silicon Valley company has already done
some extensive testing and found the results better than those from Microsoft
Translator Hub, the trainable machine translation engine Microsoft offers.)
I also asked Alessandro how the non-adapted baseline engine compares with
the Google Translate NMT engine, and he was honest enough to
say that it produces worse results -- unless the adaptive process is
taking place, in which case it's better.
Finally, Lilt has also released a
neural MT engine (for now only between English and German and English
and Chinese and only for select users -- see more on this below). Since
the team at Lilt consists of remarkably young and energetic folks, they
not only introduced their new engine but also completely converted the
user interface, oh, and rebranded (see below as well).
Oh yes, they also settled the silly lawsuit SDL had brought against them
(see issue 273 of the Tool Box Journal).
(The very non-communicative and, it seems, legally prescribed official
statement from Lilt: "Lilt is pleased with the resolution of this
dispute. The settlement was mutually agreeable. We will continue to
focus on products and services that democratize access to information.")
I was really intrigued and -- I admit it -- at first puzzled about
Lilt's move toward neural machine translation. Here's why: Lilt
essentially broke down all the borders that existed between the
different assets translators use by creating one single database for
all the data used by the (at that point, statistical) machine
translation, the translation memory engine, and the automated term
extraction/termbase engine. This was made possible because the data was
sitting in an open text-based table that could be utilized for any of
these options. Even though I'm not super-technical, I also knew that
wasn't possible when it came to neural machine translation, where the
translation model really consists of numbers rather than language data.
So my confusion had to do with me wondering whether Lilt had thrown all
of its basic concepts overboard.
Turns out it hasn't, at least not completely. As in the previous incarnation,
there is still a massive text-based table containing data that the TM
and the termbase features are based on and that is being used to train
the neural machine translation engine initially and continually. (The
difference involves a different level of immediacy of the data in the
data table and how it's used for machine translation purposes.)
The NMT engine is trained on the same bilingual data as the previous SMT
engine (that's at least true for the four existing language
combinations) minus the monolingual data needed for SMT but not for NMT.
The engine interacts with the user as he or she translates through the same
process as previously: Every time a word is entered, the machine
translation engine calculates a new suggestion for the rest of the
segment to match the entered data. (Lilt's Carmen Heger, whom I talked
to about this, mentioned a barely noticeable increased delay in the
response. She was right, it's hardly noticeable; in fact, I didn't notice it at all.)
Just like before, every time you confirm the translation of a segment, the
machine translation data model (which now runs parallel to the data
table) is immediately adapted -- in fact, much more efficiently than
before. And just as in the case of KantanMT, while the initial
training and so-called batch updates (like when you upload a whole
translation memory) are very resource- and processing-intensive,
incrementally adding to the neural engine is not, allowing it to be
done on regular CPU servers (vs. the more powerful GPU servers).
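The word-by-word interaction described above amounts to prefix-constrained suggestion: whatever the translator has typed is fixed, and the engine proposes a completion for the rest. Here is a deliberately simplified sketch; a real decoder like Lilt's re-searches for the best continuation, whereas this stand-in just splices the draft's remaining words onto the confirmed prefix.

```python
# Sketch of interactive, prefix-constrained suggestion: the confirmed
# prefix is kept verbatim and the rest of the segment is re-proposed.
def complete(prefix, engine_draft):
    """Suggest a full segment that preserves the translator's prefix."""
    if engine_draft.startswith(prefix):
        return engine_draft
    # Stand-in for re-decoding: keep the draft's remaining words after
    # the number of words the translator has already confirmed.
    keep = len(prefix.split())
    rest = engine_draft.split()[keep:]
    return " ".join([prefix.rstrip()] + rest)
```

Each time the translator enters another word, `complete` would be called again with the longer prefix, which is the loop that makes the suggestions "constantly evolving."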
So what about the quality of the suggestions? One important consideration
when answering this question is that with Lilt it is not
primarily the initial suggestion for the whole segment that has to be
useful, as long as the constantly evolving suggestions for each segment
are useful. In my very limited and unscientific testing in EN>DE,
the initial suggestions seemed OK but tended to be less accurate than Google
Translate or DeepL (for test purposes I usually copied and
pasted the same segment in the other interfaces). But when I diverged
from the original suggestion they became ever more helpful, and tended
to be more helpful than in the earlier SMT-based Lilt.
Here is a bit more scientific data: Lilt ran a case study with one of their
corporate partners comparing the statistical vs. neural quality by
professional translators. The results: NMT was chosen 44% of the time,
SMT was chosen 29% of the time, and 27% of the time it was a tie.
Since we're talking about corporate clients, you might have seen that Lilt
has also done some rebranding with the slogan, "The New Engine for
Enterprise Translation Workflows." What does this mean for translation
professionals? Not much as far as the availability
of the tool is concerned, but it does reveal that we haven't been as eager to
switch to a new technology paradigm as some might have expected. And
that in turn may point to something more profound (forgive the
over-generalization): Traditionally translation professionals have been
very slow to accept technology. But once they
accepted translation technology as a normal part of their work
environment, they were (and are) just as slow to change their chosen
technology (which was so hard to accept in the first place). Maybe it's
just the human condition, but I think what we might not completely
understand is that being technologically adept by definition means a
constant willingness to change to new technology as it becomes
available. And I'm not saying this because I think Lilt is the one and only
kind of technology out there. It isn't. There are plenty of interesting
new tools and technologies. And if you honestly look at yourself, are
you completely open to continuously moving on? (I have to admit that
I'm not, and I also know that I would be a better translator if I were.)
To finish this up: as mentioned above, Lilt also has a completely new
interface as you can read in this
document. I like that there's no separate editing interface, it's
now very easy to switch between a vertical and a horizontal display,
and there is a better sense of accessibility -- even as far as
accessing available options and keyboard shortcuts.
I did not like the "on-demand tag editor," which, while a good idea in
general, did not work as seamlessly as it should have.
Oh, and I forgot to mention that the neural mode is at this point available
only for corporate clients (and LSP clients on demand), but it will be
available for all sometime in the new year with all language combinations.
Aside from LinkedIn, Twitter is my preferred
social network, and it can also be very productive -- productive for
networking with colleagues, for meeting clients, and for publicly
displaying who you are and what you stand for. In relation to other
forms of social media, it's also less unproductive because it doesn't
necessarily require you to read never-ending posts (though it might
lead you to some of those), and you're supposed to express yourself in
a relatively precise manner (though this brevity has been a little
diluted by the recent extension to 280 bytes, i.e., 280 single-byte and
140 double-byte characters per tweet).
Looking at some colleagues' tweets, I've noticed a couple of tips that
might be worth repeating:
- Don't start a tweet that is supposed to be seen by everyone with @username,
because this will be displayed on Twitter's homepage only for username plus
everyone who follows both username and
you. Enter a period (or some other character) before the @ sign. The
same is true if you reply to someone.
- If you
feel like tweeting about private things as well as professional things,
set up two different accounts. Really! I immediately unfollow other
tweeters who start to regularly report on personal matters -- and while
it's not important what I do, many of those who you really want to
reach are doing the same. It's not enough to say that you don't promote
your Twitter account anywhere in your professional materials -- as soon
as you participate in a professional discussion, you are effectively promoting it.
- If you
get frustrated with someone's barrage of tweets but you don't want to
frustrate them by unfollowing them, you can also "mute" that account by
clicking the down-arrow at the top of one of their tweets and selecting Mute.
- Give credit where it's due. If you find something through someone else's
tweet or email, mention that in any related follow-up tweet you might
send yourself. If you post a link to an article that someone
interesting wrote, research his or her Twitter handle and mention that
in the tweet as well. And if you're retweeting someone but have changed
some of the content, precede it not with RT (retweet) but with MT
(modified tweet) or HT (hat tip).
- Use #hashtags judiciously, not to #the point that #your #tweets are really
#hard to #read. It's probably a good idea to set a hashtag in front of
a central term (like #translation
or #xl8), but I even prefer not to do that. Instead, I really enjoy
using the hashtag as a way to explain or comment on my own tweet.
Here are some other helpful tips:
- It's a
good idea to download your archives of tweets every once in a while --
especially if you tweet a lot. This will help you to quickly locate a
tweet you might have sent a long time ago. You can do that under Settings >
Account > Request your archive.
- You can
also use helpful keyboard shortcuts in the non-mobile Twitter
interface. To see all of them listed at once, just press the question
mark key when you're in the Twitter web interface. (Cool, huh?)
Of course, the best way to (legitimately) gain more followers is to post
interesting and engaging tweets. If you want to speed it up a little
bit, here are a couple of ways to follow others so that you engage with
them and hope that they will follow you back:
- When there are translation-related conferences happening (and they're always
happening somewhere...), click on the conference hashtag to see who is
tweeting from the conference. Not only will you get a good (and cheap)
overview of what's happening at the conference, you can follow the
tweeters and maybe even engage them in a conversation while staying
under the hashtag umbrella.
- Follow tweets under translation-related hashtags such as #xl8 and consider
following the tweeters you find interesting.
- Since it's not possible to format text in tweets the conventional way, you
can perform a Unicode conversion. This
converts your text into formatted look-alike characters. The drawback
is that this text is not searchable.
- Also not searchable but helpful for text that is too long to tweet: make a
screenshot of the text and embed it as a graphic in a tweet.
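The Unicode conversion trick mentioned above typically maps ordinary letters onto the Mathematical Alphanumeric Symbols block, which most clients render as bold. Here is a minimal sketch of one such mapping (bold only; real converter sites offer italic, script, and other styles as well):

```python
# Map A-Z/a-z onto the Mathematical Bold letters starting at U+1D400.
# The result looks formatted but is no longer plain, searchable text --
# exactly the drawback noted above.
def to_math_bold(text):
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits and punctuation pass through unchanged
    return "".join(out)
```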
An additional benefit of an embedded graphic is that, unlike hyperlinks,
it doesn't count against the character limit. Plus you can tag up to 10
Twitter users in a graphic to involve them in a discussion.
- If your
tweet links to a source in a language other than the one you usually
tweet in, be sure to mark it with [SV] or [ZH]...
- Don't use services like "The Daily ... is Out" (paper.li) or scoop.it. This
is annoying for followers since it requires more clicks. Also, don't
have services like commun.it or SumAll.com tweet how many retweets you
had and how many followers you gained last week. No one is really interested.
- If you auto-tweet from LinkedIn, Xing, or Facebook and have part of that tweet
quoting from an article or blog post, be sure to offset your quote as such.
- When tweeting a link to a news article, research the Twitter handle of the
article's reporter to include it. Do that especially when you're
critical of the article. This can lead to great discussions that might
help you and others within the world of translation.
- When tweeting a link to a news article, consider making the headline in the
tweet descriptive rather than using the original title, which is often
formulated as click-bait.
- To do
an advanced search in Twitter, you can go to twitter.com/search-advanced. Or
you can use the Twitter-specific syntax in Twitter's regular Search
field. Helpful search parameters include from:user and to:user to
search for sender and recipient (if applicable) respectively.
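The `from:` and `to:` operators above are just text you type into the search field, so composing a query is plain string work. This tiny helper is purely illustrative (the function and parameter names are invented); only the operator syntax comes from Twitter:

```python
# Compose a Twitter search query using the from:/to: operators, which
# work both in the regular search field and in advanced search.
def build_search(terms=(), from_user=None, to_user=None):
    """Return a query string such as 'memoQ from:someuser'."""
    parts = list(terms)
    if from_user:
        parts.append("from:" + from_user)
    if to_user:
        parts.append("to:" + to_user)
    return " ".join(parts)
```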
- If you
want to shorten a URL in a tweet, use xl8.link as a
URL shortener to show some pride in being a translation professional.
And you don't just gain colleagues as followers but also potential clients.
Think of Twitter as your interactive, ever-evolving business card.
You'll be glad you did.
The Tech-Savvy Interpreter: Interpreting and the Computer? Translating
and the Computer 39 (Column by Barry Slaughter Olsen)
I recently returned from London where I attended the 39th
installment of Translating and the Computer (TC39), a yearly scientific
conference organized by the International Association for Advancement
in Language Technology, or ASLING for short. Translating and the
Computer has been around since 1978 and will celebrate its 40th
anniversary in 2018. The original organizers were prescient indeed and
clearly foresaw the potential of the computer to assist with translation.
The conference brings together researchers, academics, and translators (and
this year interpreters too) for two days of learning and discussion
about the latest developments in machine translation, corpus
linguistics, translation memories, and other technology topics
affecting translation. As one might expect, the big focus this year was
on neural machine translation (NMT) -- how it works and how it might be
folded into the translation workflow.
For interpreting and technology, however, TC39 was a watershed. The
organizers decided to include a track on interpreting and the computer
for the first time in the history of the conference. They even
dedicated the Friday keynote address to interpreting and the computer -- a
speech by Dr. Alexander Waibel on the current state of speech-to-speech
translation and the use of deep neural networks for speech recognition
combined with machine translation. The interpreting track ended with a
panel discussion on "New Frontiers in Interpreting Technology"
moderated by Professor Danielle D'Hayer from London Metropolitan
University. Panelists included Dr. Anja Rütten, Alexander
Drechsel, Joshua Goldsmith, Marcin Feder and yours truly.
I could go on and on about the presentations, from what kinds of tablets
are being used by professional interpreters in the field (Joshua
Goldsmith), to an overview of terminology management tools for
conference interpreters (Anja Rütten), to automatic speech
recognition (ASR) and how it is being combined with terminology
management software to provide automated terminology lookup in the
booth (Claudio Fantinuoli). But I won't. Instead, if you are interested
in the topics covered at TC39, I encourage you to read Alexander Drechsel's
November 21st blog post reporting on TC39. It's full of detailed
information about each of the presentations and even includes a pair of
videos we shot live from the conference.
Here's to hoping this initial foray into interpreting territory by ASLING is
the beginning of a lasting effort to provide a forum for research and
discussion not only on translating, but also interpreting, and the
computer. If you are interested, mark your calendars now. The 40th
anniversary edition of Translating and the Computer (TC40) will take
place on November 15-16, 2018 in London.
Do you have a question about a specific technology? Or would you like to
learn more about a specific interpreting platform, interpreter console
or supporting technology? Send us an email at firstname.lastname@example.org.
Interpreting at GALA 2018
Interpreting 5 will take place in Boston, USA on March 13-16, 2018, as
part of GALA's 10th Anniversary Conference.
This 'n' That
A book that is most certainly going to be helpful for novices to the
world of translation as well as to the majority of more experienced
colleagues is Language of
Localization, edited by Kit Brown-Hoekstra. It gives a great,
two-page-each overview of 52 topics (by 52 different authors) that are
relevant to the modern translation and localization workflow.
You might remember that I strongly encouraged tool vendors a while back to
offer switchable access to either the statistical machine translation
or the neural machine translation engine of Google Translate. A
number of them obliged, with the latest being Okapi.
(If you don't know much about Okapi, and especially its
flagship tool Rainbow, there's going to be an excellent article
about it in the next ATA Chronicle. You don't even have to be
an ATA member to read it: Its online edition is available right here. While
you're there you can also read the interesting current article about OmegaT.)
A tool that had this option much earlier than most and should have been
mentioned by me is Smartcat,
the translation environment tool that was originally developed by ABBYY
and is now independent. The way Smartcat handles machine
translation is a little different than other tools. In other tools --
and depending on which machine translation engine you use -- you either
have to enter your API key or information about the server and the
respective authorization to access it. Depending on your use, you then
have to pay the machine translation provider. Smartcat, on the
other hand, acts as a reseller between you and the MT provider, so you
pay a usage fee to Smartcat unless, and this is important to
realize, you agree to send your edited data back to Google, Microsoft,
or Yandex (the three providers that are supported). Clearly this is not
a good option for many translators, so you have to really understand
which option you choose and what that means if you are using Smartcat.
A new option in Smartcat starting next month will be Intento, a relatively
new Russian company that is primarily geared toward larger translation
buyers but will make its foray into our world starting with Smartcat
(other translation environments will follow soon). Intento is
somewhat similar to Fair Trade Translation
(see issues 254 and 263 of the Tool Box Journal). It's a
tool that allows you (better: will allow you) to get an automated
estimate of which machine translation engine suits your present project
best and at what cost. The engines include engines by SAP, SDL, PROMT,
IBM, DeepL, Microsoft, Baidu, GTCOM Yeecloud, Google, Yandex, and
SYSTRAN. It remains to be seen whether this is usable from a
translator's perspective, especially when it comes to issues like
confidentiality. (I'll look into that more deeply once there are more
connectors to other tools.) Still, it was very interesting to talk with
Konstantin Savenkov, Intento's CEO, earlier this week. Here is one
insight that he shared. As long as English is either the source or
target language, there is a relatively high likelihood that Google
Translate (or DeepL for the languages that it covers) will
come out on top by Intento's automated assessment. As long as
that is not the case, all bets are off. Why? Because the system then
uses English as the "pivot language," so there are two machine
translation processes taking place (in, say, Chinese to Indonesian it
would be Chinese to English and then English to Indonesian).
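The pivoting logic Konstantin described can be sketched in a few lines. This is an illustrative composition only; `pivot_translate` and the `translate` callable are invented names standing in for real engine calls:

```python
# Sketch of pivot translation: if English is involved, translate
# directly; otherwise chain two MT passes through English, which is
# why quality predictions become much less reliable in that case.
def pivot_translate(text, src, tgt, translate):
    """Direct pass if English is source or target, else pivot via English."""
    if "en" in (src, tgt):
        return translate(text, src, tgt)
    intermediate = translate(text, src, "en")  # e.g., zh -> en
    return translate(intermediate, "en", tgt)  # e.g., en -> id
```

The design cost is visible in the structure: every non-English pair pays for two translation passes, so errors from the first pass are baked into the input of the second.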
But more on that once it's more widely available.
Eby recently did an interesting survey on the use of monitors among
translators that some of you will find interesting. You can find it right here.
The Last Word on the Tool Box Journal
If you would like to promote this journal by placing a link on your
website, I will in turn mention your website in a future edition of the
Tool Box Journal. Just paste the code you find here into
the HTML code of your webpage, and the little icon with a link to my
website will be displayed on that page.
If you are subscribed to this journal with
more than one email address, it would be great if you could unsubscribe
redundant addresses through the links Constant Contact offers below.
Should you be interested in reprinting one of the articles in this journal for
promotional purposes, please contact me for information about pricing.
© 2017 International Writers' Group