The Tool Box Journal

 A computer journal for translation professionals


Issue 17-6-275
(the two hundred seventy-fifth edition)

Contents

1. Mushrooms After a Drought

Dear Basic Subscribers . . .

2. Microsoft's Communication

3. The Tech-Savvy Interpreter: Training Interpreters Online -- Taking Stock after One Year

4. Localization Essentials

5. Confessions

6. My Journey into "Neural Land" (by Terence Lewis) (Ⱦ)

7. New Password for the Tool Box Archive

The Last Word on the Tool Box

Doodles

No translators or interpreters worth their salt will have missed the announcement last month that the UN finally and officially declared September 30 as International Translation Day. Particularly not translators from Azerbaijan, Bangladesh, Belarus, Costa Rica, Cuba, Ecuador, Paraguay, Qatar, Turkey, Turkmenistan, and Vietnam, who should be proud of their countries for being signatories to the resolution.

International Translation Day has been promoted by FIT, the International Federation of Translators, since 1953, and in 1991 a push began for external recognition such as the one that has now been conferred.

In my opinion, another impactful milestone would be the creation of a Google Doodle to celebrate the occasion on September 30 this year. I can think of a million reasons -- billions, in fact -- for Google to honor that day: one for each of the words we translated that are now used by Google Translate, for instance. I've encouraged you in previous years to join me in asking for this, and I would love for you to do it again, maybe even more strongly this time. This newsletter is read by about 12,000 translation professionals -- can you imagine if every one of us contacted the Google Doodle team?

Don't know how? Well, their email address is proposals@google.com. Or you could retweet this tweet, send one of your own, use other social networks, contact your favorite person who works for Google, or . . .

Because remember, we're the ones who lucked out. Unlike poor folks like Meryl Streep who "wanted to be a translator at the UN and help people understand each other." And what happened to her? She just became a -- what was it again? Right, I forgot -- I think she's an actress now.

ADVERTISEMENT

memoQ 8.1 is now available

New PDF import. New Find/Replace. Better preview. And many more.

Download. Install. Enjoy working with memoQ.

www.memoq.com/downloads

 

1. Mushrooms After a Drought

Lately I've written quite a bit in the Tool Box Journal about project management and invoicing tools for freelance translators, tools that seem to have cropped up like mushrooms after a night of rain. This is actually not a very good analogy, because they've come more in response to a drought on the part of the market leader AIT, whose Translation Office 3000 (TO3000) had grown rather outdated.

Well, that drought has come to an end with the brand-new TO3000 3D. (Of course, the problem for AIT now is that all those other "mushrooms" won't just go away.)

Before I share some of the impressions I had after using the new version for a few days, let me tell you a bit about a conversation I had with Vladimir Pedchenko, AIT's CEO.

There were many reasons for the "drought," according to him, including the move to a completely new development platform and underlying database engine (MS SQL Express instead of the open-source Firebird SQL), a complete redesign of the interface to make it touch-friendly and modern-looking, and the forking off from the development of AIT's product for translation agencies, Projetex (previously TO3000 was essentially a limited version of Projetex).

In my opinion, though, an equally important reason for the delay was a change of vision at AIT. There was a time when they released several completely new products a year, not to mention upgrades to existing ones. They were the ones who essentially invented project management software for the freelance translator -- reflected in a proud 5,000+ licensed copies of TO3000 -- and envisioned eventually bringing all their products into one large combined product that would do everything, including translation memory and terminology management, time and volume tracking, and so on and so forth. While this sounded attractive, it didn't happen, for two reasons: it ignored the realities of a multi-faceted market in which freelance translators are only one, albeit important, constituency; and their actual translation solution, AnyMem, was not truly innovative (according to Vladimir, it was more of a simplified version of the old pre-Studio Trados).

Add to that the turmoil in Ukraine (AIT is Kiev-based) as well as within the company (I previously reported about that in connection with Technolex), and things did not move as quickly as they could have.

With that said, the new product looks good, seems to be stable after a round of alpha and beta testing, and will delight existing users of TO3000 versions 11 and earlier.

I already mentioned the more modern-looking interface. It includes the ability to work in several windows at the same time (making it a lot easier to compare data entered for different clients, for instance), it's optimized for those who like to work with Windows-based touch screens (yes, only Windows is supported), and it now offers email integration, file management, and FTP integration. I'm not sure whether it's the new ribbon-based interface or the horizontal rather than the previous vertical layout, but it's a lot easier to work in -- and I imagine that this is true not only for upgraders but also for new users.

One thing that made earlier versions of TO3000 stand out was the integration of several other AIT products, including the sophisticated word counting tool AnyCount. This is still the case, and I encouraged Vladimir to bring the time-measuring utility ExactSpent into the mix as well -- I wouldn't be surprised to see that soon.

The upgrade process worked smoothly and quickly (though I wish the "Data Import Utility" had started automatically to port data over from previous versions). On the other hand, startup and shutdown speed of the tool seems excruciatingly slow (I assume this has to do with the need to start and shut down the SQL Server program), and I'm a little frustrated that the rather simplistic and very static currency conversion still doesn't dynamically adapt itself to current rates.

Still, I'm glad I've upgraded, and I will continue to use TO3000 as my invoicing and business reporting tool for the foreseeable future.

Vladimir asked me what he could do to get the word out -- especially to those who are not yet familiar with TO3000 -- and I encouraged him to make a special offer to Tool Box Journal readers. That he did, with a generous 50% discount right here. And if you as a Tool Box Journal subscriber have already purchased a new license at a lesser discount, write to support@to3000.com and they will make an adjustment.

 

ADVERTISEMENT

crossMarket Job Board Is Coming Soon

Faster translator search, hassle-free access to exciting translation jobs: you will soon be able to find suitable matches in the crossMarket job board.

We are already more than 10,000 members worldwide. Join us now: www.crossmarket.net

 

Dear Basic Subscribers . . .

I've had two related thoughts on my mind lately: 1) I'm tired of cutting articles from the Basic edition of the Tool Box Journal, and 2) your colleagues who own the Translator's Tool Box ebook are hooked on it, as proven by their willingness to purchase it again and again each time a new update is available. How do those two thoughts fit together? Well, you might notice that you're receiving a full Tool Box Journal today (and a long one at that). Along with that gift, I'd like to sweeten the pot with this offer: Continue to receive the full Premium edition and purchase the Translator's Tool Box ebook through the rest of June for only $30 (rather than $50 -- just enter 30 as the amount to be paid). That price includes a free annual subscription to the Premium Journal, and if you look at the ebook and don't like it, just send me a note and I'll reimburse you the $30 -- no questions asked. You can even keep the book (and the subscription).

Let's see whether I can get you hooked as well >:D  (My kids tell me that's supposed to be the evil laugh emoji.)

 

2. Microsoft's Communication

The communication team for translation at Microsoft does not always act in translators' favor. Think, for instance, of then-VP Gurdeep Singh Pall, who called Skype's auto-interpreter "pre-beta of magic" in 2014. Or of Director of Product Strategy and Marketing Olivier Fontana, who just a few weeks ago called a new voice-to-text translation feature of PowerPoint something "straight out of 'Star Trek.'" Those are exactly the images and concepts that don't work in our favor (nor in Microsoft's, I think).

On a different level, but equally unhelpful, is Microsoft's lack of communication when it comes to the availability and status of its Bing Translator. While there was a lot of fanfare around Microsoft's move into neural machine translation for voice-to-voice translation, there has been little publicity around the fact that neural machine translation has also been available -- optionally -- for text-to-text translation through the API (for translation environment tools) since last October, or that since April 28 it has been the standard for regular web-based text translations between English and Arabic, Chinese (Mandarin and Cantonese), French, German, Italian, Japanese, Portuguese, Russian, and Spanish, as well as between Japanese and Chinese (aside, that is, from this announcement about Asian languages).

That's why I'm grateful to Chris Wendt of Microsoft's machine translation group. Chris has been very willing to engage -- if prodded just a little -- and share what needs to be shared.

So: For any translation environment tool that supports the new Azure-based Microsoft API (I reported on that in edition 273 of the Tool Box Journal) -- and, mind you, the majority of tools are not yet equipped to handle the new API -- you can choose between the statistical machine translation engine that Microsoft provides and the neural engine. You choose the neural engine by entering generalnn as the "Category" in the provided field -- or the statistical engine by leaving it alone. Here is what enabling neural machine translation looks like in two tools that actively support it, Déjà Vu and Memsource:

[Screenshot: the Category field set to generalnn in Déjà Vu and Memsource]

If you leave the Category field empty, you'll access the SMT engine (you might have to close your tool and open it up again to see the change take place -- at least that's what I had to do). Of course, this works only for the language combinations mentioned above. (If you're unsure which of the two engines you are "getting," you can always check right here.)
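For the technically curious, here is roughly what a tool does on your behalf when you fill in that field. This is just a sketch -- it assumes the v3 REST endpoint of the Translator Text API and a placeholder Azure key, and endpoints and parameter names have shifted between API generations -- but it shows where the category setting actually goes:

curl -X POST "https://api.cognitive.microsofttranslator.com/translate?api-version=3.0&from=en&to=de&category=generalnn" \
     -H "Ocp-Apim-Subscription-Key: YOUR_AZURE_KEY" \
     -H "Content-Type: application/json" \
     -d '[{"Text": "This sentence should come back from the neural engine."}]'

Leave out the category parameter and the request falls back to the service's default engine -- the statistical one, in the setup described above.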

(Not that this makes much difference to you, but in my limited tests I actually found the statistical machine translation engine more reliable than the neural engine, even though the latter "reads" much better.)

Oh, and in case you wondered: Unlike Google, which (assures us that it) is not using your data if you use the translation API, Microsoft (still) does, so for many of you this really might be a piece of not-applicable-news.

 

ADVERTISEMENT

Need a system to manage your translation agency? Try Protemos!

Clients, vendors, project management, invoices and payments, business reports and more...

Free for freelancers, 3-month trial for agencies -- try it now

 

3. The Tech-Savvy Interpreter: Training Interpreters Online -- Taking Stock after One Year (Column by Barry Slaughter Olsen)

Be sure to watch this month's Tech-Savvy Interpreter video.

I recently finished my first year of teaching interpreting online for the Middlebury Institute of International Studies (MIIS). All told, I spent over 450 contact hours teaching both consecutive and simultaneous interpreting and some sight translation remotely. It has been an amazing learning experience for me and for my students. In September 2016, in the Tech-Savvy Interpreter, I presented a number of technologies I use to teach consecutive interpreting via video conference. In this edition, I'd like to share some initial lessons learned -- three methodological and three technological.

Methodology

Let me start with methodology because no amount of cool technology will make a poorly designed class successful. In fact, if not appropriately applied, technology becomes a distraction.

My first conclusion? Well, after a year of teaching online, I can say without a doubt that interpreting can be taught and learned successfully online, but to be done well, online teaching requires discipline and, initially, a lot of additional preparation time. The need to maximize every minute of synchronous online time with my students has forced me to give greater order and structure to my teaching and to carefully define my learning objectives for each class. Not that my classes were unstructured in the past, but when teaching remotely, wasted seconds can seem like minutes, and minutes like an eternity. So doing all you can to squeeze every bit of usefulness out of your synchronous class time is crucial.

Second, converting my usual in-class "lecture time" to online videos for watching outside of class helped maximize face-to-face or synchronous instruction time for live interpreting practice and feedback. In eduspeak this is known as "flipping the classroom" (see "7 Things You Should Know about Flipped Classrooms"). Providing short video lectures is more demanding and time-consuming than I initially thought, but the dividends for student learning in interpreting are clear. Producing the videos, even if they are simple and of relatively low production value, takes a lot of time up front. As any educator knows, curriculum design is a demanding and time-consuming process. This month's Tech-Savvy Interpreter video includes one of my short video lectures from an introductory consecutive interpreting course. For these videos I use Panopto, an enterprise-class video management platform provided by MIIS (thanks!). But this is not required; you could easily use a consumer-level video platform like YouTube or Vimeo to much the same effect.

My last observation on online teaching methodology is that, ironically, some things are easier when working remotely. As I'll explain next in the Technology section, software-defined videoconferencing in the cloud has made some huge leaps forward in the last year. Video and audio recording (including multiple files from one videoconference session) is now as easy as pressing a button. This means that remote interpretation testing is a snap. The videoconferencing platform I switched to in January, Zoom Video, allows me to share videos on screen along with my computer audio for testing. Or I can bring in a live speaker from anywhere in the world to give a speech. Individual student exam recordings (both video and audio-only) are available for download from the Zoom Video platform, or I can choose to record them directly to my computer's hard drive. The simple result is a lot of time and headache saved.

Technology

And now for my three technological observations.

First, as alluded to in my last point, there is a revolution going on right now in videoconferencing. Cloud-based platforms are making videoconferencing easier and much less expensive. Until January, I had been using an enterprise-class videoconferencing solution by Polycom to teach my consecutive interpreting classes. I switched from Polycom to Zoom Video for the spring semester, which improved the experience for me and for my students. Connecting is easier, and I now have an entire suite of collaboration tools that I can use to teach (e.g., screen sharing, whiteboard, video and audio sharing), all on one platform. One unexpected benefit of using a cloud-based platform was that I was able to offer some much-needed flexibility to a student who needed to attend class from home a couple of times during the semester.

Second, ease of use is a big factor for any technology you want to employ in online teaching. Getting on and off a platform needs to be as simple as possible. In order to teach interpreting (both consecutive and simultaneous) I have had to employ multiple conferencing platforms and technologies. Each new platform or technological solution introduces complexity into the teaching equation. So, the technology must have a clear pedagogical purpose. In addition, you must factor in the time at the very beginning to get every student up and running and comfortable with each technology solution you will be using. If you don't, this can eat into your synchronous class time during the semester -- something you definitely want to avoid.

Third, not everything is perfect with the remote teaching model. I'll be the first to admit that there are some things that would just be easier if I were physically in the same space as my students and fellow interpreting professors. Some students were flustered when the technology did not perform perfectly, which is understandable. Ironically, however, I interacted more with my students using videoconferencing, WhatsApp, and online journaling this academic year than I have face to face in past years on campus. Even so, I think it is imperative to establish a good student-professor relationship from the beginning. This is why I still travel to Monterey at the beginning of each academic year to meet with my students and teach the first two classes face to face. I then return to teach two or three more times throughout the academic year. So my teaching model is actually blended (about 80% online and 20% face to face).

Conclusion

These are only a handful of the many insights I have gained over the last academic year as I have taught interpreting online. In my mind, the advantages clearly outweigh the disadvantages. The trend toward more online interpreter and translator training is clear. The number of online course offerings is growing quickly, and well-prepared interpreter trainers are now offering individualized and group instruction online. I expect this trend to grow given the huge need for interpreter training all over the world and the inability of current academic models to fill that need completely.

Do you have a question about a specific technology? Or would you like to learn more about a specific interpreting platform, interpreter console or supporting technology? Send us an email at inquiry@interpretamerica.com.

 

ADVERTISEMENT

Linguali is the next generation mobile-based interpretation solution

Interpreters use a PC laptop as a console while delegates listen to the interpreters on their own smartphones. Linguali works standalone or integrates with a sound desk, with high-quality full-band sound for both interpreters and participants. Linguali is mobile, scalable, and smart. It deploys in minutes for up to 60 participants.

Learn more and download your free trial at linguali.com.

 

4. Localization Essentials

Google has released a free six-part online course called Localization Essentials. Those of you who have been around the block won't particularly benefit from it, but if you're new to the field, you might find it interesting (plus, they're YouTube videos, so you can play them at double speed). Naturally you'll find some places where you'll disagree -- when it comes to pricing, for instance, or the exclusive focus on Google Translator Toolkit as a translation environment tool (I'm not sure why they still use the term "CAT tool" in the course . . .) -- but if you're able to filter out those shortcomings, you might benefit from it. The six modules are Key Concepts, Content Types, Profiles & Skills, Processes, Tools, and Glossary & Resources.

 

ADVERTISEMENT

Whether you're new to translation and localization technology, or experienced in both, SDL Trados has a huge library of free resources to help you develop your careers and businesses.

See a full list of all educational resources.

 

5. Confessions

No, not the ones from Augustine, but very heart-wrenching all the same and likely to cause mixed feelings for some of you. A translator friend who translates between two European languages with typically average results for (statistical) machine translation (better than, say, EN<>DE but worse than ES<>FR) and who wants to stay anonymous (which really is part of the story but which I will otherwise leave uncommented) shared this with me after the last Tool Box Journal:

"I am getting old, I am and always have been a lousy typist, I am a slow, deliberate translator, I hate Trados, on which I have spent way too much money and grief throughout the years. So, I started using Google Translate some years ago, and in the type of stuff that I do, usually medical journal articles and medical reports, I discovered that in some cases it did better than I could, plus I did not have to type it from scratch. Now, several years later I am still using this method, but I have noticed that it has gotten increasingly better. It now transposes phrases and handles syntax a lot better -- at times even perfectly. So, in a nutshell, MT has made me a better and a faster translator. I now mess around in my garden during the day and do my translations after supper and I can do 2 to 3K words by midnight. 

 

"So, I am a believer and you won't have to convince me, but this is still not something I can tell my colleagues about, as it feels a little bit like cheating. But then again, I am the one who decides what is correct, and that is because of my knowledge of the two languages."

 

ADVERTISEMENT

Leave the office 20 minutes earlier today!

Using MindReader for Outlook, you can write e-mails more quickly and more consistently.

Watch the short video for more information on MindReader for Outlook functionality and usage: https://www.youtube.com/watch?v=YAPLHSvVrBc 

 

6. My Journey into "Neural Land" (by Terence Lewis) (Ⱦ)

A little more than a year ago, in the 260th edition of the Tool Box Journal, I published an article about Terence Lewis, a Dutch-into-English translator and autodidact who took it upon himself to see what machine translation could do for him beyond the generic possibilities out there. He taught himself the necessary programming from scratch -- once for rule-based machine translation, again when statistical machine translation came into vogue, and, you guessed it, once again for neural machine translation. I have been and still am impressed with his achievement, so I asked him to give us a retelling of that last leg of his journey. (And by the way, the Ⱦ in the title stands for "technical.")

It all started with a phone call from Bill. "B***dy hell, Terence," he shouted, "have you been on Google Translate recently?" He was, of course, referring to Google's much-publicized shift from phrase-based statistical machine translation to neural machine translation, which got under way late last autumn. Bill, an inveterate mocker of lousy machine translation, had popped a piece of German into Google Translate and, to his amazement, found little to mock in the output. German, it seems, was among the first language pairs for which Google introduced neural machine translation. I put down the phone, clicked my way over to Google Translate and pasted in a piece of German. To say that what I saw changed my life would be a naïve and overdramatic reaction to what was essentially a somewhat more fluent arrangement of the correctly translated words in my test paragraph than I would have expected in a machine translation. But these were early days, I told myself, and things could only get better.

Around that time the PR and marketing people at Google, Microsoft and Systran had gone into top gear and put out ambitious claims for neural machine translation and the future of translation. Systran's website claimed NMT could "produce a translation overachieving the current state of the art and better than a non-native speaker". Even in a scientific paper the Google NMT team wrote that "additional experiments suggest the quality of the resulting translation system gets closer to that of average human translators", while a Microsoft blogger wrote that "neural networks better capture the context of full sentences before translating them, providing much higher quality and more human-sounding output".

Even allowing for the hype factor, I could not doubt the evidence of my own eyes. Being a translator who had taught himself to code, I was proud of my rule-based Dutch-English MT system, which subsequently became a hybrid system incorporating some of the approaches of phrase-based statistical machine translation. However, I sensed -- and I say "sensed" because I had no foundation of knowledge then -- that neural machine translation had the potential to become a significant breakthrough in MT. I decided to "go neural" and dropped everything else I was doing.

What is this neural machine translation all about? According to Wikipedia, "Neural machine translation (NMT) is an approach to machine translation that uses a large neural network". So, what's a neural network? In simple terms, a neural network is a system of hardware and software patterned after the operation of neurons in the human brain. Typically, a neural network is initially trained, or fed large amounts of data. Training consists of providing input and telling the network what the output should be. In machine translation, the input is the source data and the expected output is the parallel target data. The network tries to predict the output (i.e. translate the input) and keeps adjusting its parameters (weights) until it gets a result that matches the target data. Of course, it's all far more complex than that, but that's the idea.

Not knowing anything about NMT, I joined the OpenNMT group, which is led by the Harvard NLP Group and Systran. According to the OpenNMT website, OpenNMT is an industrial-strength, open-source (MIT-licensed) neural machine translation system utilizing the Torch/PyTorch mathematical toolkit. The last two words are key here -- in essence, NMT is math. OpenNMT is written in both the Lua and Python programming languages, but the scripts that make up the toolkit, which are typically 50-100 lines long, are in essence connectors to Torch, where all the mathematical magic really happens. Another NMT toolkit is Nematus, developed in Python by Rico Sennrich et al., which is based on the Theano mathematical framework.

If you're thinking of delving into NMT and don't have any Linux skills, get them first. It's theoretically possible to run OpenNMT on Windows, either directly or through a virtual machine, but most of the tutorials you'll need to get up and running simply assume you're running Ubuntu 14.04, and nobody will want to give you a lesson in basic Linux. While in theory you can train on any machine, in practice, for all but trivially small data sets, you will need a GPU (Graphics Processing Unit) that supports CUDA if you want training to finish in a reasonable amount of time. For medium-size models you will need at least a 4GB GPU; for full-size state-of-the-art models, 8-12GB is recommended. My first neural MT training run, using the sample data (around 250,000 sentences) provided on the OpenNMT website, took 8 hours on an Intel Xeon X3470 (S1156, 2.93 GHz, quad-core) with 32GB of RAM. The helpful people on the OpenNMT forum recommended that I install a GPU if I wanted to process large volumes of data. I installed an Nvidia GTX 1070 with 8GB of onboard RAM, which enabled me to train a model from 3.2 million sentences in 25 hours.
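In case you'd like to follow along at home: assuming a working Torch installation (an adventure in its own right, and another reason to have those Linux skills), getting the Lua edition of the toolkit onto your machine is just a matter of cloning the repository:

git clone https://github.com/OpenNMT/OpenNMT
cd OpenNMT

Everything below -- preprocessing, training, translation -- is run from inside that directory.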

However, I'm getting ahead of myself here. Setting up and running an NMT experiment is -- on the surface -- a simple process involving three steps: preprocessing (data preparation in the form of cleaning and tokenization), training, and translation (referred to as inference or prediction by academics!). Those who have tried their hand at Moses will be familiar with the need for parallel source and target data containing one sentence per line, with tokens separated by a space. In the OpenNMT toolkit the preprocessing step generates a dictionary of source vocabulary-to-index mappings, a dictionary of target vocabulary-to-index mappings, and a serialized Torch file -- a data package containing vocabulary, training, and validation data. Internally the system will use the indices, not the words themselves. The goal of any machine translation practitioner is to design a model that successfully converts a sequence of words in a source language into a sequence of words in a target language. There is no shortage of views on what that "success" actually is. Whatever it is, the success of the inference or prediction (read "translation") will depend on the knowledge and skill deployed in the training process. The training is where the clever stuff happens, and the task of training a neural machine translation engine is in some ways no different from the task of training a statistical machine translation system. We have to give the system the knowledge to infer the probability of a target sentence E given the source sentence F (the letters "F" and "E" being conventionally used in the field of machine translation to refer to the source and the target respectively). The way in which we give the neural machine translation system that knowledge is what differs.
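To make the preprocessing step concrete, here is what it looks like in the Lua edition of OpenNMT. The file names are simply the ones used in the OpenNMT quickstart, so substitute your own parallel data:

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

This single command produces the two vocabulary-to-index dictionaries and the serialized data/demo-train.t7 package described above, which the training step then consumes.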

Confession time -- having failed math at school, I've never found anything but the simplest of equations easy reading. When I worked my way through Philipp Koehn's excellent "Statistical Machine Translation" I skipped the most complex equations. Papers on neural machine translation are crammed with such equations. So, instead of spending weeks staring at what could just as well have been hieroglyphs, I took the plunge and set about training my first neural MT engine -- they say the best way to learn is by doing! This was accomplished by typing "th train.lua -data data/demo-train.t7 -save_model demo-model". This command applied the training script to the prepared source and target data (saved in the file "demo-train.t7") with the aim of generating my model (or engine). Looks simple, doesn't it? But under the hood a lot of sophisticated mathematical operations get under way. These come down to learning by trial and error. As already mentioned, we give our neural network a batch of source sentences from the training data, and these are related word by word to words in the target data. The system keeps adjusting various parameters (weights) assigned to the words in the source sentence until it can correctly predict the corresponding target sentence. This is how it learns.
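For completeness: once training has finished, the third step -- translation proper -- is another one-liner. The checkpoint name below is a placeholder, because OpenNMT stamps every saved model with the epoch number and its perplexity score, so yours will be called something slightly different:

th translate.lua -model demo-model_epochX_PPL_X.t7 -src data/src-test.txt -output pred.txt

The translations land, one per line, in pred.txt, ready to be compared against a reference.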

My first model was a Dutch-English engine, which was appropriate, as I had spent the previous 15 years building and refining a rule-based machine translation system for that language pair. I was delighted to see that the model had by itself learned basic rules of Dutch grammar and word re-ordering rules that had taken me many, many hours of coding. It knew when to translate the Dutch word "snel" as "quick" and when as "quickly" -- something my rule-based system could still get wrong in a busy sentence of some length. "Het paard dat door mijn vader is gekocht" is rendered as "The horse bought by my father" and not "The horse which was bought by my father," reflecting an editorial change in the direction of greater fluency. Another rule the system had usefully learned was to generate the English genitive form, so that "het paard van mijn vader" is translated as "my father's horse" and not "the horse of my father" -- although it did fail on "De hond van de vriend van mijn vader," which came out as "My father's dog" instead of "My father's friend's dog," so I assume some more refined training is needed there.

These initial experiments involved a corpus of some 5 million segments drawn from Europarl, a proprietary TM, the JRC-Acquis, movie subtitles, Wikipedia extracts, various TED talks and Ubuntu user texts. Training the Dutch-English engine took around 5 days. I used the same corpus to train an English-Dutch engine. Again, the neural network did not have any difficulty with the re-ordering of words to comply with the different word order rules in Dutch. The sentence "I want to develop the new system by taking the best parts of the old system and improving them" became "Ik wil het nieuwe systeem ontwikkelen door de beste delen van het oude systeem te nemen en deze te verbeteren". Those who read any Germanic language will notice that the verbal form "taking" has moved seven words to the right and is now preceded by the particle "te". This is a rule which the system has learned from the data.

So far, so good. But -- and there are a few buts -- neural machine translation does seem to have some problems which perhaps were not first and foremost in the minds of the academics who developed the first NMT models. The biggest is how to handle OOVs (Out-of-Vocabulary Words, i.e. words not seen during training), which can prove numerous if you try to use an engine trained on generalist material to translate even semi-specialist texts. In rule-based MT you can simply add the unknown words either to the general dictionary or to some kind of user dictionary, but in NMT you can't add to the source and target vocabularies once the model has been built -- the source and target tokens are the building blocks of the mathematical model.

Various approaches have been tried for handling OOVs in statistical machine translation. In neural machine translation, the current best practice seems to be to split words into subword units or, as a last resort, to use a backoff dictionary that is not part of the model. For translations out of Dutch I have introduced my own Word Splitter module, which I had already applied in my old rule-based system. Applied to the input prior to submission to the NMT engine, it ensures that compound nouns not seen in the training data will usually be broken down into smaller units so that, for example, the unseen "fabriekstoezichthouder" breaks down into fabriek|s|toezichthouder and is correctly translated as "factory supervisor". With translations out of English I have found that compound numbers like "twenty-three" are not getting translated even though they are listed in the backoff dictionary. This isn't just an issue with the engines I have built: try asking Systran's Pure Neural Machine Translation demonstrator to translate "Two hundred and forty-three thousand fine young men" into any of its range of languages and you'll see some strange results -- in fact, only the English-French engine gets it right! The reason is that individual numerical entities are not seen often enough (or not seen at all) in the training data, and something that's so easy for a rule-based system becomes an embarrassing challenge. These issues are being discussed in the OpenNMT forum (and I guess in other NMT forums as well) as researchers become aware of the problems that arise once you try to apply successful research projects to real-world translation practice. I've joined others in making suggestions to solve this challenge, and I'm sure the eventual solution will be more than a workaround. Combining or fusing statistical machine translation and neural machine translation has already been the subject of several research papers.
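If you'd like to experiment with subword units yourself, the approach most people use is the byte pair encoding described by Rico Sennrich and his colleagues, whose subword-nmt scripts are freely available on GitHub. A minimal sketch -- assuming already-tokenized training and input files, and with 32,000 merge operations simply because that's a commonly used value:

git clone https://github.com/rsennrich/subword-nmt
python subword-nmt/learn_bpe.py -s 32000 < train.tok.txt > bpe.codes
python subword-nmt/apply_bpe.py -c bpe.codes < input.tok.txt > input.bpe.txt

An unseen compound like "fabriekstoezichthouder" then comes out as something like "fabrieks@@ toezicht@@ houder" -- the "@@" markers tell the system where to glue the pieces back together after translation. Note, though, that the splits are learned from frequency statistics and won't always match the morpheme boundaries a linguist (or my Word Splitter) would choose.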

Has it all been worth it? Well, customers who use the services provided by my translation servers have (without knowing it) been receiving the output of neural machine translation for the past month and to date nobody has complained about a decline in quality! I have learned something about the strengths and weaknesses of NMT, and some of the latter definitely present a challenge from the viewpoint of implementation in the translation industry -- a translation engine that can't handle numbers properly would be utterly useless in some fields. I have built trial MT engines for Malay-English, Turkish-English and Lithuanian-English from a variety of bilingual resources. The Malay-English engine was built entirely from the OPUS collection of movie subtitles -- some of its translations have been amazingly good and others hilarious. I have conducted systematic tests and demonstrated to myself that the neural network can learn and its inferences involve more than merely retrieving strings contained in the training data. I'll stick with NMT.

Are my NMT engines accessible to the wider world? Yes, a client allowing the translation of TMX files and plain text can be downloaded from our website, and my colleague Jon Olds has just informed me that plug-ins to connect memoQ and Trados Studio (2015 & 2017) to our Dutch-English/English-Dutch NMT servers will be ready by the end of this month. Engines for other language pairs can be built to order as a cloud or on-premises solution.

 

7. New Password for the Tool Box Archive

As a subscriber to the Premium version of this journal you have access to an archive of Premium journals going back to 2007.

You can access the archive right here. This month the user name is toolbox and the password is dingodominance.

New user names and passwords will be announced in future journals.

 

The Last Word on the Tool Box Journal

If you would like to promote this journal by placing a link on your website, I will in turn mention your website in a future edition of the Tool Box Journal. Just paste the code you find here into the HTML of your webpage, and a little icon linking to my website will be displayed on your page.

If you are subscribed to this journal with more than one email address, it would be great if you could unsubscribe redundant addresses through the links Constant Contact offers below.

Should you be interested in reprinting one of the articles in this journal for promotional purposes, please contact me for information about pricing.

© 2017 International Writers' Group    

 
