Translation Databases for Web Site Localization
Looking back, the daily challenges of the localization industry four or five years ago seem so simple and straightforward. Though we admittedly didn't think they were easy at the time, the most difficult issues we struggled with were various desktop publishing formats that didn't lend themselves well to translation and some resource file types where we had to be cautious not to corrupt code. Translation database tool developers responded to these needs by supplying filters for these formats, and, because all of the formats were relatively static, the tools became a widely accepted standard in the translation industry.



The range of filters available to support various formats in Déjà Vu.

Well, today's world of localization is certainly not static anymore. While the emphasis used to be on the localization of applications written in C/C++ or Visual Basic and their documentation, today the localization industry is forced to focus mainly on interactive content, the material that powers the new Web-driven economy. And the stress we used to feel about annual or semi-annual updates has now been intensified into weekly, daily or even hourly updates of Web site content. But what does this mean in practical terms?

The New Conditions of the New Economy

First of all, the file formats have changed. While we still deal with FrameMaker and QuarkXPress files or with RC files and other resource formats, we also have to tackle tagged formats such as SGML, XML and, of course, HTML. SGML and XML are highly customizable formats that -- by definition -- are not static. And though HTML is more static, it is increasingly being combined with scripting languages such as JavaScript and VBScript, and with database-driven technologies such as Active Server Pages or ColdFusion.



Assigning an HTML subfilter.

These changes present a variety of challenges for the developers and users of translation database tools. How can a tool provide a reliable filter for a format or a technology that is always changing? How can a tool provide the user who is an expert as a translator but not as a programmer with a user interface that clearly displays what needs to be translated and what needs to be left alone? How can a tool provide access to information that is no longer stored in a small handful of files, but instead in hundreds if not thousands of HTML, ASP or JS files and in databases that are difficult to access?

Finding the Right Tool -- An Example

Clearly, vendors of translation database tools believe that there is a place for their tools in the new Web-based economy, and, indeed, most will argue that their product's relevance is higher now than ever. But plowing through the dizzying array of commercially available translation database tools and evaluating how each meets the requirements of the new economy can be a formidable task. One approach that can help to clarify the decision process is to define the criteria that are most important to the individual user and then apply those criteria to each tool that is under consideration.

Defining the Criteria

Which criteria are most important will vary among translators, localization firms and other users of the tools. For some people, the deciding factor may be a WYSIWYG interface for HTML files, such as that of TRADOS' TagEditor or SDLX. Others need to include several file types in one translation project, as in Star Transit. Still others may need to use multiple memory databases simultaneously, a feature that TransSuite 2000 offers.

I have chosen a set of criteria that are the most relevant for my specific work as a translator and training specialist. I will apply them here to Déjà Vu from Atril, a tool that I know well since I train people in its use. First, the tool must provide adequate filters for the great variety of formats, including the static "traditional formats," tagged formats and database formats. Next, it has to be able to batch-process a very large number of files and maintain complex folder structures. Last, and perhaps most important, it has to provide a flexible, "non-static" approach that will allow for an on-the-fly integration of newly developed formats and codes.



Importing Access 97 databases.

Applying the Criteria

Déjà Vu -- along with most other translation database tools to varying degrees -- supports all the traditional formats (Word, RTF, FrameMaker, Interleaf, PageMaker, QuarkXPress, PowerPoint, C/C++, RC and so on). Déjà Vu can also handle many other formats that are traditionally not supported by most translation database tools, including Excel and Java Properties and files that have been processed in some other database tools.

Even more relevant in the context of this article are Déjà Vu's capabilities in dealing with Web site-related file formats. As almost all other translation database tools do, Déjà Vu supports HTML. So far so good, but when it comes to interactive Web sites -- which, believe it or not, represent the vast majority of sites -- translators can no longer get very far with pure HTML-filtering capabilities. First of all, a variety of scripting formats have to be supported, most commonly ASP, JavaScript or VBScript. Déjà Vu allows the user to refine the HTML filter by assigning a "subfilter" for one of these scripting languages; the tool then parses the respective files using these additional instructions.
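The idea of routing script content through its own subfilter can be illustrated with a minimal Python sketch. This is not how Déjà Vu's filter works internally; the class name SegmentExtractor and the rule that only double-quoted string literals in a script are candidates for translation are assumptions made purely for illustration.

```python
import re
from html.parser import HTMLParser

class SegmentExtractor(HTMLParser):
    """Hypothetical filter: page text goes to one list, script strings to another."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_segments = []    # ordinary page text
        self.script_strings = []   # string literals found inside <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            # "Subfilter" step: keep only double-quoted literals, skip all code.
            self.script_strings.extend(re.findall(r'"([^"\\]*)"', data))
        elif data.strip():
            self.text_segments.append(data.strip())

page = '<p>Contact us</p><script>var gnavsearch = 1; alert("Thank you");</script>'
extractor = SegmentExtractor()
extractor.feed(page)
extractor.close()
print(extractor.text_segments)   # text a translator would see
print(extractor.script_strings)  # script strings offered for translation
```

Without the script-aware branch, the whole script body would land in the translator's view as ordinary text, which is exactly the situation a subfilter is meant to prevent.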

The other relevant issue in this context is database-driven Web sites. One common format is the Access 97 database, and in Déjà Vu users can work directly in such databases. The problem with databases is that they typically contain several tables with numerous fields (columns), of which perhaps only one or two will have to be translated. How does a translation database tool distinguish translatable from non-translatable fields? In this tool, it is possible to define exactly which column in which table is to be translated, what kind of content the translatable field will have (plain text, HTML/ASP, VBScript or JavaScript) and where the translated output should be written. The translator can choose whether to overwrite the existing text with a "Same column" option or to insert it into a different column with a "t_content" option.

The tool provides direct access to this widely used format and also to numerous other database formats that can either be exported to the Access MDB format or to the CSV (comma separated value) format, which then can be easily imported into Access 97 databases and then processed in Déjà Vu. Once the translation is finished, the user can export the translation into a CSV format again.
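The column selection described above (one translatable column, output either to the same column or to a separate one) can be sketched in Python. The sketch uses SQLite from the standard library as a stand-in for an Access database; the table name pages, the column names, the glossary and the function translate_column are all hypothetical and are not part of Déjà Vu.

```python
import sqlite3

def translate_column(conn, table, source_col, target_col, translate):
    """Translate one column of one table.

    Pass target_col == source_col to overwrite the source text,
    or a different column name to keep the source intact.
    """
    # Identifiers cannot be bound as SQL parameters, so check them first.
    known = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if known is None:
        raise ValueError(f"unknown table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = {description[0] for description in cursor.description}
    if source_col not in columns or target_col not in columns:
        raise ValueError("unknown column")

    rows = conn.execute(f'SELECT rowid, "{source_col}" FROM "{table}"').fetchall()
    for rowid, text in rows:
        conn.execute(
            f'UPDATE "{table}" SET "{target_col}" = ? WHERE rowid = ?',
            (translate(text), rowid),
        )
    conn.commit()

# Demo data: only "content" is translatable; "url" must stay untouched.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, content TEXT, t_content TEXT)"
)
conn.executemany(
    "INSERT INTO pages (url, content) VALUES (?, ?)",
    [("/home", "Welcome"), ("/search", "Search")],
)
glossary = {"Welcome": "Willkommen", "Search": "Suche"}
translate_column(conn, "pages", "content", "t_content", lambda s: glossary.get(s, s))
result = conn.execute("SELECT url, content, t_content FROM pages ORDER BY id").fetchall()
print(result)
```

Calling translate_column with "content" as both source and target would correspond to the overwrite behavior; writing to a separate column keeps the source text available for review.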

What happens with batch processing? Déjà Vu is built around batch processing. Experienced users frequently process several hundred or even thousands of files from one folder and all its subfolders simultaneously. This allows the user to work on the language material in one large file rather than having to open, save and close hundreds of individual files. The real strength in the way Déjà Vu handles its batch processes, though, is that it preserves the original folder structure when the user exports the translated files -- a crucial aspect for Web site translation, where links depend on file locations.
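To make the point about folder structure concrete, here is a minimal Python sketch of a batch export that mirrors the source tree. It illustrates the general technique only; the function names, the list of extensions and the toy translate callback are assumptions, not a description of Déjà Vu's internals.

```python
import tempfile
from pathlib import Path

# Assumed set of translatable Web file types for this sketch.
WEB_EXTENSIONS = {".html", ".htm", ".asp", ".js"}

def collect_files(source_root):
    """Return translatable files as paths relative to source_root."""
    root = Path(source_root)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in WEB_EXTENSIONS
    )

def export_tree(source_root, target_root, translate):
    """Write translated copies under target_root, mirroring every subfolder."""
    source_root, target_root = Path(source_root), Path(target_root)
    exported = []
    for relative in collect_files(source_root):
        out_path = target_root / relative
        out_path.parent.mkdir(parents=True, exist_ok=True)
        text = (source_root / relative).read_text(encoding="utf-8")
        out_path.write_text(translate(text), encoding="utf-8")
        exported.append(relative)
    return exported

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "site"
    (src / "products" / "img").mkdir(parents=True)
    (src / "index.html").write_text("<p>Hello</p>", encoding="utf-8")
    (src / "products" / "list.asp").write_text("<p>Hello</p>", encoding="utf-8")
    (src / "products" / "img" / "logo.gif").write_bytes(b"GIF89a")  # not translatable

    exported = export_tree(src, Path(tmp) / "site_de",
                           lambda text: text.replace("Hello", "Hallo"))
    exported_paths = [path.as_posix() for path in exported]
    mirrored = (Path(tmp) / "site_de" / "products" / "list.asp").read_text(encoding="utf-8")
    print(exported_paths)
```

Because every output path is derived from the file's path relative to the source root, relative links between pages keep working in the translated copy.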



Imported HTML file (source column).

The final threshold for Déjà Vu to cross in this evaluation is the last condition -- flexibility to deal with ever-changing or poorly defined scripting standards. The ingenious and highly technical solution that Déjà Vu offers to this is most likely not something for the average translator, but certainly something that will set the hearts of project managers aflutter.

With the help of fairly straightforward regular expressions (straightforward, that is, for technical users or for users who are willing to take the time to understand the fairly good explanations provided on Atril's Web site), users can override or modify the filters that Déjà Vu applies to HTML files.

Consider the example of an imported HTML file containing JavaScript that Déjà Vu's default HTML/JavaScript filter misinterpreted: the source column includes terms such as gnavproducts, gnavservices and gnavsearch.

Even to the inexperienced user it will seem obvious that all the terms beginning with gnav are not to be translated. The context view that Déjà Vu provides verifies that these terms are indeed code.



Context view of a source sentence.

Modifying any of these terms could cause a malfunction of the Web site and must therefore be strictly avoided. Though Déjà Vu provides a lock option that allows the translator or project manager to prevent rows from being translated, it would be easier not to have these strings imported into the Déjà Vu project to start with. This can be achieved by writing a simple text file, naming it HTMHide.txt, and saving that file into the root directory in which the source files are stored. The contents of that file will use the following pattern:

Pre _ StringsToHide _ Post _ StringsToDisplay

(Pre is whatever precedes the strings to be hidden, StringsToHide are the actual strings to be hidden, Post is whatever comes after the strings, and StringsToDisplay are exceptions to the prior rules.)

The text file for this particular HTML file would contain the following string:

_ gnav.* _ _

(Note that _ represents tab characters before and after gnav.*)

As a result, all strings that start with gnav would be excluded, and the imported file would display correctly. These customizable options are only to be used for the relatively small number of HTML-related files that do not provide a clean import with the provided filters. Any Web site developer, programmer or project manager will understand the potential that such customizability of an HTML filter offers.
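The effect of such a rule can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Déjà Vu's actual matching logic: it assumes the four fields are separated by tabs, treats StringsToHide and StringsToDisplay as regular expressions matched from the start of a segment, and ignores the Pre and Post context fields (which are empty in the example rule). The function names are hypothetical.

```python
import re

def parse_rule(line):
    """Split one tab-separated rule into (pre, hide, post, display)."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (4 - len(fields))  # pad missing trailing fields
    pre, hide, post, display = fields[:4]
    return pre, hide, post, display

def is_hidden(segment, rule):
    """True if the segment matches StringsToHide and no StringsToDisplay exception."""
    _pre, hide, _post, display = rule  # Pre/Post context is ignored in this sketch
    if not hide or not re.match(hide, segment):
        return False
    if display and re.match(display, segment):
        return False
    return True

# The rule from the article: empty Pre, "gnav.*" to hide, empty Post, no exceptions.
rule = parse_rule("\tgnav.*\t\t")
segments = ["gnavproducts", "gnavservices", "gnavsearch", "Products and Services"]
visible = [segment for segment in segments if not is_hidden(segment, rule)]
print(visible)
```

With this rule, only the genuinely translatable segment remains in the list; a non-empty fourth field would let selected strings through even though they match the hide pattern.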



Refined HTML import (source column).

The rules and processes of the localization industry have indeed changed -- new, dynamic formats must be supported, the turnaround time has decreased rapidly and batch processing has become essential. Each user will have specific needs that will change over time. The architecture of the best translation database tools and the farsightedness of their development teams will allow the tools to respond to these new challenges, adapt to new requirements and propel translation database tools into today's rapidly changing era of localization.


Jost Zetzsche is a translator and localization consultant, and a co-founder of the Oregon company International Writers' Group, LLC. He can be reached at jzetzsche@internationalwriters.com.
--MultiLingual Press, reprinted with permission of translationzone.com

 


©1999-2005 International Writers' Group, LLC