Linking translation tools and Drupal

By Gábor Hojtsy, 11 July, 2007

Having a great website management system like Drupal with built-in content translation tools is an achievement in itself. But content is not always born in Drupal, and it’s most certainly not translated in Drupal. This makes it necessary, particularly for multilingual websites, for Drupal to interface with external translation tools and their translation workflows.

Computer Aided Translation (CAT) tools help translators reuse previously translated text in new work, and they archive current work so it can be reused in the future. This makes them a good fit for hooking into website management systems to help with content translation. Without going into great detail, the OASIS XML Localization Interchange File Format (XLIFF) enables interoperability between tools by defining a markup format and interchange language for localizable data. It’s also a well-known standard in the CAT industry.

Most big players in the CAT industry support XLIFF, but until recently Drupal lacked even a basic ability to integrate into these workflows. Over the past few months, I’ve been searching for ways to fix this. I found Bryan Schnabel's XLIFF Roundtrip Tool, which handles HTML to/from XLIFF conversions, and it integrated nicely into a Drupal module. While my XLIFF Tools module is more of a proof of concept than an industry-proven implementation (so far at least), I welcome everybody interested to take a look and test the module with different CAT tools.

The philosophy behind CAT-based workflows is to extract resources from their native formats and put them into a common, standard localization format that is easier to build tools around. The Gettext format and system used for interface localization in Drupal are ideal for interface text used in the application source code. However, translating user-generated content requires different translation tools and a common, reusable translation memory database. Using XLIFF, the translated resources are merged back into their native format when the translation is complete, and the results are stored in a translation memory. Filters and specifications for converting to and from XLIFF have been developed for a number of file types.
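To make the "common format" idea concrete, here is a minimal sketch of wrapping a few extracted strings in an XLIFF document. It assumes XLIFF 1.2 (the `file`, `body`, `trans-unit` and `source` elements and the 1.2 namespace); the function name `build_xliff`, the `node/1` identifier and the language codes are hypothetical and only for illustration. It is not the output format of the XLIFF Tools module or the XLIFF Roundtrip Tool.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 documents (assumed version for this sketch).
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(units, original, source_lang, target_lang):
    """Wrap (id, source_text) pairs in a minimal XLIFF 1.2 document."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,            # hypothetical identifier of the content
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "html",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for unit_id, text in units:
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": unit_id})
        # The translation tool later adds a <target> next to each <source>.
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_xliff([("1", "Hello"), ("2", "Good morning")], "node/1", "en", "hu"))
```

A CAT tool would open such a file, fill in a target for each translation unit, and hand the file back for the reverse conversion.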

There are two mapping methods to choose from: a "minimalist" approach and a "maximalist" approach, as the XLIFF standards refer to them. Here is a look at the minimalist approach.

The major difference between the two is in how markup information is retained throughout the translation process. The minimalist approach generates a skeleton from the original document and extracts only the translatable resources to XLIFF (possibly with some inline markup). With the maximalist approach, however, all structural and inline markup is encoded in the XLIFF document, and no skeleton is used.
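The skeleton idea behind the minimalist approach can be sketched in a few lines. This is an illustration only, assuming simple HTML where every text node between two tags is translatable; the `%%u1%%` placeholder syntax and the function names are hypothetical, and real filters handle inline markup, attributes and entities far more carefully.

```python
import re


def extract_minimalist(html):
    """Return (skeleton, units): markup with placeholders, plus the text to translate."""
    units = {}

    def stash(match):
        text = match.group(0)
        if not text.strip():
            return text  # keep whitespace-only nodes in the skeleton untouched
        unit_id = f"u{len(units) + 1}"
        units[unit_id] = text
        return f"%%{unit_id}%%"

    # Match text that sits between a closing '>' and the next opening '<'.
    skeleton = re.sub(r"(?<=>)[^<>]+(?=<)", stash, html)
    return skeleton, units


def merge(skeleton, translations):
    """Reverse conversion: put translated text back into the skeleton."""
    return re.sub(r"%%(u\d+)%%", lambda m: translations[m.group(1)], skeleton)


if __name__ == "__main__":
    skeleton, units = extract_minimalist("<h2>Hello</h2>\n<p>Good morning</p>")
    print(skeleton)  # markup only, with placeholders
    print(units)     # the strings that would go into XLIFF
    print(merge(skeleton, {"u1": "Szia", "u2": "Jó reggelt"}))
```

In a maximalist conversion there would be no separate skeleton: the `h2` and `p` structure itself would be encoded inside the XLIFF document.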

In this process, the extracted text is first pre-translated from the previously collected translation memory, then reviewed and corrected by a human translator. The resulting translations are stored in the translation memory, and a reverse conversion takes place to generate the translated document (using a skeleton if one is available). The hero image of the post provides a look at a Drupal XLIFF integration using the maximalist approach.
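The translation memory part of this loop can be sketched as follows. This is a deliberately simplified model, assuming the memory is a plain dictionary from source text to target text and that only exact matches are reused; real CAT tools also offer fuzzy matches and richer metadata. All names here are hypothetical.

```python
def pretranslate(units, memory):
    """Fill in exact matches from the memory; list the rest for the translator."""
    drafts, pending = {}, []
    for unit_id, source in units.items():
        if source in memory:
            drafts[unit_id] = memory[source]
        else:
            pending.append(unit_id)
    return drafts, pending


def store(units, translations, memory):
    """Record reviewed translations so later documents can reuse them."""
    for unit_id, target in translations.items():
        memory[units[unit_id]] = target


if __name__ == "__main__":
    memory = {"Hello": "Szia"}
    units = {"u1": "Hello", "u2": "Good morning"}

    drafts, pending = pretranslate(units, memory)
    print(drafts, pending)

    # The human translator reviews the drafts and translates the pending units.
    reviewed = dict(drafts, u2="Jó reggelt")
    store(units, reviewed, memory)
    print(memory)
```

After the `store` step, the next document containing "Good morning" would be pre-translated without any human input.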

One of the best parts about interfacing with tools like these is that Drupal only needs to do the conversion and reverse conversion. The rest of the work is done outside of the website with more tailored tools.

This post was originally published as a guest post on the Development Seed blog. It is no longer available there, so it is archived here for posterity.