Drupal's multilingual problem - why t() is the wrong answer

Drupal is a great system to run foreign language websites on. The core itself is written in English and modules and themes are expected to follow suit. For developers, very simple wrapper functions are available to mark your translatable strings and let Drupal translate them to whatever language is needed. These are the famous t(), the less famous format_plural() and a whole family of other functions. See my cheat sheet (PDF) and the drupal.org documentation for more on this.

Then there is "the other side": whatever does not come from code. Drupal works pretty well and very consistently if you want all of that to be in one foreign language (i.e. not English), but not if you want it in multiple languages (any of which can be English at that point). Drupal only has direct multilingual support for nodes (plus fields of entities) and for path aliases. But life with Drupal means you work with all kinds of other objects like blocks, views, rules, content types, etc., which are not "language-aware".

Unfortunately for building multilingual Drupal sites, this is the biggest problem that needs to be worked around. The contributed Internationalization module attempts to fill in the gaps by providing language associations and different workflows for translating these language-unaware objects. This works to some degree, but is really not easy without a lot of help from the modules implementing these objects.

Go the easy way with t() - wreck your ship

Module developers are well aware that if they call t() with a string, it should give back the translated string. It is so easy, and tempting to use for all text translation. So some module developers do use t() that way. There are even some examples of this in Drupal core, which we are working to remove. The field system for example lets users specify the field label, help/description, allowed values and the default value for fields. Now to translate all these, one could think t() is a nice and easy solution. In fact, Drupal core was using t() for some cases of label display and some cases of description display (but not for allowed values and default values). This was recently removed from field constructs in Drupal 7, so code that was using the t() system for translation of field properties will not work like that in Drupal 7.1.

Using t() for user-provided data is a very bad idea, and the problems come from the very simplicity of t(): it merely relates one source string to its translations. Let's imagine the field system used t() for all four of the field properties mentioned above. What would happen?

  1. Timing problem: t() collects translatable strings for storage in the database when it is invoked. Therefore the values are only available for translation once the form is displayed with all the help text, label, default value, allowed values, and so on. For fields, this requires the user to first navigate to the entity form so the source text is saved. For other types of objects where some strings are conditionally displayed (think views empty help text), this can be pretty hard to achieve.
  2. Source string change problem: t() stores translations keyed by the source string. If the source string changes (e.g. you fix a typo in your long help text), all your translations for that string are lost, and you need to redo them all over again.
  3. Source language assumption problem: t() assumes your source text is in English, and it will not let you translate it to English. Now if your site is not primarily built in English, you'll not be able to translate your custom objects to English. You'd need to set up a fake secondary English language to be able to.
  4. Overall UX problem: When you pass a string through t(), it is saved at a very generic place which just relates strings to others. It does not know that your string was a field label or help text, or that a string was the field label for a field for which another string was the help text. You cannot translate the four field properties at once, because there is no meta information involved to relate them together. (You can optionally specify a context for your strings, which somewhat mitigates this problem, but your strings are still translated at an entirely different place to where you edit the originals and filtering by context is still a major stumbling block for users, just try it.)
  5. Individual UX problem: Many strings in objects have widgets associated with them, like a format selector, a WYSIWYG editor, a dropdown, etc. There are no such widgets on the translation user interface. There are (almost) no allowed value limits. Some source strings (such as the default value for a body field) have a format assigned to them. Even if the default value can be full HTML, the t() backend will not let translators submit translations in that format, and will not provide the right widget to translate it.
  6. Permission escalation via formats: It is not just a UX problem, but also a permission problem. If the default value has a PHP format, and you can edit its translation, you could theoretically inject PHP code into the site merely with the 'translate interface' permission. The t() backend will filter the text for some XSS input, but it is entirely unaware of formats and the permissions associated with them.
  7. Cross-object permission escalation: t() just stores generic strings. If you use it to translate properties of all your objects, translators can manipulate your objects without permission to create or edit them. With the field example, if the allowed values are translatable via t(), translators can edit the allowed values of a field (in a language context), add new ones, or remove some without having the permission to actually tinker with fields. This might or might not be the desired behavior depending on your site.
  8. Workflow problem: translations either exist or they don't. There are no unpublished, in review, etc. translations with t(). You cannot preview what the result would look like with your translations before you save them.
  9. Performance problem: the t() system preloads short strings for quick translation on the page. While this might not be a big performance problem with field properties, think about what would happen if we used t() for taxonomy terms or other data that comes in large quantities. Since these are usually short strings, the t() string cache will load the translations for all of them on the page. On the other hand, since we have no object meta-information for these strings, we can only load the longer ones one by one when needed. This means separate SQL queries for the field help text, allowed values and default value, multiplied by the number of fields displayed.
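
To make the first two problems concrete, here is a minimal sketch of a store that keys translations by the source string, the way the list above describes t(). It is a toy model in Python, not Drupal code: the class, its methods and the example strings are all hypothetical and only illustrate the behavior.

```python
# Toy model of a t()-style store: translations are keyed by the source string.
# All names here are hypothetical; this is not Drupal's real implementation.


class SourceKeyedStore:
    def __init__(self):
        self.sources = set()     # strings seen so far (available to translate)
        self.translations = {}   # (source, langcode) -> translation

    def t(self, source, langcode):
        # Timing problem: the string is only registered when it is displayed.
        self.sources.add(source)
        return self.translations.get((source, langcode), source)

    def translate(self, source, langcode, text):
        if source not in self.sources:
            raise KeyError("string was never displayed, nothing to translate")
        self.translations[(source, langcode)] = text


store = SourceKeyedStore()
help_text = "Enter yuor phone number."  # note the typo

# The help text cannot be translated until something has rendered it once.
try:
    store.translate(help_text, "de", "Geben Sie Ihre Telefonnummer ein.")
    translatable_before_display = True
except KeyError:
    translatable_before_display = False

store.t(help_text, "de")  # the first display registers the source string
store.translate(help_text, "de", "Geben Sie Ihre Telefonnummer ein.")
before_fix = store.t(help_text, "de")

# Source string change problem: fixing the typo creates a brand new key,
# so the existing translation no longer applies.
help_text = "Enter your phone number."
after_fix = store.t(help_text, "de")
```

After the typo fix, the lookup falls back to the untranslated source text, because nothing connects the new string to the old one.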

The ideal system

Now let's see what an ideal system would do to avoid these issues and provide a generic translation service for objects in general (views, blocks, field properties, etc).

  1. Timing problem: creation, updates and deletion of the object in question should make its translatable strings available for translation right away.
  2. Source string change problem: instead of using the source string as the key, we should introduce a string identifier; Java, .NET and many others have been doing this for a pretty long time now - see property and resource files.
  3. Source language assumption problem: whenever we save an object, we should know and store which language we saved the object in (which will filter down to its text properties); we don't know this now for blocks, views, field properties, your contact form configuration, your anonymous username, your default date format, site name, content type labels, etc. - this is a major missing piece.
  4. Overall UX problem: we should be able to generate in-place translation forms based on the string identifier (for which we need an index of objects with their properties to tell which are translatable)
  5. Individual UX problem: we should be able to look up the widget to use from aforementioned index
  6. Permission escalation via formats: this should not happen once we know the widget and related metadata from the index about the object
  7. Cross-object permission escalation: this is a tough one, because it makes cross-object translation harder; I think we should avoid it by default by focusing people on the in-place translation tools and allow for it if not an issue or specifically required for the site
  8. Workflow problem: this can be solved by implementing translation sets for more objects; like the Drupal core node translation module, that could allow for previews, workflows, even more fine grained permissions, etc. for the given object - opening a whole new set of features
  9. Performance problem: translated object properties need different loading patterns implemented compared to simple string translation; this needs to be explored and probably different, overridable implementations provided for different use cases
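
As a rough illustration of points 1 to 3, here is a sketch of a store that keys strings by a stable identifier and records the source language when the object is saved. Everything in it (the class, the "object:property" identifier scheme, the example field) is an assumption made up for this example; it is not the i18n_string API or any existing Drupal interface.

```python
# Hypothetical sketch: strings keyed by a stable identifier, with the
# source language stored when the object is saved. Not an existing Drupal API.


class TranslatableObjectStore:
    def __init__(self):
        self.sources = {}        # string id -> (source langcode, source text)
        self.translations = {}   # (string id, langcode) -> translation

    def save_object(self, object_id, langcode, properties):
        # Saving the object registers every translatable property at once,
        # so nothing has to be displayed first (timing problem).
        for name, text in properties.items():
            self.sources[f"{object_id}:{name}"] = (langcode, text)

    def translate(self, string_id, langcode, text):
        self.translations[(string_id, langcode)] = text

    def render(self, string_id, langcode):
        source_langcode, source_text = self.sources[string_id]
        if langcode == source_langcode:
            return source_text
        # Fall back to the source text when no translation exists.
        return self.translations.get((string_id, langcode), source_text)


store = TranslatableObjectStore()

# The object is authored in German; English is just another target language.
store.save_object("field:phone", "de",
                  {"label": "Telefon", "help": "Telefonnumer eingeben."})
store.translate("field:phone:help", "en", "Enter your phone number.")

# Fixing the typo in the source does not orphan the English translation,
# because the key is the identifier and not the text.
store.save_object("field:phone", "de",
                  {"label": "Telefon", "help": "Telefonnummer eingeben."})

english_help = store.render("field:phone:help", "en")
german_help = store.render("field:phone:help", "de")
registered = sorted(store.sources)
```

A real system would probably also want to flag existing translations as possibly outdated when the source text changes; the sketch leaves that out.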

Translation is a rendering operation

OK, so far this was all about saving the translatable and editing the translation. How do we actually display it? Well, that part was actually not a problem with using t(), since if you used t() consistently, it should work for display (even if it is a big mistake for all kinds of other reasons). However, this lets us learn another lesson: translation is a display/rendering task. When you use translatable fields in core for example (actual field values, not the field labels and friends mentioned above), your translations will be stored under the node, with the right one being used depending on which version should be displayed. When you send hundreds of notification emails in a request, the right language version of the value will be loaded for each email (users can have very different language preferences). In a similar sense, translating the strings should be a rendering operation.
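
The notification email case can be sketched in a few lines. This is only an illustration in Python with made-up data: the value keeps all of its language versions, and a language is picked only at the moment of rendering for a given recipient. The function name, the fallback rule and the addresses are assumptions for the example.

```python
# Hypothetical sketch: the stored value keeps every language version and
# the language is chosen only when rendering for a specific recipient.

subject = {
    "en": "New comment",
    "de": "Neuer Kommentar",
    "fr": "Nouveau commentaire",
}
DEFAULT_LANGCODE = "en"


def render(value, langcode):
    # Fall back to the default language when no translation exists.
    return value.get(langcode, value[DEFAULT_LANGCODE])


recipients = [
    ("anna@example.com", "de"),
    ("bob@example.com", "en"),
    ("chloe@example.com", "it"),  # no Italian version stored
]

# One request, many recipients, each rendered in their own language.
outbox = [(address, render(subject, langcode))
          for address, langcode in recipients]
```

Because the stored value is never replaced by one of its translations, the same object can be rendered in several languages within a single request.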

There was considerable (but not really well known) work in the Drupal 7 core development cycle to tag database queries with 'translatable' and potentially implement contributed solutions to translate data right in the object loading phase. One of these experiments is in Berdir's sandbox at http://drupal.org/sandbox/berdir/1122562. In reality, while I think this approach fares well for performance (and solves the source string change problem), it does not solve any of the other problems. It works by making multiple copies of the data tables, one per language (to translate menus, you'd have as many menu tables as languages you need). However, the code still needs to make original values available for translation (timing problem), it has no idea of the original source language, there is no UX plan in place that I know of which would not require similar code support from each module defining the objects, the individual UX problem is still there, and permissions and workflows would depend on the currently missing implementation. On top of that, it introduces new problems by multiplying tables in your database and by translating your object at load time, which can be a problem if you re-save the object and overwrite the original with translated values, or if you need different language versions of the same object on a page (for which you need to make the load function language-aware, which requires the original objects to be language-aware as well).
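
The re-save hazard mentioned above is easy to reproduce in a toy model. The sketch below is Python with invented names and a simple lookup table instead of per-language database tables, so it does not reflect how the sandbox module actually works; it only shows what happens when an object is translated while loading and then saved back.

```python
# Hypothetical sketch of translating during load and then re-saving.
# A dictionary stands in for the database; names are made up.

storage = {"menu:home": {"title": "Home"}}
translations = {("menu:home", "title", "de"): "Startseite"}


def load_translated(object_id, langcode):
    # Translation happens at load time: the caller never sees the original.
    obj = dict(storage[object_id])
    for name in list(obj):
        obj[name] = translations.get((object_id, name, langcode), obj[name])
    return obj


def save(object_id, obj):
    storage[object_id] = dict(obj)


# Some code loads the item on a German page, makes an unrelated change,
# and saves it back.
item = load_translated("menu:home", "de")
item["weight"] = 5
save("menu:home", item)

original_title_after_save = storage["menu:home"]["title"]
```

The original English title is now gone from storage, replaced by the German text, even though the code only meant to change the weight.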

All in all, I don't think it would be possible to escape assigning language to objects and defining metainformation of the objects to support building translation user interfaces and workflows as well as handling permissions properly. Neither using t() nor just tagging your queries and then assuming someone else will take care of it for you cuts it.

So what can aspiring contributed module developers do to multilingual-enable their objects? I'd like to do a rundown of the current i18n_string module API in the next post, from where I hope we can brainstorm on how to simplify and structure it further, and then form Drupal 8 plans either based on that or some other solution to the above detailed problems / goals.

This post is cross-posted at http://groups.drupal.org/node/149984; please comment there to keep the discussion in one place.