[Ocaml-i18n] proposal: message catalogue system

Benjamin Geer ben at socialtools.net
Tue Dec 2 10:53:03 PST 2003


I should introduce myself briefly:

I'm a programmer with a background mostly in Java, and I started 
programming in Caml this year. I have an M.A. in linguistics; I speak 
English and French, and I'm learning Italian and Arabic.

I've been thinking about writing a message catalogue system for Caml. 
My motivation is that I need it for a web application, but I'm keen to 
have it be suitable for Caml programs in general. I emailed some ideas 
to Matthieu Sozeau (author of OCamlI18n); he suggested that we continue 
the discussion on this list. He pointed me to an article about Perl's 
Maketext:

http://www.icewalkers.com/Perl/5.8.0/lib/Locale/Maketext/TPJ13.html

It makes a lot of good points. Its main insight, I think, is that a 
translation in a message catalogue can be thought of as a function 
that returns a string in a particular language, often including some 
data that was passed to it.
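
To make that idea concrete, here is a minimal OCaml sketch of a translation modelled as a function from arguments to a string. The type name and the two function names are made up for illustration; they are not part of any existing library.

```ocaml
(* A translation is a function from a list of arguments to a string.
   The type and function names are illustrative only. *)
type translation = string list -> string

(* English version of a one-argument message. *)
let cannot_open_en : translation = function
  | [file] -> "Cannot open file " ^ file ^ "."
  | _ -> invalid_arg "cannot_open_en: expected one argument"

(* French version of the same message. *)
let cannot_open_fr : translation = function
  | [file] -> "Impossible d'ouvrir le fichier " ^ file ^ "."
  | _ -> invalid_arg "cannot_open_fr: expected one argument"

let () = print_endline (cannot_open_en ["notes.txt"])
```

Both functions have the same type, so a program can pick one per language without caring how the string is built.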

Since it's a function, the next question is: what language should it be 
written in? Locale::Maketext provides two languages: a simple 'bracket 
notation', and Perl itself.

In Java, java.text.MessageFormat provides a bracket notation, with no 
ability to fall back to anything more powerful. This looks to me like a 
serious limitation. For example, its approach to plurals (in 
java.text.ChoiceFormat) only allows you to specify different forms for 
absolute ranges of quantities. That doesn't seem able to handle 
Slavic-style plurals, where the form depends on the last digits of the 
number rather than on a fixed range (see the Polish example in the GNU 
gettext manual: 
http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150).

It seems to me that bracket notations are appealing because they allow 
you to express the translation function as an *exemplar*, which is 
simple and intuitive:

Your search returned {0} files in {1} directories.
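
As a rough illustration of how little machinery such an exemplar needs, here is a sketch of positional substitution in OCaml. The function name expand is hypothetical, and this is not how MessageFormat or Maketext are actually implemented; it only replaces {0}, {1}, ... with the corresponding argument.

```ocaml
(* Replace {0}, {1}, ... in [template] with the matching element of
   [args]. Anything that is not a valid placeholder is copied as-is. *)
let expand template args =
  let n = String.length template in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then
      match template.[i] with
      | '{' ->
        (match String.index_from_opt template i '}' with
         | Some j ->
           let inner = String.sub template (i + 1) (j - i - 1) in
           (match int_of_string_opt inner with
            | Some k when k >= 0 && k < List.length args ->
              Buffer.add_string buf (List.nth args k)
            | _ ->
              (* Unknown placeholder: keep the original text. *)
              Buffer.add_string buf (String.sub template i (j - i + 1)));
           go (j + 1)
         | None ->
           (* No closing brace: treat the brace as ordinary text. *)
           Buffer.add_char buf '{';
           go (i + 1))
      | c ->
        Buffer.add_char buf c;
        go (i + 1)
  in
  go 0;
  Buffer.contents buf
```

Note that nothing here can choose between "file" and "files"; the notation has no room for that logic.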

However, the bracket notations provided by java.text.MessageFormat and 
Locale::Maketext are not powerful enough to handle the more complex 
logic that is necessary in order to generate plurals in some natural 
languages. The solution to this problem in Maketext is to fall back to 
using a different programming language entirely, in this case Perl. But 
then the translator needs to learn two syntaxes; one of these is too 
simple for the task at hand, and the other one is too complex. Wouldn't 
it be nice if the translator could learn just one syntax, which allowed 
him to express the translation as an exemplar, and which was also 
powerful enough to handle the logic for plurals?

It seems to me that what's needed here is a template language. I've 
written one, called CamlTemplate (http://saucecode.org/camltemplate). 
For simple exemplars, it's as easy to use as bracket notation, e.g.:

Cannot open file ${a}.

Getting back to the issue of plurals, suppose your message just has to 
say "n files", where n is a number. The English template could be:

#if (a == 1)
   1 file
#else
   ${a} files
#end

The Polish one could be:

#if (a == 1)
   1 plik
#elseif (a % 10 >= 2 && a % 10 <= 4 && (a % 100 < 10 || a % 100 >= 20))
   ${a} pliki
#else
   ${a} plików
#end
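
For comparison, the Polish rule can also be written as a plain OCaml function. This sketch follows the plural rule given for Polish in the GNU gettext documentation (1 is singular; numbers ending in 2 to 4, except 12 to 14, take one plural form; everything else takes the other). The function name polish_files is made up for illustration.

```ocaml
(* Polish plural of plik, following the gettext rule for Polish:
   n = 1 -> plik
   n mod 10 in 2..4, but n mod 100 not in 12..14 -> pliki
   otherwise -> plikow (with the accented o) *)
let polish_files n =
  let form =
    if n = 1 then "plik"
    else if n mod 10 >= 2 && n mod 10 <= 4
            && (n mod 100 < 10 || n mod 100 >= 20) then "pliki"
    else "plików"
  in
  string_of_int n ^ " " ^ form

let () = List.iter (fun n -> print_endline (polish_files n)) [1; 2; 5; 22; 112]
```

The tests on n mod 10 and n mod 100 are exactly what a scheme based on absolute ranges cannot express.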

The article on Maketext suggests writing Perl functions to generate 
plurals, and calling these functions from the translations. This could
be done in CamlTemplate as well, e.g. for English:

#macro quant(num, word)
   ${num}
   #if (num == 1)
     ${word}
   #else
     ${word}s
   #end
#end

The English "n files" template would then become:

#quant(a, "file")

Of course you could expand this to handle the common irregular forms.
Since a CamlTemplate template can call Caml functions, simple
string-matching functions could be provided to do things like this:

#macro quant(num, word)
   ${num}
   #if (num == 1)
     ${word}
   #elseif (endsWith(word, "y"))
     ${stripSuffix(word, "y")}ies
   #else
     ${word}s
   #end
#end
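
Here is a sketch of what the Caml side of those helpers might look like. The names ends_with, strip_suffix and quant are hypothetical OCaml counterparts of the functions used in the macro; how they would be registered with the template engine is not shown.

```ocaml
(* True if [word] ends with [suffix]. *)
let ends_with word suffix =
  let lw = String.length word and ls = String.length suffix in
  lw >= ls && String.sub word (lw - ls) ls = suffix

(* Remove [suffix] from the end of [word], if it is there. *)
let strip_suffix word suffix =
  if ends_with word suffix then
    String.sub word 0 (String.length word - String.length suffix)
  else word

(* Same logic as the quant macro, written directly in OCaml. *)
let quant num word =
  let noun =
    if num = 1 then word
    else if ends_with word "y" then strip_suffix word "y" ^ "ies"
    else word ^ "s"
  in
  string_of_int num ^ " " ^ noun

let () = print_endline (quant 2 "directory")
```

The rule is deliberately naive: it would turn "day" into "daies", so a real version would need a check on the preceding letter or a table of exceptions.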

The next question is: how do you access a translation from a program? 
What do you use for a message key? Gettext uses the message itself in 
some natural language (the one the programmer used); it reads messages 
directly from program source code. I think this has several drawbacks:

1. If the same message is used several times in the program, when it 
changes, it must be changed in several places.

2. Representing the message as an exemplar might be complex in itself, 
as in the examples above, thus complicating the program.

3. The programmer might not be the person who writes the messages; 
having them in the source code is therefore an inconvenience, 
particularly if they need to be written before the programmer starts coding.

The alternative is to use some arbitrary string as a key; this is the 
approach usually taken in Java, where java.util.ResourceBundle maps 
keys to patterns for java.text.MessageFormat. I think it's a more 
maintainable approach, because messages can be changed without changing 
program source code.

Another question is: how do we store message catalogues? The problem of 
character encoding comes up right away. I suggest that we store them in 
XML files, because XML has built-in support for dealing with encodings. 
So I propose something like this:

<?xml version="1.0" encoding="UTF-8"?>
<catalog lang="en">
   <macros>
     #macro quant(num, word)
       ${num}
       #if (num == 1)
         ${word}
       #elseif (endsWith(word, "y"))
         ${stripSuffix(word, "y")}ies
       #else
         ${word}s
       #end
     #end
   </macros>

   <messages>
     <message key="disk_full">Disk ${a} is full.</message>
     <message key="files_in_dirs">There are #quant(a, "file") in 
#quant(a1, "directory").</message>
   </messages>
</catalog>

This could just be the default way of storing them; there could also be 
an interface allowing catalogues to be loaded from any other source.

So to get a translation in a Caml program, I'm proposing a function like 
this:

val msg : key:string -> args:string list -> string

You could use it like this:

msg "files_in_dirs" [ file_count; directory_count ]
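
A minimal sketch of how such a function could sit on top of a catalogue, assuming the catalogue is just a hash table from keys to translation functions. The table name, the hand-written "disk_full" entry and the choice to return the key itself when no translation is found are all assumptions made for illustration; a real version would build the table from the catalogue files and run the templates.

```ocaml
(* Catalogue: message key -> translation function.
   Names and contents are illustrative only. *)
let catalogue : (string, string list -> string) Hashtbl.t =
  Hashtbl.create 16

let () =
  Hashtbl.replace catalogue "disk_full" (function
    | [disk] -> "Disk " ^ disk ^ " is full."
    | _ -> invalid_arg "disk_full: expected one argument")

(* Look up [key] and apply its translation to [args].
   Assumption: an unknown key is returned unchanged. *)
let msg ~key ~args =
  match Hashtbl.find_opt catalogue key with
  | Some translate -> translate args
  | None -> key

let () = print_endline (msg ~key:"disk_full" ~args:["/dev/sda1"])
```

Because the program only ever mentions the key, the wording of the message can change without touching the source.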

If you were using CamlTemplate in a web application, you could also call 
this function in a template:

${msg("files_in_dirs", fileCount, directoryCount)}

The article about Maketext points out that you'll want to share some 
functions between languages, or at least between different variants of 
the same language. Currently in CamlTemplate all macros are global, i.e. 
can be used by all templates. I'm thinking about adding a simple 
namespace facility so that a template could be in, say, the "en.UK" 
namespace; when it used the #quant macro, the template engine would look 
for that macro in the "en.UK" namespace; if it didn't find it there, it 
would look in the "en" namespace, and then in the default namespace.
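
The lookup order described above could be sketched like this. It assumes namespaces are dotted strings, that the default namespace is the empty string, and that macros live in a hash table keyed by (namespace, macro name); none of this exists in CamlTemplate today, and the function names are hypothetical.

```ocaml
(* Namespaces to try for [ns], most specific first, ending with the
   default namespace (the empty string). *)
let rec fallback_chain ns =
  if ns = "" then [""]
  else
    match String.rindex_opt ns '.' with
    | Some i -> ns :: fallback_chain (String.sub ns 0 i)
    | None -> [ns; ""]

(* Find macro [name], starting in namespace [ns] and falling back
   through the chain. [macros] maps (namespace, name) to a macro. *)
let find_macro macros ns name =
  let rec search = function
    | [] -> None
    | candidate :: rest ->
      (match Hashtbl.find_opt macros (candidate, name) with
       | Some m -> Some m
       | None -> search rest)
  in
  search (fallback_chain ns)
```

With this scheme a template in "en.UK" only needs to define the macros that differ from plain "en".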

I think that would take care of all the issues raised by the Maketext 
article.

I'd love to hear some reactions to this proposal.

Ben




