Marking translatable strings

Thu Sep 16 09:27:02 EDT 1999

Hello, people.

I had a real strange idea :-).  I first quickly dismissed it, but it is so
simple that I prefer to ponder it again, and share and debate it a little
first, maybe.

My friends know that I've been working at software internationalisation for
many years now, with a stress on program messages.  Of course, I want to
include Python in the realm of my possibilities, and myself start writing
internationalised scripts soon, in such a way that everything links nicely
with the Translation Project.

So, I want the big picture right now.  That is: a technique for marking
strings for automatic extraction and building of PO files, and a technique
for using PO files from within Python scripts.  I foresee that Python
introduces an unusual difficulty in that the textual domain for translations
may vary quite unexpectedly, when the control dynamically flies between
independent packages under different textual domains.

I started a discussion with Guido about this, but I'm a slow thinker, and
would not like to rush things before feeling rather solid, as Guido's time
is precious.  But on this forum, I thought I could dare exploring ideas,
asking for your forgiveness for any blunder I could make while thinking. :-)

1) Marking strings
   ---------------

There are many circumstances where the translation of a string has to be
delayed from the place where it textually occurs, and the syntax of the
language can make the marking difficult in a few cases.  In C, there is a
pre-processor between the sources and the compiler, so it is easy to introduce
an identity macro whose special name is recognised by the string extractor,
and which vanishes before the compiler sees it.  We use:

   #define N_(Text) Text

for that purpose.  In languages where such preprocessing is difficult, as
in `ksh' or `bash', strings are specially marked in the syntax itself, as in:

   $"translatable text"

but this requires a modification to the interpreter.  Emacs, with `defsubst',
also allows for macro expansion, and we then use tricks as in C (anywhere
except for doc strings).  Some other flavours of LISP are also open to
such tricks.

Python has no preprocessing, no special string syntax for markability,
and moreover, it has doc strings!  So, at first glance, it looks difficult.
However, and this is where my strange idea comes into play, it has eight
types of strings: ', ", ''', """, r', r", r''' and r"""; and I thought
that maybe we could just discipline ourselves to give more meaning to all
these differences, since after all, if we except some ending backslash
considerations, all eight types are equally capable of representing
any string.
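
As a quick check of that claim, here is a minimal sketch writing the same
text in all eight forms.  The names `forms' and `all_equal' are only
illustrative.

```python
# The same text written in each of the eight literal forms.
forms = [
    'text', "text", '''text''', """text""",
    r'text', r"text", r'''text''', r"""text""",
]

# All eight literals denote the very same string value.
all_equal = len(set(forms)) == 1
print(all_equal)

# The raw forms differ only in how backslashes are read.
print(len("\n"), len(r"\n"))
```

The quote style is invisible once the program runs, so any meaning given
to it has to be recovered from the source text by the extractor.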

I'm a strange, anal man, who needs a reason behind the slightest choices,
and believe me, eight types of strings gave me a lot of food for this
mania, all along while writing.  I'm still exploring! :-) Yet, after having
played with Python for almost 10 days, now, I came to realise that I'm more
naturally tempted to stick to the 'TEXT' notation for computer strings and
"TEXT" notation for human strings, the reasons being that there might
be a lot of apostrophes in human text, and that traditionally, we more
usually quote sentences with "TEXT", while we quote words with `TEXT'
(note the grave accent at the left).

So, the bizarre idea I got is that one could formalize this into
a rule: strings of type ", """, r" and r""" could be all markable as
translatable, while strings of type ', ''', r' and r''' would not be.
On the other hand, this might be overkill, as maybe people are used to
freely mixing types ' and ", and this change could be seen as stressful.
Could we choose better?
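
To make the rule concrete, here is a rough sketch of what an extractor
could do under it, assuming the standard `tokenize' and `ast' modules.
The function name `extract_double_quoted' and the sample source are
hypothetical, and this is not the way any existing extraction tool works.

```python
import ast
import io
import tokenize


def extract_double_quoted(source):
    """Return values of literals delimited by double quotes (single or triple)."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.STRING:
            continue
        literal = tok.string              # source text, prefix and quotes included
        body = literal.lstrip("rRuU")     # drop an optional r or u prefix
        if body.startswith('"'):          # matches both " and """ literals
            found.append(ast.literal_eval(literal))
    return found


# Hypothetical script: ' for computer strings, " for human strings.
sample = '''
key = 'config_key'
greeting = "Hello, world"
pattern = r'[0-9]+'
usage = r"""Usage: prog [options]"""
'''

print(extract_double_quoted(sample))
```

Under the stated convention, only the greeting and the usage text would
end up in the PO file.  The sketch ignores byte strings and formatted
strings, which are outside the eight types discussed here.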

Surely, since doc strings, by convention, use """, there is no choice but to
retain type """ for translatability, wherever it appears.  However, forcing
the use of """ everywhere we want translatability is an overhead of four
characters (just compare "TEXT" with """TEXT"""), while C uses three or four
characters (compare "TEXT" with _("TEXT") or N_("TEXT")), and bash uses only
one (compare "TEXT" with $"TEXT").  I would like Python to be as comfortable
as possible.
If I could plainly use "TEXT" instead of 'TEXT' to mark translatability,
I would have an overhead of zero characters, which would be better than
everything, but I'm not sure if this constraint would be acceptable to
Python writers.

Another possibility is to use ''"TEXT" instead of "TEXT", making an overhead
of two characters: that is the compile-time concatenation of '' with "TEXT".
This combination is quite unlikely to occur by accident, but it is a bit
uglier.
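
A small sketch of how such a marker could be detected, again assuming the
standard `tokenize' module.  The name `marked_by_empty_prefix' and the
sample source are made up for the illustration.

```python
import ast
import io
import tokenize


def marked_by_empty_prefix(source):
    """Return values of string literals written immediately after an empty ''."""
    strings = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.STRING
    ]
    marked = []
    for first, second in zip(strings, strings[1:]):
        # The marker is an empty '' touching the next literal, with no gap.
        if first.string == "''" and first.end == second.start:
            marked.append(ast.literal_eval(second.string))
    return marked


# Hypothetical script using ''"..." as the translatability marker.
sample = """
message = ''"Disk full"
name = 'disk0'
plain = "not marked"
"""

print(marked_by_empty_prefix(sample))

# The marker costs nothing at run time: the two literals are concatenated.
print(''"TEXT" == "TEXT")
```

With this variant, ordinary "TEXT" strings stay unmarked, so writers would
remain free to mix ' and " as they please.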

2) Translating strings
   -------------------

(Oops, I just received a phone call forcing me to leave fairly soon, so I
have to be very concise for the remainder of this message.  Let's rather
develop these in the possible thread that might follow from this message.)

What would be the most comfortable for me, short of having the Python
interpreter modified, is to merely use a function to force the actual
translation of a string.  The most comfortable (the least intrusive) way
would be to call:

        _(TEXT)

to get the translation of TEXT.  It resembles C, but it overloads `_',
which already has a preset meaning in interactive sessions, where it holds
the result of the last evaluated expression.  If I could push the preset `_'
somewhere else, maybe on `__', I would do it and reserve `_' for translation,
which would be much, much more common in the long run.

Using a function would allow us to build the whole translation chain
(administering the translations with teams, etc.), yet if the syntax
could be lightened with the help of Guido, I guess this would be welcome.
We might need to experiment first.

3) Setting the textual domain
   --------------------------

In a quick word, I guess that this problem could be fairly easily solved
through the handy scope rules for resolution of names in Python.  Each module
could have a global variable, with a standard name, setting which textual
domain to use within it.  So, even with the control flying like hell between
modules, it would not be a problem on average.  But there are problematic
cases, as when untranslated strings are transmitted to other modules to be
translated there, or maybe even for plain doc strings.  This requires good
thought.  This problem is more difficult than many might think at first.
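
A rough sketch of the per-module variable, under loud assumptions: the
names `TEXT_DOMAIN', `CATALOGUES' and `make_module' are invented here, the
catalogues are plain dictionaries rather than PO files, and the lookup of
the caller's globals relies on `sys._getframe', which is specific to
CPython.

```python
import sys

# Hypothetical in-memory catalogues, one per textual domain.
CATALOGUES = {
    "mailer": {"Open": "Ouvrir"},
    "shop": {"Open": "Ouvert"},
}


def _(text):
    """Translate text under the TEXT_DOMAIN global of the calling module."""
    caller_globals = sys._getframe(1).f_globals
    domain = caller_globals.get("TEXT_DOMAIN")
    return CATALOGUES.get(domain, {}).get(text, text)


MODULE_SOURCE = '''
def label():
    return _("Open")

def translate_here(text):
    return _(text)
'''


def make_module(domain):
    """Simulate a module: a separate global namespace with its own domain."""
    namespace = {"TEXT_DOMAIN": domain, "_": _}
    exec(MODULE_SOURCE, namespace)
    return namespace


mailer = make_module("mailer")
shop = make_module("shop")

# Each simulated module translates under its own domain.
print(mailer["label"](), shop["label"]())

# A string handed over untranslated gets the receiver's domain.
print(shop["translate_here"]("Open"))
```

The last call shows the problematic case: a string meant for the `mailer'
domain, once passed untranslated to `shop', is translated with the wrong
catalogue.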

OK, I have to rush away now.  Thanks for listening! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard