[I18n-sig] Re: pygettext dilemma

François Pinard pinard@iro.umontreal.ca
07 Aug 2001 16:38:05 -0400


[Barry A. Warsaw]

> In Mailman, I've got a bunch of normal .py modules and a bunch of
> command line scripts.  The modules have their translatable strings
> nicely marked with _() and only those strings should be extracted.

Hello, Barry.  Long time no talk! :-)

`_(STRING)' serves two purposes.  First, it marks STRING for extraction and
later insertion in some generated POT file.  Second, it is a nickname for the
`gettext' function or the like, which will translate STRING at run time,
provided that a translation file supplies a translation.

Experience taught us that this is not always adequate.  We sometimes need
to delay a translation.  That is, we might use `_(VARIABLE)', with VARIABLE
being first assigned some translatable string elsewhere in the program.
Since VARIABLE is not a string, it does not get extracted into a POT file.
But those strings which could get assigned to VARIABLE are not extracted
either, because they are not marked.  You understand that if they were marked
with `_(STRING)', they would get translated prematurely.

All this to say that there is a need for marking strings in such a way that
they will be extracted into POT files, but otherwise untouched by Python.
That is, the way to mark a string should be a Python no-operation, and
ideally, should not alter the Python language.

The only simple Python no-operation I know is the unary prefix `+', and my
intuition tells me that it would be dangerous to use it for marking
delayed translation strings.  Using prefixes like i"STRING" or t"STRING"
(for "i"nternationalisable or "t"ranslatable) would require a modification
to Python.

So, I came up with the simple idea of playing a bit with the fact that Python folds
a succession of constant strings into a single one at compilation time.
The idea is to prefix a translatable string, when it is used outside the
usual `_(STRING)' idiom, by an empty string of the other kind, like this:

   Example         Type         For extractor

   'TEXT'          1-quoted     not marked
   "TEXT"          2-quoted     not marked
   '''TEXT'''      3-quoted     not marked
   ''"TEXT"        4-quoted     marked
   ""'TEXT'        5-quoted     marked
   """TEXT"""      6-quoted     not marked
   ""'''TEXT'''    7-quoted     marked
   ''"""TEXT"""    8-quoted     marked

Of course, the idea of using the empty string "of the other kind" is to
avoid ambiguity: prefixing '' to 'TEXT' would produce '''TEXT', which just
cannot work.  I agree that for 7-quoted and 8-quoted strings, it is not
really required to use the empty string of the other kind; an empty
string of the same kind would work without problem.  I suggest we keep
"of the other kind" for 7-quoted and 8-quoted, for the sake of consistency.
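
As a rough sketch of how an extractor might recognise this convention, the
following uses the standard `tokenize' module, which reports adjacent string
literals as separate STRING tokens even though the compiler folds them.
The function name `delayed_strings' and the sample source are assumptions
made for illustration; this is not how `pygettext' is written, and it only
handles plain string literals.

```python
import ast
import io
import tokenize


def delayed_strings(source):
    """Return strings prefixed by an empty string of the other quote kind."""
    found = []
    previous = None  # text of the preceding token, if it was a STRING
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.STRING:
            # Marked: preceded by '' or "", and opening with a different quote.
            if previous in ("''", '""') and tok.string[0] != previous[0]:
                found.append(ast.literal_eval(tok.string))
            previous = tok.string
        elif tok.type in (tokenize.NL, tokenize.COMMENT):
            continue  # non-logical line breaks and comments do not separate
        else:
            previous = None
    return found


sample = '''
a = 'plain'
b = ''"marked one"
c = ""'marked two'
d = """not marked"""
e = ''"""marked three"""
'''

marked = delayed_strings(sample)
```

At run time the marker costs nothing, since ''"TEXT" and "TEXT" compile to
the same constant; only the extractor, looking at the tokens, can tell
them apart.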

> The scripts however should have both _() and docstrings extracted,
> since the module docstrings include usage text.

In fact, I think that even within a single module, some docstrings should
be considered translatable, while some other docstrings should not be.
Having to make the choice for a whole module at a time is too coarse.
This goes almost without saying.  One should not feel compelled to avoid
docstrings for internal or service functions within a module, merely to
avoid having them spuriously extracted, and later, uselessly translated.

> Does anybody have any suggestions or better ideas?

I would be tempted to suggest that we merely use delayed string marking,
using the convention above (like in 4-quoted, 5-quoted, 7-quoted or 8-quoted)
for docstrings meant to be translated.  Such strings would be extracted
no matter what, in docstring position or not.

An option to `pygettext' might exist to extract all docstrings, whether
marked as delayed strings or not, but I would guess this is an interim
solution which would not be satisfying in the long term.  It is best to
mark translatable strings precisely, either using immediate `_(STRING)'
or delayed translation.

One problem is that Python does not seem to automatically concatenate a
sequence of strings as a single one, when in docstring position.  We might
consider this as a Python bug: repairing that bug would not really change
the language, and would allow delayed marking of translation strings.

Let me present the set of suggestions, in this message, as having a minimal
impact on Python, yet being pretty flexible in what it would allow us to do.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard