[I18n-sig] Re: pygettext dilemma

Barry A. Warsaw barry@zope.com
Sun, 12 Aug 2001 23:42:57 -0400


>>>>> "FP" == François Pinard <pinard@iro.umontreal.ca> writes:

    >> In Mailman, I've got a bunch of normal .py modules and a bunch
    >> of command line scripts.  The modules have their translatable
    >> strings nicely marked with _() and only those strings should be
    >> extracted.

    FP> Hello, Barry.  Long time no talk! :-)

Indeed!  BTW, I18N Mailman is coming along very nicely now.  I hope
the 2.1 release will happen within the next few months.

    FP> `_(STRING)' is two-fold.  First, it marks STRING for
    FP> extraction and later insertion in some generated POT file.
    FP> Second, it is a nickname for the `gettext' function or alike,
    FP> that will translate STRING at run time given that a
    FP> translation file provides a translation.

    FP> Experience taught us that this is not always adequate.  We
    FP> sometimes need to delay a translation.  That is, we might use
    FP> `_(VARIABLE)', with VARIABLE being first assigned some
    FP> translatable string elsewhere in the program.  Since VARIABLE
    FP> is not a string, it does not get extracted into a POT file.
    FP> But those strings which could get assigned to VARIABLE are not
    FP> extracted either, because they are not marked.  You understand
    FP> that if they were marked with `_(STRING)', they would get
    FP> translated prematurely.

    FP> All this to say that there is a need for marking strings in
    FP> such a way that they will be extracted into POT files, but
    FP> otherwise untouched by Python.  That is, the way to mark a
    FP> string should be a Python no-operation, and ideally, should
    FP> not alter the Python language.

All the above is true, and I have encountered these situations in
Mailman 2.1.  Python, however, provides a very nice solution, quite in
keeping with the Pythonic "explicit-is-better-than-implicit" mantra.

What I do in this situation is to temporarily bind _() to a no-op
function so that the string is marked for extraction, but not
translated in place.  E.g.

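    import gettext

    # Temporarily make _() a no-op so the string is only marked for extraction.
    def _(s):
        return s

    foo = _('extract this string but do not translate it yet')

    # Rebind _() to the real translation function for the rest of the module.
    _ = gettext.gettext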

This works perfectly because Python doesn't suffer from the same
deficiencies as C (i.e. the C pre-processor :).

    FP> The only simple Python no-operation I know is the unary prefix
    FP> `+', and my intuition tells me that it might have been
    FP> dangerous to use it for marking delayed translation strings.
    FP> Using prefixes like i"STRING" or t"STRING" (for
    FP> "i"nternationalisable or "t"ranslatable) would require a
    FP> modification to Python.

Right.  A string-prefix character has another disadvantage: it sets a
bad precedent for an explosion of combinations of prefixes (i.e. we'd
now need rt'' strings, tr'' strings, utr'' strings, tru'' strings,
etc.).  So we agree that prefixes are out. :)

    FP> So, I came up with the simple idea to play a bit with the fact
    FP> that Python folds a succession of constant strings into a
    FP> single one at compilation time.  The idea is to prefix a
    FP> translatable string, when it is used outside the usual
    FP> `_(STRING)' idiom, by an empty string of the other kind, like
    FP> this:

    FP>  Example         Type         For extractor

    |    'TEXT'          1-quoted     not marked
    |    "TEXT"          2-quoted     not marked
    |    '''TEXT'''      3-quoted     not marked
    |    ''"TEXT"        4-quoted     marked
    |    ""'TEXT'        5-quoted     marked
    |    """TEXT"""      6-quoted     not marked
    |    ""'''TEXT'''    7-quoted     marked
    |    ''"""TEXT"""    8-quoted     marked

This has been brought up before, and I know that some people really
like this approach.  I don't though, because 1) it is too magical; 2)
the rules are arbitrary and hard to remember; 3) explicit is better
than implicit.
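
For reference, the mechanism behind the proposal can be sketched as
follows.  Python joins adjacent string literals when it compiles the
code, but the tokenize module still reports them as separate STRING
tokens, so an extractor could notice the empty prefix.
marked_strings() is a hypothetical helper, not part of pygettext.py,
and it ignores r/u prefixes for brevity.

```python
import ast
import io
import tokenize

EMPTY_LITERALS = ("''", '""')


def marked_strings(source):
    """Return strings prefixed by an empty literal of the other quote kind."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    found = []
    for prev, cur in zip(tokens, tokens[1:]):
        both_strings = prev.type == tokenize.STRING and cur.type == tokenize.STRING
        if both_strings and prev.string in EMPTY_LITERALS:
            # The marker counts only if the quote character differs.
            if cur.string[0] != prev.string[0]:
                found.append(ast.literal_eval(cur.string))
    return found


source = (
    "a = 'plain'\n"
    "b = ''\"Traditional Chinese\"\n"
    "c = \"\"'''Simplified Chinese'''\n"
    "d = \"\"\"not marked\"\"\"\n"
)

# At run time the empty prefix has vanished: the literals are already joined.
namespace = {}
exec(source, namespace)
```

Here marked_strings(source) picks out only the strings bound to b and
c, while namespace['b'] is just the plain text with no trace of the
marker.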

When a newbie looks at a bit of Python code that looks like

    _('Traditional Chinese')

and wonders what this does, he should immediately look for the
definition of the _() function.  Using his well-honed Python skills
he'll look for some def or import that brings this function name into
scope, and this should naturally lead to the purpose of the idiom.
E.g. he'll see "from gettext import gettext as _" or some such.

Seeing something like an unadorned ""'Traditional Chinese' really
gives no clue as to the purpose of this strange markup, so it would
have to either be something the reader of the code Just Got, or it
would have to be described in a comment, and that's simply
unfeasible.  I also claim that the rules are fairly arbitrary and will
be hard to explain and remember.  It's not something that's learned
once and then ingrained.

    FP> In fact, I think that even within a single module, some
    FP> docstrings should be considered translatable, while some other
    FP> docstrings should not be.

True.

    FP> Considering the choice has to be per whole module at a time,
    FP> is too gross.  This goes almost without saying.

I personally don't feel like it's that big a problem.  So far, in my
experience the only docstrings that really need to be extracted are
module docstrings in command line scripts.  I've found it not to be
that big a deal to also extract class or function docstrings in those
files, since it doesn't add that much of a burden to the translator.
But my personal preference has been to limit the docstrings in such
files to just the module docstring, and use comments instead of
docstrings for functions or classes.  Or, you can sometimes do
something ugly like use an explicit assignment:

    __doc__ = _('Here is a module docstring')

Not pretty, but also not common I think, so it doesn't concern me
much.  I could conceive of a convention where a leading comment before
a docstring could inhibit extraction of the following docstring, such
as:

    class Foo:
        # notranslate
        '''Here is a docstring that should not be extracted or translated.'''

One of two approaches could happen: either pygettext.py could ignore
the following docstring and not stick it in the PO file (but I forget
if tokenize gets to see comments or not), or pygettext.py could add a
#. notranslate comment to the entry telling translators to skip this
entry.
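
On the tokenize question: the tokenize module in the standard library
does report comments, as COMMENT tokens, so the first approach is
workable.  A minimal sketch, assuming the marker is spelled exactly
"# notranslate" and sits directly above the docstring;
suppressed_strings() is a hypothetical helper, not something
pygettext.py provides.

```python
import ast
import io
import tokenize

# Layout-only tokens that may sit between the comment and the docstring.
LAYOUT = {tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def suppressed_strings(source, marker="# notranslate"):
    """Return strings whose nearest preceding token is the marker comment."""
    suppressed = []
    previous = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue
        if (tok.type == tokenize.STRING
                and previous is not None
                and previous.type == tokenize.COMMENT
                and previous.string.strip() == marker):
            suppressed.append(ast.literal_eval(tok.string))
        previous = tok
    return suppressed


example = """\
class Foo:
    # notranslate
    '''Keep me out of the catalog.'''

class Bar:
    '''Extract me.'''
"""

skipped = suppressed_strings(example)
```

An extractor could consult such a list and either drop those entries
or emit them with a "#. notranslate" comment for the translator.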

    FP> Let me present the set of suggestions, in this message, as
    FP> having a minimal impact on Python, yet being pretty flexible
    FP> in what it would allow us to do.

I appreciate the suggestions, François!  I think what we've got gives
us the best approach for Python programs.

Cheers,
-Barry