[Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3

22 Apr 2003 15:53:25 -0400

On Wed, 2003-04-16 at 18:07, "Martin v. Löwis" wrote:

> > So why isn't the English/US-ASCII bias for msgids considered a liability
> > for gettext?  Do non-English programmers not want to use native literals
> > in their source code?
> 
> Using English for msgids is about the only way to get translation. 
> Finding a Turkish speaker who can translate from Spanish is 
> *significantly* more difficult than starting from English; if you were
> starting from, say, Chinese, going to Hebrew might just be impossible.
> 
> So any programmer who seriously wants to have his software translated 
> will put English texts into the source code. Non-English literals are 
> only used if l10n is not an issue.

That's probably true.  I'm just not sure Zope wants to make that a
requirement.

> > BTW, I believe that if all your msgids /are/ us-ascii, you should be
> > able to ignore this change and have it work backwards compatibly.
> 
> "This" change being addition of the "coerce" argument? If you think
> you will need it, we can leave it in.

Actually, thinking about this more, we probably don't even need the
coerce flag.  If all your msgids are us-ascii, you don't care whether
they've been coerced to Unicode or not because they'll still compare
equal.

So I propose to remove the coerce flag, but still Unicode-ify both
msgids and msgstrs.  Then .ugettext() will just return the Unicode
msgstr in the catalog, while .gettext() will encode it to an 8-bit
string based on the charset.  Personally, I think most i18n Python apps
are going to want to use .ugettext() anyway, so for the average program
this will just work as expected.

I have the tests passing for this change.  Any objections?
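
To make the proposal concrete, here is a minimal sketch of the behavior
described above.  It is not the actual gettext.py code: ToyCatalog and
its constructor arguments are hypothetical, and it is written in Python 3
syntax, where str is the Unicode type and bytes is the 8-bit string type.

```python
class ToyCatalog:
    """Hypothetical stand-in for a translation catalog (not gettext.py)."""

    def __init__(self, raw_entries, charset):
        # raw_entries maps encoded msgid bytes to encoded msgstr bytes,
        # roughly as they would come out of a compiled catalog.
        self.charset = charset
        # Unicode-ify both msgids and msgstrs using the catalog's charset.
        self._catalog = {
            msgid.decode(charset): msgstr.decode(charset)
            for msgid, msgstr in raw_entries.items()
        }

    def ugettext(self, message):
        # Return the Unicode msgstr, or the message itself if untranslated.
        return self._catalog.get(message, message)

    def gettext(self, message):
        # Encode the Unicode msgstr to an 8-bit string using the charset.
        return self.ugettext(message).encode(self.charset)


catalog = ToyCatalog({b"Hello": "Grüß Gott".encode("latin-1")}, "latin-1")
unicode_result = catalog.ugettext("Hello")   # Unicode msgstr
encoded_result = catalog.gettext("Hello")    # same text, latin-1 encoded
```

Because a us-ascii msgid decodes to the same characters under just about
any charset, the lookup key is the same whether or not it was coerced,
which is why the coerce flag looks unnecessary.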

> >>If the msgids are UTF-8, with non-ASCII characters C-escaped,
> >>translators will *still* put non-UTF-8 encodings into the catalogs.
> >>This will then be a problem: The catalog encoding won't be UTF-8,
> >>and you can't process the msgids.
> > 
> > Isn't this just another validation step to run on the .po files?  There
> > are already several ways translators can (and do!) make mistakes, so we
> > already have to validate the files anyway.
> 
> I'm not sure how exactly a validation step would be executed. Would that
> step simply verify that the encoding of a catalog is UTF-8? That 
> validation step would fail for catalogs that legally use other charsets.

The validation step would make sure that all the msgids and msgstrs
could be decoded using the encoding claimed in the headers.  If msgids
are us-ascii then (just about) any other encoding for msgstrs should
work just fine.  If there are non-ascii characters in both msgids and
msgstrs, then some common encoding would have to be used (what other than
utf-8?).  It's a choice left up to the application and its translators.
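
As a rough sketch of such a validation step (in Python 3 syntax): it
assumes the catalog has already been parsed into (msgid, msgstr) pairs of
raw bytes and that the charset has been read from the header.  The
function name and the sample entries are made up for illustration.

```python
def undecodable_entries(entries, charset):
    """Return the (msgid, msgstr) pairs that cannot be decoded with charset."""
    bad = []
    for msgid, msgstr in entries:
        try:
            msgid.decode(charset)
            msgstr.decode(charset)
        except UnicodeDecodeError:
            # The bytes do not match the encoding claimed in the header.
            bad.append((msgid, msgstr))
    return bad


entries = [
    (b"Hello", "Grüß Gott".encode("utf-8")),    # matches a utf-8 header
    (b"Goodbye", "Tschüß".encode("latin-1")),   # translator used latin-1
]
problems = undecodable_entries(entries, "utf-8")
```

With a header claiming utf-8, only the second entry is reported.  The
same entry passes if the header claims latin-1, since its msgid is
us-ascii.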

-Barry