[Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3

11 Apr 2003 16:26:59 -0400

On Fri, 2003-04-11 at 15:54, "Martin v. Löwis" wrote:
> Barry Warsaw wrote:
> 
> > - Set the default charset to iso-8859-1.  It used to be None, which
> > would cause problems with .ugettext() if the file had no charset
> > parameter.  Arguably, the po/mo file would be broken, but I still think
> > iso-8859-1 is a reasonable default.
> 
> I'm -1 here. Why do you think it is a reasonable default?
> 
> Errors should never pass silently.
> Unless explicitly silenced.
> 
> While iso-8859-1 might be a reasonable default in other application
> domains, in the context of non-English text (which it typically is),
> assuming Latin-1 is bound to create mojibake.

Okay, never mind, I'll back this one out.  The problem was caused by my
other patch to unicode-ify on read (see below) without first having a
charset.  I have a different fix for this.
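
To illustrate the objection above, here is a minimal sketch in present-day Python 3 (not the 2.3 code under discussion). The sample string is made up for illustration. It shows that a wrong Latin-1 guess never raises and just yields mojibake, while a strict decode with no usable charset fails loudly:

```python
# Illustration only; the sample string is not taken from any real catalog.
raw = "Löwis".encode("utf-8")      # bytes as a UTF-8 .mo file would store them

# Wrong guess: Latin-1 maps every byte to some character, so nothing fails.
guessed = raw.decode("iso-8859-1")
print(repr(guessed))               # corrupted text, but no exception

# No usable charset: a strict ASCII decode refuses bytes > 127 instead.
try:
    raw.decode("ascii")
    failed_loudly = False
except UnicodeDecodeError:
    failed_loudly = True
print(failed_loudly)
```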

> > - Add a "coerce" default argument to GNUTranslations's constructor.  The
> > reason for this is that in Zope, we want all msgids and msgstrs to be
> > Unicode.  For the latter, we could use .ugettext() but there isn't
> > currently a mechanism for Unicode-ifying msgids.
> 
> Could you please explain in what context this is needed? msgids are
> ASCII, and you can pass a Unicode string to ugettext just fine.

In Zope, all strings are Unicode and the catalog may include messages
that are extracted from places other than Python source code, e.g.
XML-based files.  Message ids can contain non-ASCII characters if they
are written by a non-English coder.  I think in that case, we'd want to
do something like encode the strings possibly with utf-8 for the .po/.mo
files, but we want them decoded in time to look the Unicode strings up
in the catalog.
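
A rough sketch of why the decoding step matters, written in present-day Python 3 rather than 2.3. The catalog contents and the decode_catalog helper are hypothetical, not part of gettext.py. A Unicode msgid cannot match a key that is still stored as encoded bytes, so both sides are decoded with the catalog's charset first:

```python
# Hypothetical raw catalog: msgid/msgstr bytes as stored in a UTF-8 .mo file.
raw_catalog = {"Größe".encode("utf-8"): "size".encode("utf-8")}


def decode_catalog(raw, charset):
    """Decode every msgid and msgstr with the catalog's declared charset."""
    return {k.decode(charset): v.decode(charset) for k, v in raw.items()}


catalog = decode_catalog(raw_catalog, "utf-8")

print("Größe" in raw_catalog)   # the Unicode msgid does not match the byte key
print(catalog.get("Größe"))     # after decoding, the lookup succeeds
```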

Similarly, what happens if a non-English coder writes an i18n'd Python
module with native strings, possibly using a Python 2.3 coding cookie?
We'd want their message ids to be extracted into the .mo/.po files,
right?

> > The plan then is that the charset parameter specifies the encoding for
> > both the msgids and msgstrs, and both are decoded to Unicode when read. 
> > For example, we might encode po files with utf-8. I think the GNU
> > gettext tools don't care.
> 
> They complain loudly if they find bytes > 127 in the msgid.

Really?  Ok, I'm still confused because I tried the following example:

I wrote a .po file (charset=utf-8) with the following record:

#: nofile:0
msgid "ab\xc3\x9e"
msgstr "\xc2\xa4yz"

I used standard msgfmt to turn that into a .mo file.  Then I created a
GNUTranslations(fp, coerce=True) and called

>>> t.ugettext(u'ab\xde')
u'\xa4yz'

This is what I should expect, right? ;)
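
For anyone who wants to try the same round trip today, here is a hedged sketch in Python 3. It is not the 2.3 patch: there is no coerce argument and no .ugettext() in Python 3, where GNUTranslations decodes both msgids and msgstrs with the catalog charset and .gettext() returns str. The build_mo helper is a hypothetical stand-in for msgfmt that writes just enough of the little-endian .mo layout for the stdlib parser:

```python
import gettext
import io
import struct


def build_mo(entries):
    """Hypothetical msgfmt stand-in: pack a bytes->bytes dict as a minimal .mo."""
    keys = sorted(entries)                  # the empty metadata msgid sorts first
    n = len(keys)
    offset = 28 + 16 * n                    # 7-word header + two index tables
    orig_index, trans_index, blob = [], [], b""
    for table, items in ((orig_index, keys),
                         (trans_index, [entries[k] for k in keys])):
        for s in items:
            table.append((len(s), offset + len(blob)))
            blob += s + b"\0"               # strings are NUL-terminated
    # magic, version, count, msgid table, msgstr table, hash size, hash offset
    header = struct.pack("<7I", 0x950412DE, 0, n, 28, 28 + 8 * n, 0, 0)
    index = b"".join(struct.pack("<2I", *p) for p in orig_index + trans_index)
    return header + index + blob


mo = build_mo({
    b"": b"Content-Type: text/plain; charset=utf-8\n",   # metadata record
    b"ab\xc3\x9e": b"\xc2\xa4yz",                        # the record from above
})
t = gettext.GNUTranslations(io.BytesIO(mo))
result = t.gettext("ab\xde")
print(repr(result))
```

Both the lookup key and the returned translation are Unicode strings here; the utf-8 bytes only exist inside the .mo data.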

> > - A few other minor changes from the Zope project, including asserting
> > that a zero-length msgid must have a Project-Id-Version header for it to
> > be counted as the metadata record.
> 
> That test was there, and removed on request of Bruno Haible, the GNU
> gettext maintainer, as he points out that Project-Id-Version is not
> mandatory for the metadata (see Patch #700839).

Ah, I read the diff backwards in this case.  I'll back this one out too.

-Barry