[Patches] [ python-Patches-568669 ] gettext module charset changes

Thu, 13 Jun 2002 13:55:05 -0700

Patches item #568669, was opened at 2002-06-13 22:13
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=568669&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Open
>Resolution: Rejected
Priority: 5
Submitted By: Barry A. Warsaw (bwarsaw)
>Assigned to: Barry A. Warsaw (bwarsaw)
Summary: gettext module charset changes

Initial Comment:
The GNU gettext docs make two recommendations: that the
source string to gettext() be in us-ascii, and that the
default output charset be in the locale's character
set.  I think the latter makes the most sense for our
ugettext() methods.

The attached patch sets the default character set to
us-ascii for NullTranslations.  For GNUTranslations,
the default character set is taken from the
Content-Type: header if given in the .po/.mo file,
otherwise it's taken from the default locale
information, if available.  If neither is available, it
falls back to the base class charset (us-ascii by default).
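The lookup order described above could be sketched roughly as follows. This is not the code from the attached patch; the names pick_charset and charset_from_content_type, and the idea of passing the catalog headers in as a dict with lowercased keys, are assumptions made only for illustration.

```python
def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    # Example value: "text/plain; charset=UTF-8"
    for param in value.split(";")[1:]:
        name, sep, val = param.partition("=")
        if sep and name.strip().lower() == "charset":
            return val.strip().strip('"') or None
    return None


def pick_charset(headers, locale_charset=None, base_charset="us-ascii"):
    """Choose a charset: catalog header, then locale, then base default.

    `headers` is assumed to be a dict of catalog metadata with
    lowercased keys; `locale_charset` is whatever the caller could
    learn from the locale (None if nothing was available).
    """
    content_type = headers.get("content-type")
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    if locale_charset:
        return locale_charset
    return base_charset
```

With a catalog header of "text/plain; charset=UTF-8" the header wins; with no header the locale value is used; with neither, the base default is returned.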

This patch also provides the following:

- add a set_charset() method to the NullTranslations
base class, so that it is easier to change the default
character set.  For symmetry, I also rename charset()
to get_charset() and keep the former for backwards
compatibility.

- convert Lib/test/test_gettext.py to unittest style
(sans the cvs rm of Lib/test/output/test_gettext which
we'll do separately)

- update the docs for all the code changes described above.
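
As a rough picture of the accessor change in the first item, here is a minimal sketch. The class name CharsetSketch is hypothetical and stands in for only the charset-related part of the base class; it is not the patch itself.

```python
class CharsetSketch:
    """Hypothetical stand-in for the charset handling of the base class."""

    def __init__(self, charset="us-ascii"):
        # Default proposed in the patch description for the base class.
        self._charset = charset

    def get_charset(self):
        """Return the current default character set."""
        return self._charset

    def set_charset(self, charset):
        """Change the default character set."""
        self._charset = charset

    # Old spelling kept as an alias for backwards compatibility.
    charset = get_charset
```

Existing callers of charset() keep working, while new code can use the get_charset()/set_charset() pair.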

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-06-13 22:53

Message:

Obtaining the locale's codeset by parsing environment
variables is bogus. For example, in most installations, the
codeset for de_DE@euro is iso-8859-15. However, this is
impossible to find out by just parsing the environment
variables.

Instead, the proper way is to use
locale._nl_langinfo(CODESET) where available. If that is not
available, the following heuristics could be applied:
- On Windows, it is "mbcs"
- On Unix, parse the environment variables
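
A minimal sketch of that fallback chain follows. In current Python the call is spelled locale.nl_langinfo(locale.CODESET); the helper names guess_codeset and codeset_from_environ are assumptions for illustration, and the caller is assumed to have already run locale.setlocale(locale.LC_ALL, "") if the user's locale should be honoured.

```python
import locale
import os
import sys


def codeset_from_environ(environ):
    """Last-resort heuristic: read the codeset out of LC_ALL/LC_CTYPE/LANG.

    Values look like language[_territory][.codeset][@modifier]. A value
    such as "de_DE@euro" carries no codeset, so None is returned; that
    is exactly the case this heuristic cannot resolve.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(var)
        if not value:
            continue
        _, dot, rest = value.partition(".")
        if dot:
            return rest.partition("@")[0] or None
        return None
    return None


def guess_codeset(environ=None):
    """Prefer nl_langinfo(CODESET); fall back to platform heuristics."""
    environ = os.environ if environ is None else environ
    nl_langinfo = getattr(locale, "nl_langinfo", None)
    codeset_item = getattr(locale, "CODESET", None)
    if nl_langinfo is not None and codeset_item is not None:
        codeset = nl_langinfo(codeset_item)
        if codeset:
            return codeset
    if sys.platform.startswith("win"):
        return "mbcs"
    return codeset_from_environ(environ)
```

Note that codeset_from_environ({"LANG": "de_DE@euro"}) yields None, which illustrates why environment parsing is only a last resort.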

As for the actual usage of the charset, I think you
misinterpret the gettext recommendation: the result of
gettext ought to be in the locale's encoding (this is not a
default encoding). This means that, if the codeset of the
locale and the charset of the catalog differ, character set
conversion needs to be invoked; I can see no traces of that
happening in your patch. 

The common case is a catalog in UTF-8, and the user's
codeset is language-specific (such as Latin-9). In that
case, conversion works well. There is also the case of
unsupported conversions (e.g. usage of EURO SIGN in the
catalog, but Latin-1 in the locale); in this case, glibc
iconv uses transliteration (to "EUR", in the example). Since
we have no transliteration, we would probably fall back to
returning the string in the catalog's encoding :-(
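
The conversion step and its fallback might look like this. It is only a sketch of the behaviour described in this comment, written against Python's built-in codecs; the function name convert_message is hypothetical.

```python
def convert_message(raw, catalog_charset, locale_charset):
    """Re-encode a catalog message from the catalog charset to the locale's.

    If the locale's codeset cannot represent the text, there is no
    transliteration to fall back on, so the original bytes (still in
    the catalog's encoding) are returned unchanged.
    """
    text = raw.decode(catalog_charset)
    try:
        return text.encode(locale_charset)
    except UnicodeEncodeError:
        return raw


# A UTF-8 catalog entry containing EURO SIGN (U+20AC).
price = "10 \u20ac".encode("utf-8")

# Latin-9 (ISO-8859-15) has EURO SIGN, so conversion succeeds.
latin9 = convert_message(price, "utf-8", "iso-8859-15")

# Latin-1 (ISO-8859-1) does not, so the UTF-8 bytes come back as-is.
latin1 = convert_message(price, "utf-8", "iso-8859-1")
```

Here latin9 is the single-byte Latin-9 form, while latin1 is identical to the UTF-8 input, which is the unfortunate fallback mentioned above.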

----------------------------------------------------------------------
