[I18n-sig] Passing unicode strings to file system calls

M.-A. Lemburg mal@lemburg.com
Thu, 18 Jul 2002 14:53:41 +0200

Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
>>>- it may not know what variables to consider. In particular, on Unix,
>>>  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>>>  a number of errors when trying to find the encoding:
>>That's the search order which GNU readline uses (at least
>>at the time I wrote the code).
> GNU readline does not check LANGUAGE, and it uses setlocale if
> available (so you are talking about rarely-used fallback code).

See the gettext man page:

        If the LANGUAGE environment variable is set to a nonempty value, and
        the locale is not the "C" locale, the value of LANGUAGE is assumed to
        contain a colon separated list of locale names. The functions will
        attempt to look up a translation of msgid in each of the locales in
        turn. This is a GNU extension.

>>>  - it misses that LANGUAGE can contain colons to denote
>>>    fallbacks, on GNU/Linux; with
>>>    LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>>>    ['de_DE', 'french']
>>>    This is even worse: french is not the name of an encoding
>>Interesting. Is the format documented somewhere ? It should be
>>easy to fix this.
> Of LANGUAGE? I believe it's documented in the gettext documentation.

Yes. It looks as if parsing LANGUAGE is the wrong thing
to do if you're looking for the default locale (i.e. the one
which is used at process startup time before any calls
to setlocale()).
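
To illustrate the distinction, here is a minimal sketch that keeps the
two lookups apart. It assumes the usual POSIX precedence (LC_ALL, then
LC_CTYPE, then LANG) for the default locale and treats LANGUAGE purely
as a list of message languages. The function names are made up for
this example and are not part of locale.py.

```python
import os

# POSIX precedence for the character-type category: LC_ALL overrides
# LC_CTYPE, which overrides LANG. LANGUAGE is deliberately not consulted.
_CTYPE_VARIABLES = ("LC_ALL", "LC_CTYPE", "LANG")


def default_locale_name(environ=os.environ):
    """Return the first non-empty locale variable, or "C" if none is set."""
    for name in _CTYPE_VARIABLES:
        value = environ.get(name)
        if value:  # unset and empty values are treated alike
            return value
    return "C"


def message_languages(environ=os.environ):
    """Split LANGUAGE into its colon-separated list of locale names."""
    return [item for item in environ.get("LANGUAGE", "").split(":") if item]


env = {"LANGUAGE": "german:french", "LANG": "de_DE.UTF-8"}
print(default_locale_name(env))   # de_DE.UTF-8
print(message_languages(env))     # ['german', 'french']
```

With the environment from the bug report, "french" only ever shows up
in the list of message languages and never in the default locale.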

>>>- it may not know the syntax of the environment variables. For
>>>  example, the current implementation breaks for "de_DE@euro"; this is
>>>  an SF bug report.
>>This should be fixable too. What does the '@euro' mean ? Does it
>>have to do with currency ?
> In a way. It is a "locale variant". A variant could be just about
> anything. Common variants are @euro (used to denote the variant that
> has the Euro for LC_MONETARY), @nynorsk (used to tell apart the two
> Norwegian languages - now nb and no), and @xim, used for X Input
> Methods (like @xim=kinput2). It could be used for many other things,
> too.
> You can fix the parsing of the variants, but you cannot infer the
> encoding.

Why not ? I know that several locales use more than one
encoding for their script(s), but having at least a hint
is better than no information at all.

Of course, if the system provides different means of
accessing this information, then those means should be
used instead.
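
As an example of such a platform-specific means, the sketch below asks
the C library for the codeset via nl_langinfo(CODESET). It assumes a
Python whose locale module exposes nl_langinfo and CODESET, which is
not the case on every platform, so it returns None where they are
missing. The helper name is invented for this example.

```python
import locale


def codeset_from_platform():
    """Ask the C library for the encoding of the user's locale, or None."""
    if not hasattr(locale, "nl_langinfo") or not hasattr(locale, "CODESET"):
        return None  # not available on this platform (e.g. Windows)
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the environment's locale
        return locale.nl_langinfo(locale.CODESET) or None
    except locale.Error:
        return None  # the environment names a locale that is not installed
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # leave the process as found


print(codeset_from_platform())
```

The result depends entirely on the system it runs on; a caller would
fall back to guessing from the locale name only when this returns None.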

>>Sure, but you normally only get the locale name and then
>>have to make an educated guess for the encoding. 
> That is my point: This algorithm must guess, and it *will* guess
> wrong.

I've never said that it will always guess right. AFAIK,
there is no platform independent solution to the problem.
I am all for adding more support for platform specific
solutions, though.

>>If the encoding is known (e.g. by looking at the LANG environment
>>variable), then that information should override the database
> In this specific case (of the @euro domains), the LANG variable does
> not explicitly mention the encoding. So that doesn't help.

It can be used as a hint, e.g. in Germany we use Latin-1 as the
encoding, so that's a good assumption.
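
A rough sketch of what "using the name as a hint" could look like: the
variant is split off first, an explicit codeset always wins, and only
then is a table consulted. The one-entry hint table and both function
names are hypothetical, chosen for illustration; the second return
value marks the result as a guess, which may be wrong on a given
system.

```python
def split_locale_name(name):
    """Split 'lang_COUNTRY.codeset@modifier' into (code, codeset, modifier)."""
    name, _, modifier = name.partition("@")
    code, _, codeset = name.partition(".")
    return code, codeset or None, modifier or None


# Hypothetical hint table for illustration only; real data would be larger.
_ENCODING_HINTS = {"de_DE": "ISO8859-1"}


def guess_encoding(name):
    """Return (encoding, is_guess); encoding is None if nothing is known."""
    code, codeset, _modifier = split_locale_name(name)
    if codeset:
        return codeset, False  # stated explicitly, no guessing needed
    return _ENCODING_HINTS.get(code), True


print(split_locale_name("de_DE@euro"))     # ('de_DE', None, 'euro')
print(guess_encoding("de_DE@euro"))        # ('ISO8859-1', True)
print(guess_encoding("de_DE.UTF-8@euro"))  # ('UTF-8', False)
```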

>>Hmm, the names returned by getdefaultlocale() and normalize()
>>are standards. I wonder what Windows expects to see for
> What standards? Posix? That has never impressed Microsoft. Instead of
> "fr_FR.cp1252", they accept "French_France.1252". That may even be
> Posix-conforming, though, which allows "<lang>_<country>.<codeset>".
> Locale names are *not* standard. An algorithm that assumes that they
> are is broken.

I didn't say that locale names are always standard. To the contrary:
I added the normalize() API to locale.py to map some of the
commonly used non-standard locale names to standards-compliant
ones (ISO 639 language code + <underscore> + ISO 3166
country code).
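
For reference, a short usage sketch of locale.normalize(). The input
names are just examples, and the exact strings returned depend on the
alias table shipped with the Python version in use, so no particular
output is promised here.

```python
import locale

# Map a few commonly seen spellings to their normalized form.
for name in ("german", "de", "de_DE"):
    print(name, "->", locale.normalize(name))
```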

>>I'd say, it's better than nothing :-)
> Yes, that's why I propose to provide a replacement, and then deprecate
> the existing function.

Why a replacement, and what kind of replacement ? It should certainly
be possible to add more support to the existing APIs and
perhaps extend them with new ones.

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/