[I18n-sig] Passing unicode strings to file system calls

M.-A. Lemburg mal@lemburg.com
Wed, 17 Jul 2002 23:30:51 +0200


Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> 
>>>That is broken beyond repair, and should not be used for anything. It
>>>can't possibly work.
>>
>>Hmm, why is that ?
> 
> 
> It tries to find out locale information from environment
> variables. That is bound to fail because:
> 
> - it may not know what variables to consider. In particular, on Unix,
>   it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
>   a number of errors when trying to find the encoding:

That's the search order which GNU readline uses (at least
at the time I wrote the code).

>   - if LANGUAGE is set, it is used to determine the encoding. This is
>     incorrect; LANGUAGE cannot be used for that. For example, with
>     LANGUAGE=german LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'ISO8859-1']
>     This is incorrect; the encoding should have been UTF-8
> 
>   - it misses that LANGUAGE can contain colons to denote
>     fallbacks, on GNU/Linux; with
>     LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
>     ['de_DE', 'french']
>     This is even worse: french is not the name of an encoding

Interesting. Is the format documented somewhere ? It should be
easy to fix this.
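
A rough sketch of such a fix, assuming the GNU gettext convention that
LANGUAGE is a colon-separated priority list used only to select message
catalogs, and the POSIX precedence LC_ALL, then LC_CTYPE, then LANG for
the character set. The helper names are hypothetical and are not part
of locale.py:

```python
def split_language(value):
    """Split a GNU-style LANGUAGE value into its list of fallbacks."""
    return [item for item in value.split(":") if item]


def encoding_from_env(env):
    """Return the codeset named in LC_ALL, LC_CTYPE or LANG, else None.

    LANGUAGE is deliberately ignored here: it carries no encoding
    information.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue
        # Assumed syntax: language[_territory][.codeset][@modifier]
        rest = value.split("@", 1)[0]
        if "." in rest:
            return rest.split(".", 1)[1]
        # The first variable that is set wins, even without a codeset.
        return None
    return None


env = {"LANGUAGE": "german:french", "LANG": "de_DE.UTF-8"}
print(split_language(env["LANGUAGE"]))  # ['german', 'french']
print(encoding_from_env(env))           # UTF-8
```

With the two examples quoted above, this would report UTF-8 in both
cases and would never mistake "french" for an encoding.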

> - it may not know the syntax of the environment variables. For
>   example, the current implementation breaks for "de_DE@euro"; this is
>   an SF bug report.

This should be fixable too. What does the '@euro' mean ? Does it
have to do with currency ?
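
For illustration, a minimal parser sketch, assuming the common
language[_territory][.codeset][@modifier] layout of locale names.
parse_locale_name is a hypothetical helper, not an existing locale.py
function:

```python
def parse_locale_name(name):
    """Split a locale name into (language, territory, codeset, modifier).

    Parts that are absent are returned as None.
    """
    modifier = None
    if "@" in name:
        # The modifier comes last, so strip it before anything else.
        name, modifier = name.split("@", 1)
    codeset = None
    if "." in name:
        name, codeset = name.split(".", 1)
    territory = None
    if "_" in name:
        name, territory = name.split("_", 1)
    return name, territory, codeset, modifier


print(parse_locale_name("de_DE@euro"))  # ('de', 'DE', None, 'euro')
```

Splitting off the modifier first keeps "@euro" from ending up in the
territory or codeset field.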

> - it may not know the encoding associated with a locale. For example,
>   for de_DE@euro, it is Latin-9 on Linux today, but might be UTF-8 on
>   some other system. Likewise, locale.py just *knows* that de_DE means
>   ".iso-8859-1" on any system - that can easily be wrong.

Sure, but you normally only get the locale name and then
have to make an educated guess for the encoding. If the
encoding is known (e.g. by looking at the LANG environment
variable), then that information should override the
database information.
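
A sketch of that override rule, under the assumption that the locale
name may carry an explicit ".codeset" part. FALLBACK_ENCODINGS is a
tiny illustrative stand-in for the locale.py database; its entries are
guesses of exactly the kind discussed above and may be wrong on a
given system:

```python
# Illustrative stand-in for the database; not the real locale.py table.
FALLBACK_ENCODINGS = {
    "de_DE": "ISO8859-1",
    "de_DE@euro": "ISO8859-15",
}


def guess_encoding(locale_name, fallback=FALLBACK_ENCODINGS):
    """Prefer an explicit codeset; fall back to the table only if none."""
    name, _, modifier = locale_name.partition("@")
    base, dot, codeset = name.partition(".")
    if dot and codeset:
        # Explicit information from the environment wins over the table.
        return codeset
    key = base + ("@" + modifier if modifier else "")
    return fallback.get(key)  # None when the table has no entry


print(guess_encoding("de_DE.UTF-8"))  # UTF-8
print(guess_encoding("de_DE"))        # ISO8859-1
```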

> - the language name returned from getdefaultlocale is incorrect on
>   Windows, see
> 
>   http://groups.google.com/groups?selm=917pjb%24ii2%241%40reader1.imaginet.fr
> 
>   Users apparently expect that they can pass the result of
>   getdefaultlocale to setlocale, but this is not the case.

Hmm, the names returned by getdefaultlocale() and normalize()
are standards. I wonder what Windows expects to see for
setlocale().

>>There's a large database in locale.py for this and a few
>>support APIs which make use of it.
> 
> 
> That is the major problem. This database is incorrect, cannot be
> corrected, and is both unmaintained and unmaintainable.

I'd say, it's better than nothing :-)

>>It would probably be worthwhile to add an interface
>>encoding(localename) which only returns the encoding used per
>>default for that locale.
> 
> 
> I would make this getencoding(), and document that you need to call
> setlocale before, to make use of the user settings. The official way,
> on Unix, to obtain the locale's encoding is to use
> nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been
> set. On Windows, locale._getdefaultlocale fortunately already returns
> the current codeset (which isn't influenced by setlocale, anyway).

Fine.
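
A minimal sketch of the Unix side of that approach, using only
locale.setlocale() and locale.nl_langinfo(CODESET). The name
get_locale_encoding is hypothetical, and the Windows branch is left
out, so the function simply returns None where nl_langinfo is not
available:

```python
import locale


def get_locale_encoding():
    """Return the codeset of the user's locale, or None if unknown.

    Note the side effect: this switches the process-wide LC_CTYPE
    facet to the user's settings.
    """
    try:
        # Without this call the process stays in the "C" locale.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed.
        return None
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        return locale.nl_langinfo(locale.CODESET)
    return None


print(get_locale_encoding())
```

The result depends on the C library and on the user's environment, so
no particular output is implied here.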

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/