[I18n-sig] Passing unicode strings to file system calls

Martin v. Loewis martin@v.loewis.de
17 Jul 2002 23:13:01 +0200

"M.-A. Lemburg" <mal@lemburg.com> writes:

> > That is broken beyond repair, and should not be used for anything. It
> > can't possibly work.
> Hmm, why is that ?

It tries to find out locale information from environment
variables. That is bound to fail because:

- it may not know what variables to consider. In particular, on Unix,
  it tries LANGUAGE, LC_ALL, LC_CTYPE, and LANG. In doing so, it makes
  a number of errors when trying to find the encoding:

  - if LANGUAGE is set, it is used to determine the encoding. This is
    incorrect; LANGUAGE cannot be used for that. For example, with
    LANGUAGE=german LANG=de_DE.UTF-8, it returns
    ['de_DE', 'ISO8859-1']
    This is incorrect; the encoding should have been UTF-8.

  - it misses that, on GNU/Linux, LANGUAGE can contain colons to
    denote fallbacks; with
    LANGUAGE=german:french LANG=de_DE.UTF-8, it returns
    ['de_DE', 'french']
    This is even worse: french is not the name of an encoding.

- it may not know the syntax of the environment variables. For
  example, the current implementation breaks for "de_DE@euro"; there
  is an SF bug report about this.

- it may not know the encoding associated with a locale. For example,
  for de_DE@euro, it is Latin-9 on Linux today, but might be UTF-8 on
  some other system. Likewise, locale.py just *knows* that de_DE means
  ".iso-8859-1" on any system - that can easily be wrong.

- the language name returned by getdefaultlocale is incorrect on
  Windows, see


  Users apparently expect that they can pass the result of
  getdefaultlocale to setlocale, but this is not the case.
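
To make the syntax issues above concrete, here is a minimal sketch of
how the variables are structured. The helper names are hypothetical
and not part of locale.py. It assumes the usual POSIX form
language[_territory][.codeset][@modifier] and the usual precedence
LC_ALL, LC_CTYPE, LANG for the character type category, with LANGUAGE
ignored for encoding purposes. Note that it deliberately returns None
when no codeset is spelled out, since guessing one from the locale
name is exactly what cannot be done reliably.

```python
import re

# Hypothetical helpers for illustration only; not part of locale.py.
# Assumed syntax: language[_territory][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[^_.@]+)"
    r"(?:_(?P<territory>[^.@]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_locale_name(name):
    """Split a locale name into its four optional components."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError("unrecognized locale name: %r" % (name,))
    return match.groupdict()


def split_language_list(value):
    """Split a GNU-style LANGUAGE value into its colon-separated fallbacks."""
    return [part for part in value.split(":") if part]


def explicit_codeset(environ):
    """Return the codeset spelled out in the environment, or None.

    Only LC_ALL, LC_CTYPE and LANG are consulted, in that order;
    LANGUAGE is never used. Empty values are treated as unset.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(variable)
        if value:
            return split_locale_name(value)["codeset"]
    return None
```

With LANGUAGE=german:french LANG=de_DE.UTF-8, explicit_codeset() looks
only at LANG and yields "UTF-8"; for de_DE@euro it yields None rather
than a guess.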

> There's a large database in locale.py for this and a few
> support APIs which make use of it.

That is the major problem. This database is incorrect, cannot be
corrected, and is both unmaintained and unmaintainable.

> It would probably be worthwhile to add an interface
> encoding(localename) which only returns the encoding used per
> default for that locale.

I would call this getencoding(), and document that you need to call
setlocale first, in order to pick up the user's settings. The official way,
on Unix, to obtain the locale's encoding is to use
nl_langinfo(CODESET), which only works if the LC_CTYPE facet has been
set. On Windows, locale._getdefaultlocale fortunately already returns
the current codeset (which isn't influenced by setlocale, anyway).
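
A rough sketch of what such a function might look like follows. The
name get_locale_encoding is hypothetical, and the fallback to
locale.getpreferredencoding() for platforms without nl_langinfo is my
assumption about a reasonably recent Python, not something that
exists in locale.py as discussed here. Note that setlocale changes
process-wide state, so a library should probably leave that call to
the application.

```python
import locale


def get_locale_encoding():
    """Return the encoding of the user's locale (illustrative sketch)."""
    try:
        # Activate the user's LC_CTYPE settings; without this,
        # nl_langinfo(CODESET) reports the codeset of the "C" locale.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale the system does not support;
        # carry on with whatever locale is currently active.
        pass

    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        # Unix: ask the C library instead of parsing environment variables.
        return locale.nl_langinfo(locale.CODESET)

    # Assumption: on platforms without nl_langinfo (e.g. Windows), fall
    # back to what Python itself reports, without calling setlocale again.
    return locale.getpreferredencoding(False)
```

The point of the design is that the answer comes from the C library,
so no database of locale names has to be maintained in Python.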