LANG, locale, unicode, setup.py and Debian packaging

Sun Jan 13 04:27:38 EST 2008

>  It seems to me that filenames are like snapshots of the locales where they 
> originated.

On Unix, yes. On Windows, NTFS and VFAT represent file names as Unicode
strings always, independent of locale. POSIX file names are byte
strings, and there isn't any good support for recording what their
encoding is.

> If there's a font file from India and I want to open it on my 
> system in South Africa (and I have LANG=C) then it seems that it's impossible 
> to do. If I access the filename it throws a UnicodeDecodeError. If I 
> use 'replace' or 'ignore' then I am mangling the filename and I won't be able 
> to open it.

Correct. Notice that there are two ways (currently) in Python to get a
directory listing: with a Unicode directory name, which returns Unicode
strings, and with a byte string directory name, which returns byte
strings. If you think you may have file names with mixed locales, and
the current locale might not match the file name's locale, you should
be using the byte string variant on Unix (which it seems you are already
doing).
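
To illustrate the two listing variants, here is a minimal sketch. It
assumes Python 3, where byte strings are the bytes type; the scratch
directory and the file name "example.ttf" are made up for the example.

```python
import os
import tempfile

# Hypothetical setup: a scratch directory holding one file with an ASCII name.
scratch = tempfile.mkdtemp()
with open(os.path.join(scratch, "example.ttf"), "w"):
    pass

# A str (Unicode) directory name yields str entries.
unicode_names = os.listdir(scratch)

# A bytes directory name yields the raw, undecoded bytes entries.
byte_names = os.listdir(os.fsencode(scratch))
```

The bytes entries can be passed straight back to open() or os.stat()
without ever being decoded.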

Then, if the locale's encoding cannot decode the file names, you have
several options:
a) don't try to interpret the file names as character strings, i.e.
   don't decode them. Not sure why you need the file names - if it's
   only to open the files, and never to present the file name to the
   user, not decoding them might be feasible
b) guess an encoding. For file names on Linux, UTF-8 is fairly common,
   so it might be a reasonable guess.
c) accept lossy decoding, i.e. decode with some encoding, and use
   "replace" as the error handler. You'll have to preserve the original
   file names along with the decoded versions if you later also want to
   operate on the original file.
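
The three options can be combined roughly as follows. This is only a
sketch under assumptions: Python 3, file names obtained as bytes, and a
hypothetical helper describe_name() whose list of guessed encodings is
my own choice, not something the platform provides.

```python
def describe_name(raw_name, guesses=("utf-8",)):
    """Return (raw_name, display_text) for a file name given as bytes.

    raw_name is kept untouched so the file can still be opened (option a);
    display_text is only meant for showing to the user.
    """
    for encoding in guesses:
        try:
            # Option (b): try each guessed encoding strictly.
            return raw_name, raw_name.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Option (c): lossy decoding; undecodable bytes become U+FFFD.
    return raw_name, raw_name.decode(guesses[0], "replace")


good = describe_name(b"caf\xc3\xa9.ttf")   # valid UTF-8
lossy = describe_name(b"caf\xe9.ttf")      # not valid UTF-8
```

Always open the file through the first element of the pair; the second
element may have lost information.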

>  My (admittedly uninformed) conception is that by forcing the app to always 
> use utf8 I can access any filename in any encoding.

That's not true. Try open("\xff","w"), then try interpreting the file
name as UTF-8. Some byte strings are not meaningful UTF-8, hence that
approach cannot work.
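
The failure can be reproduced without touching the file system. The
sketch assumes Python 3, where the file name from the example above is
the single byte b"\xff".

```python
raw_name = b"\xff"  # the one-byte file name from the example above

try:
    raw_name.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0xFF can never appear in well-formed UTF-8.
    decoded_ok = False
```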

You *can* interpret all file names as ISO-8859-1, since every byte
value decodes to some character in that encoding, but then some file
names will show up as mojibake.
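
A small sketch of both halves of that statement, assuming Python 3: the
ISO-8859-1 round trip never loses bytes, but a name that was really
UTF-8 is displayed wrongly.

```python
# Every possible byte value decodes as ISO-8859-1 and encodes back unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
round_trip = as_text.encode("iso-8859-1")

# A name that is really UTF-8 comes out garbled when read as ISO-8859-1.
utf8_name = "\u00e9".encode("utf-8")    # e with acute accent, two bytes in UTF-8
shown = utf8_name.decode("iso-8859-1")  # two unrelated characters
```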

> The problem seems to be 
> that I cannot know *what* encoding (and I get encode/decode mixed up still, 
> very new to it all) that particular filename is in.

That's correct, and there is no solution (not in Python, not in any
other programming language). You have to make trade-offs. For that,
you need to analyze precisely what your requirements are.

> I went through this exercise recently and had no joy. It seems the string I 
> chose to use simply would not render - even under 'ignore' and 'replace'. 

I don't understand what "would not render" means.

> It's really frustrating because I don't speak a non-ascii language and so 
> can't know if I am testing real-world strings or crazy Tolkien strings.

I guess your choices are to either give up, or learn.

> Another aspect of this is wxPython. My app requires the unicode build so that 
> strings have some hope of displaying on the widgets. If I access a font file 
> and fetch the family name - that can be encoded in any way, again unknown, 
> and I want to fetch it as 'unicode' and pass it to the widgets and not worry 
> about what's really going on. Given that, I thought I'd extend the 'utf8' 
> only concept to the app in general. I am sure I am wrong, but I feel cornered 
> at the moment.

Don't confuse "utf8 only" with "unicode only". Having all strings as
Unicode strings is a good thing. Assuming that all encoded text is
encoded in UTF-8 (which is but one encoding for Unicode) is likely
incorrect.

As for font files - I don't know what encoding the family is in, but
I would sure hope that the format specification of the font file format
would also specify what the encoding for the family name is, or that
there are at least established conventions.

>>> 3. I made the decision to check the locale and stop the app if the return
>>> from getlocale is (None,None).
>> I would avoid locale.getlocale. It's a pointless function (IMO).
> Could you say why?

It tries to emulate the C library, but does so incorrectly; this
is inherently unfixable because behavior of the C library can vary
across platforms, and Python can't possibly encode the behavior
of all C libraries in existence on all platforms. In particular,
it has a hard-coded list of what charsets are in use in what locale,
and that list necessarily must be incomplete and may be incorrect.

As a consequence, it will return None if it doesn't know better.
If all you want is the charset of the locale, use
locale.getpreferredencoding().
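
For example (a minimal sketch; the value you get depends entirely on
the user's environment, so no particular result is shown):

```python
import locale

# Ask for the character set of the user's locale.
encoding = locale.getpreferredencoding()
print(encoding)
```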

> gettext.install( domain, localedir, unicode = True )
> lang = gettext.translation(domain, localedir, languages = [loc] )

You could just leave out the languages parameter, and trust gettext
to find some message catalog.
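
A sketch of that, written for Python 3 (where the unicode parameter no
longer exists). The domain name and catalog directory are hypothetical,
and fallback=True is my addition so that a missing catalog yields the
untranslated strings instead of an exception.

```python
import gettext
import tempfile

domain = "exampleapp"           # hypothetical message domain
localedir = tempfile.mkdtemp()  # hypothetical catalog directory (empty here)

# With no languages argument, gettext consults the environment variables
# LANGUAGE, LC_ALL, LC_MESSAGES and LANG to pick a catalog.
translation = gettext.translation(domain, localedir, fallback=True)
_ = translation.gettext

greeting = _("Hello")  # untranslated, because no catalog exists here
```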

> So, I am using getlocale to get a tuple/list (easy, no?) to pass to the 
> gettext.install function.

Sure - but the parameter is optional.

>> Your program definitely, absolutely must work in the C locale. Of
>> course, you cannot have any non-ASCII characters in that locale, so
>> deal with it.
> This would mean cutting-out a percentage of the external font files that can 
> be used by the app. 

See above. There are other trade-offs you could make. Alternatively,
you could require that the program finds a richer locale, and bail out
if the locale is just "C".
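
One way to sketch such a check is to test whether the locale's charset
is plain ASCII. The helper name is_ascii_only() is hypothetical, and
whether the "C" locale really reports ASCII depends on the platform and
the Python version.

```python
import codecs


def is_ascii_only(encoding_name):
    """Return True if the named encoding is an alias of plain ASCII."""
    return codecs.lookup(encoding_name).name == "ascii"


# Typical use (not run here):
#   import locale, sys
#   if is_ascii_only(locale.getpreferredencoding()):
#       sys.exit("Please run this program under a richer locale.")
```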

> Is there no modern standard regarding the LANG variable 
> and locales these days? My locale -a reports a bunch of xx_XX.utf8 locales. 
> Does it even make sense to use a non-utf8 locale anymore?

It's not your choice, but the user's. People still use non-UTF-8 locales
heavily, and likely will continue to do so for at least 10 more years.

> Oh boy, this gives me cold chills. I don't have the resources to start 
> worrying about every single language's edge-cases. This is kind of why I was 
> leaning towards a "use a utf8 locale please" approach.

That doesn't help. For Turkish in particular, the UTF-8 locale is worse
than the ISO-8859-9 locale: the Turkish lowercase of "I" is the dotless
i (U+0131), which takes two bytes in UTF-8, so the byte-oriented tolower
can't really work in the UTF-8 locale (but can in the ISO-8859-9
locale, where it is a single byte).
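
The byte counts are easy to check (a sketch assuming Python 3 and its
standard codecs):

```python
dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I, Turkish lowercase of "I"

utf8_len = len(dotless_i.encode("utf-8"))         # multi-byte in UTF-8
latin5_len = len(dotless_i.encode("iso-8859-9"))  # single byte in ISO-8859-9
```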

Regards,
Martin