LANG, locale, unicode, setup.py and Debian packaging
donn.ingle at gmail.com
Sun Jan 13 08:30:07 CET 2008
I really appreciate your reply. I have been working in a vacuum on this and
without any experience. I hope you don't mind if I ask you a bunch of
questions. If I can get over some conceptual 'humps' then I'm sure I can
produce a better app.
> That's a bug in the app. It shouldn't assume that environment variables
> are UTF-8. Instead, it should assume that they are in the locale's
> encoding, and compute that encoding with locale.getpreferredencoding.
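A minimal sketch of that advice, assuming Python 3 and that the raw bytes of a variable are available (on POSIX, os.environb exposes them). The helper name decode_env_bytes is hypothetical. It falls back to locale.getpreferredencoding() when no encoding is given, and uses the "replace" handler so that a stray byte cannot raise.

```python
import locale


def decode_env_bytes(raw, encoding=None):
    """Decode the raw bytes of an environment variable to text.

    If no encoding is given, use the one the current locale prefers
    instead of assuming UTF-8. Undecodable bytes become U+FFFD.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, "replace")
```

For example, decode_env_bytes(os.environb[b"HOME"]) would decode the value with whatever the user's locale says, not with a hard-coded "utf-8".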
I see what you are saying and agree, and I am confused about files and
filenames. My app has to handle font files which can come from anywhere. If
the locale (locale.getpreferredencoding) returns something like "ANSI" and I
am doing an os.listdir() then I lose the plot a little...
It seems to me that filenames are like snapshots of the locales where they
originated. If there's a font file from India and I want to open it on my
system in South Africa (and I have LANG=C) then it seems that it's impossible
to do. If I access the filename it throws a UnicodeDecodeError. If I
use 'replace' or 'ignore' then I am mangling the filename and I won't be able
to open it.
The same goes for adding 'foreign' filenames to paths with any kind of string
My (admittedly uninformed) conception is that by forcing the app to always
use utf8 I can access any filename in any encoding. The problem seems to be
that I cannot know *what* encoding (and I get encode/decode mixed up still,
very new to it all) that particular filename is in.
Am I right? Wrong? Deluded? :) Please fill me in.
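One way around the "which encoding is it?" problem, sketched here assuming Python 3 on a POSIX filesystem: never decode the name that is used to open the file. Passing bytes to os.listdir() returns the names as raw bytes, and those can always be handed back to open(). A decoded copy made with "replace" is kept only for display. The function name list_font_entries is made up for illustration.

```python
import locale
import os


def list_font_entries(directory):
    """Return (raw_path, display_name) pairs for each entry in directory.

    raw_path is bytes and is safe to pass to open(); display_name is a
    possibly lossy str meant only for showing to the user.
    """
    encoding = locale.getpreferredencoding(False)
    raw_dir = os.fsencode(directory)
    entries = []
    for raw_name in sorted(os.listdir(raw_dir)):  # bytes in, bytes out
        display_name = raw_name.decode(encoding, "replace")
        entries.append((os.path.join(raw_dir, raw_name), display_name))
    return entries
```

The point of the design is that a mangled display_name does no harm, because the file is always opened through raw_path.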
> If you print non-ASCII strings to the terminal, and you can't be certain
> that the terminal supports the encoding in the string, and you can't
> reasonably deal with the exceptions, you should accept mojibake, by
> specifying the "replace" error handler when converting strings to the
> terminal's encoding.
I went through this exercise recently and had no joy. It seems the string I
chose to use simply would not render - even under 'ignore' and 'replace'.
It's really frustrating because I don't speak a non-ascii language and so
can't know if I am testing real-world strings or crazy Tolkien strings.
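For reference, a small sketch of what the "replace" handler does when text is converted to a terminal's encoding, assuming Python 3. The helper name terminal_safe is hypothetical, and falling back to "ascii" when the stream reports no encoding is my assumption, not something from the thread.

```python
import sys


def terminal_safe(text, encoding=None):
    """Return text with characters the terminal cannot encode replaced by '?'."""
    if encoding is None:
        # sys.stdout may be missing or report no encoding; assume ASCII then.
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    # Encoding with "replace" substitutes '?' for anything unencodable;
    # decoding again yields a str that is safe to print.
    return text.encode(encoding, "replace").decode(encoding)
```

With an ASCII terminal, terminal_safe(u"na\u00efve", "ascii") gives "na?ve": lossy, but it never raises.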
Another aspect of this is wxPython. My app requires the unicode build so that
strings have some hope of displaying on the widgets. If I access a font file
and fetch the family name - that can be encoded in any way, again unknown,
and I want to fetch it as 'unicode' and pass it to the widgets and not worry
about what's really going on. Given that, I thought I'd extend the 'utf8'
only concept to the app in general. I am sure I am wrong, but I feel cornered
at the moment.
> > 3. I made the decision to check the locale and stop the app if the return
> > from getlocale is (None,None).
> I would avoid locale.getlocale. It's a pointless function (IMO).
Could you say why?
Here's my use of it:
import gettext
import locale

# localeHelp and localedir are defined elsewhere in the app.
locale.setlocale( locale.LC_ALL, "" )
loc = locale.getlocale()
if loc is None:
    loc = locale.getlocale()
if loc == (None, None):
    print localeHelp # not utf-8 (I think)
# Now gettext
domain = "all"
gettext.install( domain, localedir, unicode = True )
lang = gettext.translation(domain, localedir, languages = [loc] )
lang.install(unicode = True )
So, I am using getlocale to get a tuple/list (easy, no?) to pass to the
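A hedged sketch of the same idea in Python 3 terms. locale.getlocale() returns a (language code, encoding) tuple, while the languages argument of gettext.translation() expects a list of language-code strings, so only the first element is passed on. The function name load_translation is hypothetical, and using fallback=True (so a missing catalogue gives untranslated strings instead of an exception) is my assumption about the desired behaviour.

```python
import gettext
import locale


def load_translation(domain, localedir):
    """Return a translation object for the user's locale, or a no-op fallback."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # unsupported locale settings: carry on in the C locale
    try:
        lang_code, _encoding = locale.getlocale()
    except ValueError:
        lang_code = None  # the locale name could not be parsed
    # With languages=None, gettext consults LANGUAGE, LC_ALL, LC_MESSAGES, LANG.
    languages = [lang_code] if lang_code else None
    return gettext.translation(domain, localedir,
                               languages=languages, fallback=True)
```

Calling load_translation("all", localedir).install() would then make _() available, even in the C locale.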
> Your program definitely, absolutely must work in the C locale. Of
> course, you cannot have any non-ASCII characters in that locale, so
> deal with it.
This would mean cutting-out a percentage of the external font files that can
be used by the app. Is there no modern standard regarding the LANG variable
and locales these days? My locale -a reports a bunch of xx_XX.utf8 locales.
Does it even make sense to use a non-utf8 locale anymore?
> If you have solved that, chances are high that it will work in other
> locales as well (but be sure to try Turkish, as that gives a
> surprising meaning to "I".lower()).
Oh boy, this gives me cold chills. I don't have the resources to start
worrying about every single language's edge-cases. This is kind of why I was
leaning towards a "use a utf8 locale please" approach.
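For the one place where I suspect case-folding matters in a font app (matching file extensions), here is a sketch that sidesteps the Turkish issue by only ever touching A-Z, assuming Python 3. The names ascii_lower and is_font_file, and the list of extensions, are illustrative assumptions.

```python
import string

# Translation table that maps only ASCII A-Z to a-z, independent of locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text):
    """Lower-case ASCII letters only; every other character is left alone."""
    return text.translate(_ASCII_LOWER)


def is_font_file(name, extensions=(".ttf", ".otf")):
    """Case-insensitive extension check that ignores locale case rules."""
    return ascii_lower(name).endswith(extensions)
```

So is_font_file("Arial.TTF") is true whatever LANG is set to, and a name containing a dotted capital I is passed through untouched.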
Fonty Python and other dev news at: