LANG, locale, unicode, setup.py and Debian packaging
"Martin v. Löwis"
martin at v.loewis.de
Sun Jan 13 12:26:17 CET 2008
> I have found that os.listdir() does not always return unicode objects when
> passed a unicode path. Sometimes "byte strings" are returned in the list,
> mixed-in with unicodes.
Yes. It does so when it fails to decode the byte string according to the
file system encoding (which, in turn, is based on the locale).
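
To make that behaviour concrete, here is a minimal sketch that mimics it on a
plain list of byte strings. It is not the interpreter's own code: the helper
name decode_or_keep and the sample file names are made up for illustration,
and "utf-8" merely stands in for whatever the file system encoding is.

```python
def decode_or_keep(raw_names, encoding):
    """Decode each byte-string name; keep the raw bytes when decoding fails."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Undecodable name: hand back the original byte string unchanged.
            result.append(raw)
    return result


# Hypothetical names: ASCII, valid UTF-8, and Latin-1 (which is invalid UTF-8).
names = [b"plain.ttf", b"M\xc3\x96gul.ttf", b"M\xd6gul.ttf"]
mixed = decode_or_keep(names, "utf-8")
```

The first two entries of mixed come back as unicode strings, the third stays a
byte string, which is the kind of mixed list described above.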
> I will try the technique given
> Perhaps that will help.
I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.
> I gather you mean that I should get a unicode path, encode it to a byte string
> and then pass that to os.listdir
> Then, I suppose, I will have to decode each resulting byte string (via the
> detect routines mentioned in the link above) back into unicode - passing
> those I simply cannot interpret.
That's what I meant, yes. Again, you have a number of options - passing
those that you cannot interpret is but one option. Another option is to
>> Then, if the locale's encoding cannot decode the file names, you have
>> several options
>> a) don't try to interpret the file names as character strings, i.e.
>> don't decode them. Not sure why you need the file names - if it's
>> only to open the files, and never to present the file name to the
>> user, not decoding them might be feasible
> So, you reckon I should stick to byte-strings for the low-level file open
> stuff? It's a little complicated by my using Python Imaging to access the
> font files. It hands it all over to Freetype and really leaves my sphere of
> I'll do some testing with PIL and byte-string filenames. I wish my memory was
> better, I'm pretty sure I've been down that road and all my results kept
> pushing me to stick to unicode objects as far as possible.
I would be surprised if PIL/freetype would not support byte string file
names if you read those directly from the disk. OTOH, if the user has
selected/typed a string at a GUI, and you encode that - I can easily
see how that might have failed.
>> That's correct, and there is no solution (not in Python, not in any
>> other programming language). You have to make trade-offs. For that,
>> you need to analyze precisely what your requirements are.
> I would say the requirements are:
> 1. To open font files from any source (locale.)
> 2. To display their filename on the gui and the console.
> 3. To fetch some text meta-info (family etc.) via PIL/Freetype and display
> 4. To write the path and filename to text files.
> 5. To make soft links (path + filename) to another path.
> So, there's a lot of unicode + unicode and os.path.join and so forth going on.
I notice that this doesn't include "to allow the user to enter file
names", so it seems there is no input of file names, only output.
Then I suggest this technique of keeping bytestring/unicode string
pairs. Use the Unicode string for display, and the byte string for
accessing the disk.
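
A minimal sketch of such pairs, under a few assumptions: the helper names
pair_names and list_pairs are invented here, the display encoding is passed in
explicitly, and undecodable bytes are shown with the "replace" handler (that
is only one of the possible choices).

```python
import os


def pair_names(raw_names, encoding):
    """Return (raw_bytes, display_text) pairs for a list of byte-string names.

    Bytes that cannot be decoded are shown as U+FFFD in the display text;
    the raw byte string is kept untouched.
    """
    return [(raw, raw.decode(encoding, "replace")) for raw in raw_names]


def list_pairs(directory, encoding):
    """List a directory given as a byte string, so names come back as bytes."""
    return pair_names(os.listdir(directory), encoding)


# Hypothetical names, one of them not valid UTF-8.
pairs = pair_names([b"M\xd6gul.ttf", b"plain.ttf"], "utf-8")
```

Only the second element of each pair is ever shown or written to a text file;
the first element is what gets passed to open(), os.symlink() and the like.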
>>> I went through this exercise recently and had no joy. It seems the string
>>> I chose to use simply would not render - even under 'ignore' and
>> I don't understand what "would not render" means.
> I meant it would not print the name, but constantly throws ascii related
That cannot be. Both the ignore and the replace error handlers will
silence all decoding errors.
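
For illustration, a small sketch of the difference between the default
(strict) handler and the two handlers named above. The byte string is an
assumption: it is the name from this thread encoded as Latin-1.

```python
raw = b"M\xd6gul"  # assumed sample: Latin-1 bytes, not valid ASCII

# The default "strict" handler raises on the non-ASCII byte.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" drops the offending byte; "replace" substitutes U+FFFD for it.
ignored = raw.decode("ascii", "ignore")
replaced = raw.decode("ascii", "replace")
```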
> I don't know if the character will survive this email, but the text I was
> trying to display (under LANG=C) in a python script (not the immediate-mode
> interpreter) was: "MÖgul". The second character is a capital O with an umlaut
> (double-dots I think) above it. For some reason I could not get that to
> display as "M?gul" or "Mgul".
I see no problem with that:
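
What follows is not the session from the original message but a minimal
sketch of the kind of check meant here: encoding the unicode string to ASCII
(the encoding in effect under LANG=C) with an explicit error handler. The
helper name to_console is made up for illustration.

```python
name = u"M\xd6gul"  # capital O with umlaut as the second character

# "replace" substitutes "?" for what ASCII cannot represent, "ignore" drops it.
as_question = name.encode("ascii", "replace")
dropped = name.encode("ascii", "ignore")


def to_console(text, encoding="ascii"):
    """Return text reduced to what the given encoding can represent."""
    return text.encode(encoding, "replace").decode(encoding)
```

Here as_question is the byte string "M?gul" and dropped is "Mgul"; neither
call raises an exception.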