LANG, locale, unicode, and Debian packaging

Donn donn.ingle at
Sun Jan 13 12:51:58 CET 2008

> Yes. It does so when it fails to decode the byte string according to the
> file system encoding (which, in turn, bases on the locale).
That's at least one way I can weed out filenames that are going to give me
trouble; if Python itself can't figure out how to decode it, then I can also 
fail with honour.
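
A minimal sketch of that weeding-out, assuming the names arrive as byte strings and that UTF-8 is the encoding to test against (both are assumptions, and split_by_decodability is a made-up helper name, not something from the standard library):

```python
def split_by_decodability(raw_names, encoding="utf-8"):
    """Return (decoded, undecodable) for a list of byte-string file names."""
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so the caller can report or skip the file.
            undecodable.append(raw)
    return decoded, undecodable


# Hypothetical names: ASCII, valid UTF-8, and a Latin-1 byte that is not UTF-8.
names = [b"plain.ttf", b"M\xc3\x96gul.ttf", b"M\xd6gul.ttf"]
good, bad = split_by_decodability(names)
```

Here `bad` ends up holding only the name with the lone 0xD6 byte, which is the one the app can refuse gracefully.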

> > I will try the technique given
> > on:
> >#guessing-the-encoding Perhaps that will help.
> I would advise against such a strategy. Instead, you should first
> understand what the encodings of the file names actually *are*, on
> a real system, and draw conclusions from that.
I don't follow you here. File names *on* a real system are (for Linux) byte
strings in potentially *any* encoding. os.listdir() may even fail to grok
some of them. So, I will have a few elements in a list that are not unicode.
I can't ask the O/S for any help, so I should be able to pass such a byte
string to a function, as suggested in the article, to at least take one last
stab at identifying it.
Or is that a waste of time because os.listdir() has already tried something
similar (and probably better)?
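
The "one last stab" idea could look roughly like the sketch below. It is not the code from the article; guess_decode and its candidate list are assumptions chosen for illustration. A successful decode only shows the bytes are *valid* in that encoding, not that it is the encoding the name was written in, and latin-1 accepts every byte sequence, so it always "succeeds" if reached.

```python
def guess_decode(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Try each candidate encoding in order.

    Returns (text, encoding) for the first one that decodes cleanly,
    or (None, None) if none of them do.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


text, used = guess_decode(b"M\xd6gul")
```

For the byte string above, UTF-8 is rejected and the second candidate is used, so the order of the list is itself a guess about the system.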

> I notice that this doesn't include "to allow the user to enter file
> names", so it seems there is no input of file names, only output.
I forgot to mention the command-line interface... I actually had trouble with 
that too. The user can start the app like this:
fontypython /some/folder/
fontypython SomeFileName
And that introduces input in some kind of encoding. I hope that 
locale.getpreferredencoding() will be the right one to handle that.

Is such input (passed in via sys.argv) a byte string or unicode? I can find
out with type() I guess.
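
A small sketch for checking this and normalising the arguments. In the Python 2 releases of the time, sys.argv entries are byte strings; on Python 3 they are already text. to_text is a hypothetical helper, and using locale.getpreferredencoding() as the default is the assumption discussed above, not a guarantee that the user's shell used that encoding.

```python
import locale
import sys


def to_text(arg, encoding=None):
    """Return arg as text, decoding only when it is a byte string."""
    if isinstance(arg, bytes):
        encoding = encoding or locale.getpreferredencoding()
        # "replace" keeps the app running on undecodable bytes.
        return arg.decode(encoding, "replace")
    return arg


for arg in sys.argv[1:]:
    # Shows what the interpreter actually handed over.
    print(type(arg), repr(to_text(arg)))
```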

As to the rest, no, there's no other keyboard input for filenames. There *is* 
a 'filter' which is used as a regex to filter 'bold', 'italic' or whatever. I 
fully expect that to give me a hard time too.
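
One way to keep the filter out of trouble is to apply it only to the already-decoded display names, never to the raw byte strings. A minimal sketch, written for Python 3, with filter_names and the sample font names made up for illustration:

```python
import re


def filter_names(display_names, pattern):
    """Return the text names that match the user's filter regex."""
    rx = re.compile(pattern, re.IGNORECASE | re.UNICODE)
    return [name for name in display_names if rx.search(name)]


fonts = ["Arial-Bold.ttf", "M\u00d6gul Italic.ttf", "Plain.ttf"]
matches = filter_names(fonts, "bold|italic")
```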

> Then I suggest this technique of keeping bytestring/unicode string
> pairs. Use the Unicode string for display, and the byte string for
> accessing the disc.
Thanks, that's a good idea - I think I'll implement a dictionary to keep both 
and work things that way.
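
A rough sketch of such a dictionary, assuming UTF-8 for display purposes (build_name_map is a hypothetical name). It is keyed by the byte string rather than by the display text, because two different undecodable names can collapse to the same display string once bad bytes are replaced, while the byte strings in one directory are unique.

```python
def build_name_map(raw_names, encoding="utf-8"):
    """Map each original byte-string name to a text form safe to display."""
    name_map = {}
    for raw in raw_names:
        # Undecodable bytes become U+FFFD in the display text only;
        # the key keeps the exact bytes needed to open the file.
        name_map[raw] = raw.decode(encoding, "replace")
    return name_map


name_map = build_name_map([b"a.ttf", b"M\xd6gul.ttf"])
for raw, display in sorted(name_map.items()):
    print(display)   # show this to the user
    # open(raw, "rb") would use the byte string to access the disc
```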

> I see no problem with that:
> >>> u"M\xd6gul".encode("ascii","ignore")
> 'Mgul'
> >>> u"M\xd6gul".encode("ascii","replace")
> 'M?gul'
Well, that was what I expected to see too. I must have been doing something 

