LANG, locale, unicode, setup.py and Debian packaging
donn.ingle at gmail.com
Sun Jan 13 12:51:58 CET 2008
> Yes. It does so when it fails to decode the byte string according to the
> file system encoding (which, in turn, bases on the locale).
That's at least one way I can weed out filenames that are going to give me
trouble; if Python itself can't figure out how to decode it, then I can also
fail with honour.
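
A rough sketch of that weeding-out step, written for Python 3, where passing a bytes path to os.listdir() always yields raw byte strings. The helper name split_decodable is made up for illustration, and it assumes the relevant encoding is whatever sys.getfilesystemencoding() reports:

```python
import os
import sys


def split_decodable(raw_names, encoding=None):
    """Split byte-string file names into (decoded, undecodable) lists.

    split_decodable is a hypothetical helper, not part of any library.
    """
    if encoding is None:
        # Assumption: the file system encoding is the right one to try.
        encoding = sys.getfilesystemencoding()
    good, bad = [], []
    for raw in raw_names:
        try:
            good.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Python cannot decode this name, so set it aside.
            bad.append(raw)
    return good, bad


if __name__ == "__main__":
    good, bad = split_decodable(os.listdir(b"."))
    print(len(good), "decodable,", len(bad), "rejected")
```

The rejected names stay available as byte strings, so they can still be reported or skipped explicitly rather than silently lost.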
> > I will try the technique given on:
> > http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
> > Perhaps that will help.
> I would advise against such a strategy. Instead, you should first
> understand what the encodings of the file names actually *are*, on
> a real system, and draw conclusions from that.
I don't follow you here. The file names *on* a real system are (for Linux)
byte strings in potentially *any* encoding. os.listdir() may even
fail to grok some of them. So, I will have a few elements in a list that are
not unicode, I can't ask the O/S for any help and therefore I should be able
to pass that byte string to a function as suggested in the article to at
least take one last stab at identifying it.
Or is that a waste of time because os.listdir() has already tried something
similar (and prob. better)?
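
The "one last stab" idea might look something like the sketch below. It is not the code from the article; guess_decode and its default candidate list are assumptions made for illustration, in Python 3 syntax:

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: returns (None, None) if every candidate fails.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None


print(guess_decode(b"M\xc3\x96gul"))  # valid UTF-8
print(guess_decode(b"M\xd6gul"))      # not UTF-8, falls through to iso-8859-1
```

Note that iso-8859-1 accepts any byte sequence, so as the last candidate it never fails. A successful decode only means the bytes are *valid* in that encoding, not that the guess is what the file's creator intended.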
> I notice that this doesn't include "to allow the user to enter file
> names", so it seems there is no input of file names, only output.
I forgot to mention the command-line interface... I actually had trouble with
that too. The user can start the app like this:
And that introduces input in some kind of encoding. I hope that
locale.getpreferredencoding() will be the right one to handle that.
Is such input (passed-in via sys.argv) in byte-strings or unicode? I can find
out with type() I guess.
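
A small sketch for checking this directly; describe_args is a hypothetical name. Under Python 2 on Linux the arguments arrive as byte strings, while Python 3 hands over str objects that it has already decoded:

```python
import locale
import sys


def describe_args(argv):
    """Return (type name, repr) pairs for each command-line argument."""
    return [(type(arg).__name__, repr(arg)) for arg in argv]


if __name__ == "__main__":
    print("preferred encoding:", locale.getpreferredencoding())
    for type_name, shown in describe_args(sys.argv[1:]):
        print(type_name, shown)
```

Running the script with a non-ASCII argument shows both the type and the raw escapes, which makes it easy to see whether the locale's encoding matches what the terminal actually sent.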
As to the rest, no, there's no other keyboard input for filenames. There *is*
a 'filter' which is used as a regex to filter 'bold', 'italic' or whatever. I
fully expect that to give me a hard time too.
> Then I suggest this technique of keeping bytestring/unicode string
> pairs. Use the Unicode string for display, and the byte string for
> accessing the disc.
Thanks, that's a good idea - I think I'll implement a dictionary to keep both
and work things that way.
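
One possible shape for that dictionary, as a sketch in Python 3 syntax. The name build_name_map and the default of UTF-8 are assumptions. It is keyed on the byte string, since two different byte strings could end up with the same display text once undecodable bytes are replaced:

```python
def build_name_map(raw_names, encoding="utf-8"):
    """Map each on-disk byte string to a Unicode string safe for display.

    Hypothetical helper: undecodable bytes become U+FFFD in the display text.
    """
    name_map = {}
    for raw in raw_names:
        name_map[raw] = raw.decode(encoding, "replace")
    return name_map


names = build_name_map([b"plain.ttf", b"M\xd6gul.ttf"])
for raw, display in names.items():
    # Show `display` to the user, but open the file with `raw`.
    print(repr(raw), "->", repr(display))
```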
> I see no problem with that:
> >>> u"M\xd6gul".encode("ascii","ignore")
> >>> u"M\xd6gul".encode("ascii","replace")
Well, that was what I expected to see too. I must have been doing something
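
For reference, the two error handlers quoted above can be tried as follows. This is the Python 3 spelling, where the u prefix is optional and the result is a bytes object:

```python
name = "M\xd6gul"  # same text as the quoted u"M\xd6gul"

ignored = name.encode("ascii", "ignore")    # drops the unencodable character
replaced = name.encode("ascii", "replace")  # substitutes "?" for it

print(ignored)   # b'Mgul'
print(replaced)  # b'M?gul'
```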