Python 3.1 Unicode question
duncan.booth at invalid.invalid
Wed Sep 16 19:21:54 CEST 2009
jeffunit <jeff at jeffunit.com> wrote:
>>That looks like a "surrogate escape" (See PEP 383)
>>http://www.python.org/dev/peps/pep-0383/. It indicates the wrong
>>encoding was used to decode the filename.
> That seems likely. How do I set the encoding to something correct to
> decode the filename?
> Clearly Windows knows how to display it.
> I suspect that since I compiled Python with Cygwin, it is using a POSIX
> standard rather than a Windows-specific standard. Of course, ideally I
> would like my code to work on Linux as well as Windows, as I back up all
> of my data to a Linux machine with
If you are running on a Linux system then the filenames are stored as
bytes, and the system does not record which encoding produced them. In
fact, different files in the same directory could use different encodings.
That's why Python 3.1 uses surrogate escapes: bytes that cannot be decoded
are mapped to lone surrogate code points, so you can at least work with
the files even if you can't display the filenames.
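To make the surrogate-escape round trip concrete, here is a minimal sketch. The byte string is a made-up filename (Latin-1 encoded, so not valid UTF-8), and UTF-8 is only assumed as the filesystem encoding; `has_surrogate_escapes` is a hypothetical helper, not something from the standard library.

```python
# Hypothetical filename bytes: Latin-1 encoded, so not valid UTF-8.
raw = b"caf\xe9.txt"

# The surrogateescape handler (PEP 383) maps the undecodable byte 0xE9
# to the lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", "surrogateescape")


def has_surrogate_escapes(s):
    """Return True if s contains code points produced by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s)


# Encoding with the same handler restores the original bytes exactly,
# so the file can still be opened, copied or renamed.
round_tripped = name.encode("utf-8", "surrogateescape")

# For display only: swap anything unencodable for '?'. This is lossy.
printable = name.encode("utf-8", "replace").decode("utf-8")
```

The point is that `name` is safe to pass back to the os functions, while `printable` is only good for showing to a human.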
If you are running on Windows and using the native Python to access an
NTFS-formatted partition then there shouldn't be a problem: the filenames
are stored as Unicode and Python uses the Unicode APIs. Of course, you may
still not be able to display the filenames if they contain characters not
available in your output code page.
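If printing such a name fails because the console code page lacks the characters, one possible workaround is to escape them before printing. This is only a sketch: `displayable` is a hypothetical helper, and falling back to ASCII when `sys.stdout.encoding` is unset is my assumption.

```python
import sys


def displayable(name, encoding=None):
    """Return a version of name that can always be written in `encoding`.

    Characters the encoding cannot represent are shown as backslash
    escapes instead of raising UnicodeEncodeError.
    """
    # Assumption: fall back to ASCII if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return name.encode(encoding, "backslashreplace").decode(encoding)


# Forcing ASCII here so the result does not depend on the console.
print(displayable("na\u00efve.txt", "ascii"))
```

This only changes what is displayed; keep using the original string when you actually open the file.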
If you use Cygwin, a quick search on Google turned up some old discussions
implying that it uses the 8-bit APIs, which convert filenames using the
current code page and replace any characters it cannot handle with '?'.
I have no idea whether that still applies.
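To illustrate why that kind of conversion is a problem, here is a rough simulation in plain Python. It does not call any Cygwin or Windows API; cp1252 is just an assumed example code page and both filenames are made up.

```python
# Two different hypothetical filenames with characters outside cp1252.
first = "\u65e5\u672c.txt"
second = "\u4e2d\u6587.txt"

# Converting through an 8-bit code page with replacement is lossy:
# every character the code page lacks becomes '?'.
lossy_first = first.encode("cp1252", "replace")
lossy_second = second.encode("cp1252", "replace")

# Decoding again cannot bring the original characters back.
recovered = lossy_first.decode("cp1252")

# Distinct names collapse to the same bytes, so the program can no
# longer tell the files apart or open either one by name.
collision = lossy_first == lossy_second
```

Unlike the surrogate escapes above, there is no way to get back from the '?' form to the real filename.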