[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

MRAB google at mrabarnett.plus.com
Thu Apr 30 22:07:41 CEST 2009

Barry Scott wrote:
> On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:
>>> How do I get a printable Unicode version of these path strings if they
>>> contain non-Unicode data?
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
> What I mean by printable is that the string must be valid Unicode
> that I can print to a UTF-8 console or place as text in a UTF-8
> web page.
> I think your PEP gives me a string that will not encode to
> valid UTF-8 that the world outside Python accepts. Did I get this
> point wrong?
>>> I'm guessing that an app has to understand that filenames come in two
>>> forms: Unicode, and bytes if it's not UTF-8 data. Why not simply return
>>> a string if it's valid UTF-8, and otherwise return bytes?
>> That would have been an alternative solution, and the one that 2.x uses
>> for listdir. People didn't like it.
> In our application we are running Fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system
> the files are in CP-1251(?) and not valid UTF-8.
> What we have to do is detect these non-UTF-8 filenames and get the
> users to rename them.
> Having an algorithm that says "if it's a string, no problem; if it's
> bytes, deal with the exceptions" seems simple.
> How do I do this detection with the PEP proposal?
> Do I end up using the bytes interface and doing the UTF-8 decode
> myself?
What do you do currently?

The PEP just offers a way of reading all filenames as Unicode, if that's
what you want. So what if the strings can't be encoded to normal UTF-8!
The filenames aren't valid UTF-8 anyway! :-)
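
For illustration, here is a minimal sketch of how the detection and the
"printable" conversion could look. It assumes the PEP's error handler under
the name it later shipped with in Python 3, `surrogateescape`, and the
escape range U+DC80..U+DCFF. The helper names `is_valid_utf8_name` and
`printable` are made up for this example and are not part of the PEP.

```python
import re

# Assumption: undecodable bytes 0x80..0xFF are escaped as the lone
# surrogates U+DC80..U+DCFF (the 'surrogateescape' error handler).
_ESCAPED = re.compile('[\udc80-\udcff]')


def is_valid_utf8_name(name):
    """Return True if the filename string encodes to strict UTF-8."""
    try:
        name.encode('utf-8')  # strict mode rejects lone surrogates
    except UnicodeEncodeError:
        return False
    return True


def printable(name):
    """Replace every escaped byte with a question mark."""
    return _ESCAPED.sub('?', name)


# Simulate a filename whose bytes are not valid UTF-8.
raw = b'caf\xe9.txt'
name = raw.decode('utf-8', 'surrogateescape')

print(is_valid_utf8_name(name))   # the user should be asked to rename it
print(printable(name))            # safe to show on a UTF-8 console
print(name.encode('utf-8', 'surrogateescape') == raw)  # bytes round-trip
```

So an application could keep using the string interface, flag any name for
which the strict encode fails, and still recover the original bytes when it
needs to open or rename the file.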
