[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 30 21:43:24 CEST 2009

On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:

>> How do I get a printable Unicode version of these path strings if they
>> contain non-Unicode data?
>
> Define "printable". One way would be to use a regular expression,
> replacing all codes in a certain range with a question mark.

What I mean by printable is that the string must be valid Unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.
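
For reference, the regular-expression approach Martin describes might
look roughly like the sketch below. It assumes the PEP's error handler
under the name it shipped with in Python 3.1 and later
(`surrogateescape`), which maps each undecodable byte to a lone
surrogate in the range U+DC80..U+DCFF. The helper name `printable` and
the sample filename are made up for illustration.

```python
import re

# Lone surrogates U+DC80..U+DCFF stand in for undecodable bytes
# (assumption: the name was decoded with the surrogateescape handler).
_ESCAPED_BYTES = re.compile('[\udc80-\udcff]')


def printable(name):
    """Return a copy of name that is safe to encode as strict UTF-8."""
    return _ESCAPED_BYTES.sub('?', name)


# Hypothetical filename: the byte 0xE9 is not valid UTF-8 here.
raw = b'caf\xe9.txt'
name = raw.decode('utf-8', 'surrogateescape')
print(printable(name))
```

The result is lossy: the original bytes cannot be recovered from the
printable form, so it is only suitable for display.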

I think your PEP gives me a string that cannot be encoded as valid
UTF-8, which is what the world outside Python expects. Did I get this
point wrong?
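
A small experiment that illustrates the concern, again assuming the
handler name `surrogateescape` from Python 3.1 and later, and a
made-up filename. The names `strict_ok` and `round_trip` are only for
this sketch.

```python
# Hypothetical filename bytes that are not valid UTF-8.
raw = b'caf\xe9.txt'

# Decoding with the PEP's handler succeeds, yielding a lone surrogate.
name = raw.decode('utf-8', 'surrogateescape')

# A strict encode, which is what a UTF-8 console or web page needs, fails.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the same handler on the way out restores the original bytes.
round_trip = name.encode('utf-8', 'surrogateescape')
```

So the string round-trips back to the operating system, but it is not
text that can be handed to a strict UTF-8 consumer as-is.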

>> I'm guessing that an app has to understand that filenames come in two
>> forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a
>> string if it's valid UTF-8, otherwise return bytes?
>
> That would have been an alternative solution, and the one that 2.x uses
> for listdir. People didn't like it.

In our application we are running Fedora on the assumption that
filenames are UTF-8. When Windows systems FTP files to our system,
the filenames are in CP-1251(?) and are not valid UTF-8.

What we have to do is detect these non-UTF-8 filenames and get the
users to rename them.

Having an algorithm that says "if it's a string, no problem; if it's
bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the bytes interface and doing the UTF-8 decode
myself?
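
To make the question concrete, here is a rough sketch of the two
options as I understand them. It is not taken from the PEP. It assumes
Python 3.2 or later for `os.fsencode`, and for the str case it assumes
the filesystem encoding is UTF-8. The function names `is_valid_utf8`
and `names_needing_rename` are hypothetical.

```python
import os


def is_valid_utf8(name):
    """Report whether a filename (str or bytes) is clean UTF-8."""
    if isinstance(name, str):
        # str from the PEP interface: escaped bytes show up as lone
        # surrogates, which a strict UTF-8 encode rejects.
        try:
            name.encode('utf-8')
        except UnicodeEncodeError:
            return False
        return True
    # bytes from the bytes interface: do the UTF-8 decode ourselves.
    try:
        name.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def names_needing_rename(directory):
    """List the raw byte names in directory that are not valid UTF-8."""
    # Passing bytes to os.listdir returns bytes filenames.
    return [n for n in os.listdir(os.fsencode(directory))
            if not is_valid_utf8(n)]
```

Either branch detects the bad names; the bytes branch does not depend
on what the filesystem encoding happens to be.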

Barry