[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Thu Apr 30 16:06:33 EDT 2009

>>> How do I get a printable Unicode version of these path strings if they
>>> contain non-Unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
> 
> What I mean by printable is that the string must be valid Unicode
> that I can print to a UTF-8 console or place as text in a UTF-8
> web page.
> 
> I think your PEP gives me a string that will not encode to
> valid UTF-8 that the world outside Python accepts. Did I get this
> point wrong?

You are right. However, if your *only* requirement is that the string
be printable, then the requirement is fairly underspecified. One way to
get a printable string would be this function:

def printable_string(unprintable):
    return ""

This will always return a printable version of the input string...
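
A less degenerate take on the regular-expression idea quoted above, as a
minimal sketch. It assumes the scheme as finally specified in the PEP,
where each undecodable byte 0x80..0xFF appears in the string as a lone
surrogate U+DC80..U+DCFF (the "surrogateescape" error handler). The names
make_printable and _ESCAPED_BYTE are made up for illustration.

```python
import re

# Assumption: undecodable bytes were mapped to lone surrogates in the
# range U+DC80..U+DCFF, as the "surrogateescape" error handler does.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def make_printable(name):
    """Replace every escaped (undecodable) byte with a question mark."""
    return _ESCAPED_BYTE.sub("?", name)


# Simulate a file name containing a byte that is not valid UTF-8.
broken = b"caf\xe9".decode("utf-8", "surrogateescape")
shown = make_printable(broken)
```

The result encodes to UTF-8 without error, but the replacement is lossy:
the original bytes cannot be recovered from it, so it is only good for
display.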

> In our application we are running Fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system,
> the file names are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.

> Having an algorithm that says "if it's a string, no problem; if it's
> bytes, deal with the exceptions" seems simple.
> 
> How do I do this detection with the PEP proposal?
> Do I end up using the byte interface and doing the UTF-8 decode
> myself?

No, you should encode the file name using the "strict" error handler,
with the locale encoding. If the file name encodes successfully, it's
correct; otherwise, it's broken.
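
As a rough sketch of that check: the helper name is_well_formed is
hypothetical, and using sys.getfilesystemencoding() as the default stands
in for "the locale encoding", which is an assumption about the setup.

```python
import sys


def is_well_formed(name, encoding=None):
    """Return True if `name` encodes cleanly with the strict handler.

    Assumption: names with undecodable bytes carry lone surrogates
    (U+DC80..U+DCFF), which a strict encode refuses.
    """
    if encoding is None:
        # Assumed stand-in for the locale encoding.
        encoding = sys.getfilesystemencoding()
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


# One valid UTF-8 name and one containing a stray non-UTF-8 byte.
good = b"caf\xc3\xa9".decode("utf-8", "surrogateescape")
bad = b"caf\xe9".decode("utf-8", "surrogateescape")
results = (is_well_formed(good, "utf-8"), is_well_formed(bad, "utf-8"))
```

Names that fail the check can still be passed back to the file system
APIs unchanged; the check only tells you whether the name is safe to show
or store as text.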

Regards,
Martin