[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Adam Olsen rhamph at gmail.com
Mon Sep 29 23:57:45 CEST 2008


On Mon, Sep 29, 2008 at 5:12 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Adam Olsen <rhamph <at> gmail.com> writes:
>>
>> UTF-8b doesn't work as intended.  It produces an invalid unicode
>> object (garbage surrogates) that cannot be used with external APIs or
>> libraries that require unicode.
>
> At least it works with all Python operations supported by the unicode type
> (methods, concatenation, etc.) without any bad surprise. That feeding it to e.g.
> PyGTK may give bogus results is another problem.
>
>> If you don't need unicode then your
>> code should state so explicitly, and 8859-1 is ideal there.
>
> But then you can say bye-bye to proper representation (e.g. using print()) of
> even valid filenames.

You can't print UTF-8b either.  Printing requires converting the
unicode object to UTF-8 (or whatever output encoding), and the unicode
object isn't valid, so you'd get an exception[1].
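
A minimal sketch of the problem, assuming a later Python (3.1+) where a
UTF-8b-style scheme is exposed as the "surrogateescape" error handler;
that handler did not exist when this was written, and the helper names
are made up for illustration.  It only checks a strict UTF-8 encode;
whether print() itself raises depends on how the output stream is
configured.

```python
def decode_utf8b(raw: bytes) -> str:
    """Decode bytes with a UTF-8b-style scheme (assumes Python 3.1+)."""
    return raw.decode("utf-8", errors="surrogateescape")


def can_encode_strict_utf8(text: str) -> bool:
    """Return True if text can be encoded as strict, conformant UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A Latin-1 encoded filename: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"
name = decode_utf8b(raw_name)

# The undecodable byte became the lone surrogate U+DCE9.
print(ascii(name))                   # 'caf\udce9.txt'

# A strict encoder refuses the lone surrogate.
print(can_encode_strict_utf8(name))  # False

# Only the same non-standard handler gets the original bytes back.
print(name.encode("utf-8", errors="surrogateescape") == raw_name)  # True
```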

The same applies to all other hacks (such as PUA scalars).  Either the
scalar value already has a defined meaning, in which case decoding is
lossy (a genuine occurrence can't be told apart from an escaped byte)
and reencoding replaces the correct behaviour, or it's not a valid
scalar value, in which case it can't be used with any external API
that requires conformant unicode.  There's no solution except to not
decode, and 8859-1 is the way to do that.
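
To illustrate both halves of that argument, here is a rough sketch.  The
PUA escape scheme is hypothetical (PUA_BASE, the "pua_escape_demo"
handler name and the helper functions are all invented for this
example, not an existing proposal); the 8859-1 part uses only the
standard "latin-1" codec.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical: map undecodable byte b to U+F700 + b


def _pua_escape(exc):
    """Error handler: replace each undecodable byte with a PUA scalar."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape_demo", _pua_escape)


def decode_pua(raw: bytes) -> str:
    """Decode as UTF-8, escaping invalid bytes into the PUA."""
    return raw.decode("utf-8", errors="pua_escape_demo")


def decode_opaque(raw: bytes) -> str:
    """Treat bytes as opaque: byte N becomes code point N."""
    return raw.decode("latin-1")


def encode_opaque(text: str) -> bytes:
    """Inverse of decode_opaque."""
    return text.encode("latin-1")


# Two different filenames on disk...
undecodable = b"a\xe9b"                    # stray 0xE9 byte
genuine = "a\uf7e9b".encode("utf-8")       # really contains U+F7E9
# ...decode to the same string, so the decoding is lossy.
collision = decode_pua(undecodable) == decode_pua(genuine)

# 8859-1 round-trips every possible byte, and the result is made of
# ordinary valid scalar values, so a strict UTF-8 encode also works.
all_bytes = bytes(range(256))
round_trips = encode_opaque(decode_opaque(all_bytes)) == all_bytes
as_utf8 = decode_opaque(all_bytes).encode("utf-8")
```

The price, as Antoine points out above, is that the 8859-1 string is
not a readable rendering of the name unless the bytes really were
8859-1.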


[1] Python's UTF codecs are broken in a couple of respects, including the
fact that Python itself uses CESU-8(!).  See
http://bugs.python.org/issue3297 and http://bugs.python.org/issue3672


-- 
Adam Olsen, aka Rhamphoryncus

