[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Oct 15 08:26:29 CEST 2006

Marcin 'Qrczak' Kowalczyk wrote:
> It is true that it can change the interpretation of file contents.
> This is unavoidable. Unless someone uses unpaired surrogates for this
> purpose (or code points above U+10FFFF) - I've seen such proposals,
> but IMHO they are abusing rules too far.

It's not exactly unavoidable: any escaping mechanism can support the
full range of valid input. In your escaping mechanism, you could
double each NUL byte on decoding, and write a single NUL byte on
encoding whenever you find two consecutive NUL characters.
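
To make the NUL-doubling idea concrete, here is a minimal sketch. It
assumes the escape is U+0000 followed by a character whose code point
is the undecodable byte's value, and that the underlying codec is
UTF-8; both are assumptions for illustration, as are the handler name
"nul-escape" and the two helper functions.

```python
import codecs

def _escape_bytes(err):
    # Hypothetical handler: each undecodable byte becomes U+0000
    # followed by a character carrying the byte's value.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\x00" + chr(b) for b in bad), err.end

codecs.register_error("nul-escape", _escape_bytes)

def decode_nul_escaped(data: bytes) -> str:
    # Double genuine NUL bytes first, so that a lone U+0000 in the
    # result can only be the start of an escape.
    return data.replace(b"\x00", b"\x00\x00").decode("utf-8", "nul-escape")

def encode_nul_escaped(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling escape at end of text")
        nxt = text[i + 1]
        if nxt == "\x00":
            out.append(0)            # two NUL characters -> one NUL byte
        elif 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))     # escaped undecodable byte
        else:
            raise ValueError("invalid escape sequence")
        i += 2
    return bytes(out)

raw = b"a\x00b\xff\xe4"
text = decode_nul_escaped(raw)
restored = encode_nul_escaped(text)
```

With this, a genuine NUL byte and an escaped byte can no longer be
confused, so every byte string survives a decode/encode cycle.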

I still think that PUA (Private Use Area) characters would be a better
choice: in your encoding, you get two characters of encoded text for
one byte of input; if people need to render the file name, this will
be confusing. With a PUA character, rendering will still produce
mojibake, but you will likely get one "box" of output for what the
user thinks should be one character.

Refining my last proposal: I think there should be a "pass-through"
error handler for codecs which puts undecodable bytes into PUA
characters, and encodes unencodable characters from the PUA range
into the corresponding bytes. This could be layered on top of
existing codecs, and would help to decode otherwise undecodable file
names in a way that round-trips.
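
A rough sketch of such a handler, using the codecs error-callback
mechanism. The handler name "pua-passthrough" and the base code point
U+F700 are arbitrary choices made for illustration, not part of any
existing codec. The example uses the ASCII codec because the encode
side only fires for characters the codec cannot encode itself; a codec
such as UTF-8 encodes PUA code points natively and would never call
the handler for them.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical: byte b maps to U+F700 + b

def pua_passthrough(err):
    if isinstance(err, UnicodeDecodeError):
        # Undecodable bytes: one PUA character per byte.
        bad = err.object[err.start:err.end]
        return "".join(chr(PUA_BASE + b) for b in bad), err.end
    if isinstance(err, UnicodeEncodeError):
        # Unencodable characters: map back only if all of them
        # lie in our PUA block, otherwise re-raise below.
        chars = err.object[err.start:err.end]
        if all(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in chars):
            return bytes(ord(c) - PUA_BASE for c in chars), err.end
    raise err

codecs.register_error("pua-passthrough", pua_passthrough)

raw = b"caf\xe9.txt"                              # not valid ASCII
name = raw.decode("ascii", "pua-passthrough")     # 'caf\uf7e9.txt'
back = name.encode("ascii", "pua-passthrough")    # original bytes again
```

Each undecodable byte becomes exactly one character, so the decoded
name has the same length as the byte string, and encoding it again
restores the original bytes.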

Regards,
Martin