[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
tonynelson at georgeanelson.com
Mon Apr 27 20:08:51 CEST 2009
At 23:39 -0700 04/26/2009, Glenn Linderman wrote:
>On approximately 4/25/2009 5:35 AM, came the following characters from
>the keyboard of Martin v. Löwis:
>>> Because the encoding is not reliably reversible.
>> Why do you say that? The encoding is completely reversible
>> (unless we disagree on what "reversible" means).
>>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
>>> reversible encoding.
>> Then please provide an example for a setup where it is not reversible.
>It is reversible if you know that it is decoded, and apply the encoding.
> But if you don't know that it has been encoded, then applying the reverse
>transform can convert an undecoded str that matches the decoded str to
>the form that it could have, but never did take.
>The problem is that there is no guarantee that the str interface
>provides only strictly conforming Unicode, so decoding bytes to
>non-strictly conforming Unicode, can result in a data pun between
>non-strictly conforming Unicode coming from the str interface vs bytes
>being decoded to non-strictly conforming Unicode coming from the bytes
Maybe this is a dumb idea, but some people might be reassured if the
half-surrogates had some particular pattern that is unlikely to occur even
in unreasonable text (as half-surrogates are an error in Unicode). The
pattern could be some sequence of half-surrogate encoded bytes, framing the
intended data, as is done for RFC 2047 internationalized header fields in
email. It would take up a few more bytes in the string, but no matter. It
would also make it easier to diagnose when decoding was not properly done.
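To make the framing idea concrete, here is a minimal Python sketch. It assumes the PEP's error handler is available under the name "surrogateescape" (the name it has in Python 3.1 and later). The marker code points OPEN and CLOSE and the helpers decode_framed and encode_framed are hypothetical, invented only for illustration; the PEP defines no such framing. The first part shows the plain round trip and the data pun Glenn describes; the second part wraps each run of escaped bytes in markers so that a stray half-surrogate from some other source is rejected instead of being silently turned into a byte.

```python
import re

# --- Plain PEP 383 behaviour -------------------------------------------
raw = b"caf\xe9"                                  # Latin-1 bytes, invalid as UTF-8
text = raw.decode("utf-8", "surrogateescape")     # 0xE9 becomes lone U+DCE9
roundtrip = text.encode("utf-8", "surrogateescape")

# The pun: a str that already contained U+DCE9 (never decoded from bytes)
# encodes to exactly the same bytes.
punned = "caf\udce9".encode("utf-8", "surrogateescape")

# --- Hypothetical framing ----------------------------------------------
# Assumption: U+DC00 and U+DC01 are free to use as markers, because the
# PEP only maps bytes 0x80-0xFF to U+DC80-U+DCFF.
OPEN, CLOSE = "\udc00", "\udc01"
_ESCAPED_RUN = re.compile("[\udc80-\udcff]+")
_FRAMED_RUN = re.compile("\udc00([\udc80-\udcff]+)\udc01")


def decode_framed(data, encoding="utf-8"):
    """Decode bytes, wrapping each run of escaped bytes in OPEN/CLOSE."""
    decoded = data.decode(encoding, "surrogateescape")
    return _ESCAPED_RUN.sub(lambda m: OPEN + m.group(0) + CLOSE, decoded)


def encode_framed(s, encoding="utf-8"):
    """Encode a str, restoring bytes only from properly framed runs.

    Anything outside a frame is encoded strictly, so an unframed lone
    surrogate raises UnicodeEncodeError instead of becoming a byte.
    """
    out = []
    pos = 0
    for m in _FRAMED_RUN.finditer(s):
        out.append(s[pos:m.start()].encode(encoding))            # strict
        out.append(m.group(1).encode(encoding, "surrogateescape"))
        pos = m.end()
    out.append(s[pos:].encode(encoding))                         # strict
    return b"".join(out)


framed = decode_framed(raw)
restored = encode_framed(framed)
```

With this sketch, encode_framed(decode_framed(raw)) gives back the original bytes, while encode_framed("caf\udce9") raises UnicodeEncodeError, which is the "easier to diagnose" property mentioned above. The cost is two extra code points per run of undecodable bytes.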
FWIW, I like the idea in the PEP, now that I think I understand it.
(BTW, gotta love what the email package is doing to the Subject: header
TonyN.:' <mailto:tonynelson at georgeanelson.com>