[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tony Nelson tonynelson at georgeanelson.com
Mon Apr 27 20:08:51 CEST 2009


At 23:39 -0700 04/26/2009, Glenn Linderman wrote:
>On approximately 4/25/2009 5:35 AM, came the following characters from
>the keyboard of Martin v. Löwis:
>>> Because the encoding is not reliably reversible.
>>
>> Why do you say that? The encoding is completely reversible
>> (unless we disagree on what "reversible" means).
>>
>>> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
>>> reversible encoding.
>>
>> Then please provide an example for a setup where it is not reversible.
>>
>> Regards,
>> Martin
>
>It is reversible if you know that it is decoded, and apply the encoding.
>But if you don't know that it has been encoded, then applying the reverse
>transform can convert an undecoded str that matches the decoded str to
>the form that it could have, but never did take.
>
>The problem is that there is no guarantee that the str interface
>provides only strictly conforming Unicode, so decoding bytes to
>non-strictly conforming Unicode can result in a data pun between
>non-strictly conforming Unicode coming from the str interface vs bytes
>being decoded to non-strictly conforming Unicode coming from the bytes
>interface.
 ...
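
For readers following along, the "data pun" described above can be sketched
roughly as follows. This assumes the error handler name "surrogateescape",
under which the PEP's scheme later shipped in Python 3.1; the sample bytes
and variable names are made up for illustration.

```python
# Sketch only: illustrates the ambiguity, not any particular API's behavior
# beyond bytes.decode / str.encode with the "surrogateescape" handler.

raw = b"caf\xe9"  # Latin-1 bytes; the trailing 0xE9 is not valid UTF-8

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
decoded = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = decoded.encode("utf-8", "surrogateescape")

# A str that arrived through some other (str) interface and merely
# happens to contain the same lone surrogate; it was never decoded.
from_str_api = "caf\udce9"

# The two strings compare equal, so nothing records which one was decoded.
indistinguishable = (decoded == from_str_api)

# Applying the reverse transform to the never-decoded str still yields
# bytes, a form that this string "could have, but never did take".
punned = from_str_api.encode("utf-8", "surrogateescape")
```

Once the two strings compare equal, the encoder cannot tell which of them
really started life as bytes.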

Maybe this is a dumb idea, but some people might be reassured if the
half-surrogates followed some particular pattern that is unlikely to occur
even in unreasonable text (lone half-surrogates are an error in Unicode).
The pattern could be some sequence of half-surrogate encoded bytes, framing
the intended data, as is done for RFC 2047 encoded-words in internationalized
email header fields.  It would take up a few more bytes in the string, but no
matter.  It would also make it easier to diagnose when decoding was not
properly done.
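
A minimal sketch of what such framing might look like, purely as an
illustration and not as anything the PEP specifies. The marker values OPEN
and CLOSE and the helper names decode_framed and encode_framed are
hypothetical. The markers are built from low surrogates below U+DC80, which
the "surrogateescape" handler (Python 3.1+) never produces on its own, and
the sketch assumes UTF-8 throughout.

```python
import re

# Hypothetical frame markers, loosely modeled on RFC 2047's "=?" and "?=".
OPEN = "\udc3d\udc3f"
CLOSE = "\udc3f\udc3d"

# A run of escaped bytes as produced by surrogateescape (U+DC80..U+DCFF).
_ESCAPED_RUN = re.compile("[\udc80-\udcff]+")
# The same run wrapped in the hypothetical frame markers.
_FRAMED_RUN = re.compile(OPEN + "([\udc80-\udcff]+)" + CLOSE)


def decode_framed(raw: bytes) -> str:
    """Decode UTF-8, wrapping each run of escaped bytes in frame markers."""
    text = raw.decode("utf-8", "surrogateescape")
    return _ESCAPED_RUN.sub(lambda m: OPEN + m.group(0) + CLOSE, text)


def encode_framed(text: str) -> bytes:
    """Reverse decode_framed; unframed lone surrogates raise an error."""
    out = []
    pos = 0
    for m in _FRAMED_RUN.finditer(text):
        # Text outside a frame is encoded strictly, so a stray lone
        # surrogate raises UnicodeEncodeError instead of becoming a byte.
        out.append(text[pos:m.start()].encode("utf-8"))
        # Only the framed payload is turned back into raw bytes.
        out.append(m.group(1).encode("utf-8", "surrogateescape"))
        pos = m.end()
    out.append(text[pos:].encode("utf-8"))
    return b"".join(out)


raw = b"caf\xe9"
framed = decode_framed(raw)
restored = encode_framed(framed)

try:
    encode_framed("caf\udce9")  # lone surrogate that was never framed
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Under these assumptions a properly decoded string round-trips, while a
string carrying a bare half-surrogate from somewhere else is reported as an
error rather than silently turned into bytes, at the cost of four extra
code points per escaped run.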

FWIW, I like the idea in the PEP, now that I think I understand it.

(BTW, gotta love what the email package is doing to the Subject: header
field. ;-')
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>

