[Python-Dev] Security implications of pep 383

Thu Mar 31 04:30:43 CEST 2011

On 3/30/2011 6:39 PM, Toshio Kuratomi wrote:

> Really, surrogates are a red herring to this whole issue.  The issue is that
> the original code was trying to compare two different transformations of
> byte sequences and expecting them to be equal.  Let's say that you have the
> following byte value::
>    b_test_value = b'\xa4\xaf'
>
> This is something that's stored in a file or the filename of something on
> a unix filesystem or stored in a database or any number of other things.
> Now you want to compare that to another piece of data that you've read in
> from somewhere outside of python.  You'd expect any of the following to
> work::
>    b_test_value == b_other_byte_value
>    b_test_value.encode('utf-8', 'surrogateescape') == b_other_byte_value.encode('utf-8', 'surrogateescape')
>    b_test_value.encode('latin-1') == b_other_byte_value.encode('latin-1')
>    b_test_value.encode('euc_jp') == b_other_byte_value.encode('euc_jp')
>
> You wouldn't expect this to work::
>    b_test_value.encode('latin-1') == b_other_byte_value.encode('euc_jp')
>
> Once you see that, you realize that the following is only a specific case of
> the former, surrogateescape doesn't really matter::
>    b_test_value.encode('utf-8', 'surrogateescape') == b_other_byte_value.encode('euc_jp')

All the encodes above should be decodes instead. Aside from that, your
point is correct, and not limited to CS. The whole art of disguise, for 
instance, is about effecting a transformation to falsely pass or fail an 
identity or equality comparison.
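
With decode substituted for encode, the comparisons can be sketched roughly
as below. This is only an illustration under assumptions: it targets
Python 3, and b_other_byte_value is assumed to hold the same two bytes as
b_test_value, which the quoted message does not actually specify.

```python
# Assumption: both names hold the same raw bytes. The value of
# b_other_byte_value is chosen here purely for illustration.
b_test_value = b'\xa4\xaf'
b_other_byte_value = b'\xa4\xaf'

# Comparing the raw bytes directly.
same_bytes = b_test_value == b_other_byte_value

# Applying the SAME transformation to both sides preserves equality.
same_utf8 = (b_test_value.decode('utf-8', 'surrogateescape')
             == b_other_byte_value.decode('utf-8', 'surrogateescape'))
same_latin1 = (b_test_value.decode('latin-1')
               == b_other_byte_value.decode('latin-1'))
same_euc_jp = (b_test_value.decode('euc_jp')
               == b_other_byte_value.decode('euc_jp'))

# Applying DIFFERENT transformations to each side does not.
mixed_latin1_euc_jp = (b_test_value.decode('latin-1')
                       == b_other_byte_value.decode('euc_jp'))
mixed_utf8_euc_jp = (b_test_value.decode('utf-8', 'surrogateescape')
                     == b_other_byte_value.decode('euc_jp'))

# surrogateescape is lossless: encoding with the same codec and error
# handler recovers the original bytes.
round_trip = (b_test_value
              .decode('utf-8', 'surrogateescape')
              .encode('utf-8', 'surrogateescape'))

print(same_bytes, same_utf8, same_latin1, same_euc_jp)
print(mixed_latin1_euc_jp, mixed_utf8_euc_jp)
print(round_trip == b_test_value)
```

The mismatched comparisons fail whether or not surrogateescape is involved,
which is the point of the quoted message: the two sides went through
different transformations.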

-- 
Terry Jan Reedy