[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 19:15:40 CEST 2014

On 8/28/2014 12:30 AM, MRAB wrote:
> On 2014-08-28 05:56, Glenn Linderman wrote:
>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>>> Glenn Linderman writes:
>>>   > On 8/26/2014 4:31 AM, MRAB wrote:
>>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>>   > >> Nick Coghlan writes:
>>>
>>>   > > How about:
>>>   > >
>>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>>   > >
>>>   > > If you want them removed, just pass an empty string as the
>>>   > > replacement.
>>>
>>> That seems better to me (I had too much C for breakfast, I think).
>>>
>>>   > And further, replacement could be a vector of 128 characters, to do
>>>   > immediate transcoding,
>>>
>>> Using what encoding?
>>
>> The vector would contain the transcoding. Each lone surrogate would map
>> to a character in the vector.
>>
>>> If you knew that much, why didn't you use
>>> (write, if necessary) an appropriate codec?  I can't envision this
>>> being useful.
>>
>> If the data format describes its encoding, possibly containing data from
>> several encodings in various spots, then perhaps it is best read as
>> binary, and processed as binary until those definitions are found.
>>
>> But an alternative would be to read with surrogate escapes, and then
>> when the encoding is determined, to transcode the data. Previously, a
>> proposal was made to reverse the surrogate escapes to the original
>> bytes, and then apply the (now known) appropriate codec. There are no
>> appropriate codecs that can convert directly from surrogate escapes to
>> the desired end result. This technique could be used instead, for
>> single-byte, non-escaped encodings. On the other hand, writing specialty
>> codecs for the purpose would be more general.
>>
> There'll be a surrogate escape if a byte couldn't be decoded, but just
> because a byte could be decoded, it doesn't mean that it's correct.
>
> If you picked the wrong encoding, the other codepoints could be wrong
> too.

Aha! Thanks for pointing out the flaw in my reasoning. But that means
"replace_surrogate_escapes" is also pretty useless, because it only
cleans out the non-decodable bytes, not the incorrectly decoded
characters.