[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 09:30:39 CEST 2014

On 2014-08-28 05:56, Glenn Linderman wrote:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:
>>   > On 8/26/2014 4:31 AM, MRAB wrote:
>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>   > >> Nick Coghlan writes:
>>
>>   > > How about:
>>   > >
>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>   > >
>>   > > If you want them removed, just pass an empty string as the
>>   > > replacement.
>>
>> That seems better to me (I had too much C for breakfast, I think).
>>
>>   > And further, replacement could be a vector of 128 characters, to do
>>   > immediate transcoding,
>>
>> Using what encoding?
>
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
>
>> If you knew that much, why didn't you use
>> (write, if necessary) an appropriate codec?  I can't envision this
>> being useful.
>
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
>
> But an alternative would be to read with surrogate escapes, and then
> when the encoding is determined, to transcode the data. Previously, a
> proposal was made to reverse the surrogate escapes to the original
> bytes, and then apply the (now known) appropriate codec. There are no
> appropriate codecs that can convert directly from surrogate escapes to
> the desired end result. This technique could be used instead, for
> single-byte, non-escaped encodings. On the other hand, writing specialty
> codecs for the purpose would be more general.
>
There'll be a surrogate escape if a byte couldn't be decoded, but the
fact that a byte could be decoded doesn't mean that it was decoded
correctly.
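
To make the proposal quoted above concrete, here is a minimal sketch of
what such a helper might look like. Nothing like this exists in the
standard library; replace_surrogate_escapes is the name suggested in
this thread, and the separate `table` keyword for the 128-entry vector
is my own assumption about how the two uses could be kept apart. It
relies only on the fact that errors='surrogateescape' maps the
undecodable bytes 0x80..0xFF to the lone surrogates U+DC80..U+DCFF.

```python
import re

# Lone surrogates produced by errors='surrogateescape' for bytes 0x80..0xFF.
_ESCAPED = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\ufffd', table=None):
    """Hypothetical helper sketched from the thread; not a stdlib function.

    With no table, every escaped byte becomes `replacement` (pass '' to
    drop them). With a 128-entry table (an assumed extension), the escaped
    byte 0x80 + i becomes table[i].
    """
    if table is None:
        # A callable avoids re.sub's backslash processing of the replacement.
        return _ESCAPED.sub(lambda m: replacement, s)
    if len(table) != 128:
        raise ValueError('table must have exactly 128 entries')
    return _ESCAPED.sub(lambda m: table[ord(m.group()) - 0xDC80], s)


raw = b'caf\xe9'                                  # Latin-1 bytes, not valid UTF-8
s = raw.decode('utf-8', errors='surrogateescape')  # 0xE9 becomes U+DCE9

# Illustrative table: treat each escaped byte as Latin-1.
latin1_table = [chr(b) for b in range(0x80, 0x100)]

replaced = replace_surrogate_escapes(s)            # escape -> U+FFFD
removed = replace_surrogate_escapes(s, '')         # escape dropped
transcoded = replace_surrogate_escapes(s, table=latin1_table)
```

The table form only makes sense for single-byte encodings, and it only
touches the bytes that failed to decode in the first place.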

If you picked the wrong encoding, the other codepoints could be wrong
too.
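
A small illustration of that point, using Latin-1 as the (wrong) guess
for data that is really UTF-8. The sample string and variable names are
made up for the example. Latin-1 assigns a character to every byte, so
no surrogate escapes appear at all, yet the decoded text is wrong.

```python
original = 'caf\u00e9'
raw = original.encode('utf-8')            # b'caf\xc3\xa9'

# Wrong guess: Latin-1 can decode any byte, so nothing is ever escaped.
guessed = raw.decode('latin-1', errors='surrogateescape')

# Check for lone surrogates in the range used by surrogateescape.
has_escapes = any('\udc80' <= ch <= '\udcff' for ch in guessed)

# The text decoded "successfully" but is mojibake: 'caf\xc3\xa9'.
is_correct = guessed == original

# Going back through the original bytes still recovers the right text.
recovered = guessed.encode('latin-1').decode('utf-8')
```

Here has_escapes is False and is_correct is False, so replacing or
transcoding only the escaped code points would leave the damage in place.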