[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 06:56:50 CEST 2014

On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>   > On 8/26/2014 4:31 AM, MRAB wrote:
>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>   > >> Nick Coghlan writes:
>
>   > > How about:
>   > >
>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>   > >
>   > > If you want them removed, just pass an empty string as the
>   > > replacement.
>
> That seems better to me (I had too much C for breakfast, I think).
>
>   > And further, replacement could be a vector of 128 characters, to do
>   > immediate transcoding,
>
> Using what encoding?

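The vector itself would define the transcoding. Each lone surrogate
(U+DC80 through U+DCFF, one for each escaped byte 0x80 through 0xFF)
would map to the character at the corresponding index in the vector.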
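
To make the idea concrete, here is a rough sketch of what such a
function might look like. The name replace_surrogate_escapes comes from
MRAB's suggestion above; the vector handling, the ValueError, and the
latin1_vector example are my own assumptions for illustration, not an
existing API.

```python
# Hypothetical helper; nothing like this exists in the stdlib.
# surrogateescape maps an undecodable byte 0x80+i to code point U+DC80+i.
FIRST_ESCAPE = 0xDC80
LAST_ESCAPE = 0xDCFF


def replace_surrogate_escapes(s, replacement='\uFFFD'):
    """Replace surrogate escapes in s.

    replacement is either a single str used for every escape (pass ''
    to remove them), or a sequence of 128 strings indexed by
    (byte value - 0x80), which performs the transcoding directly.
    """
    if isinstance(replacement, str):
        table = dict.fromkeys(range(FIRST_ESCAPE, LAST_ESCAPE + 1),
                              replacement)
    else:
        if len(replacement) != 128:
            raise ValueError('replacement vector must have 128 entries')
        table = {FIRST_ESCAPE + i: ch for i, ch in enumerate(replacement)}
    return s.translate(table)


# Assumed example vector: the high half of Latin-1.
latin1_vector = [bytes([0x80 + i]).decode('latin-1') for i in range(128)]

escaped = b'caf\xe9'.decode('ascii', errors='surrogateescape')
fixed = replace_surrogate_escapes(escaped, latin1_vector)
```

A list is used for the vector rather than a 128-character str so that
it cannot be confused with a single (long) replacement string.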

> If you knew that much, why didn't you use
> (write, if necessary) an appropriate codec?  I can't envision this
> being useful.

If the data format describes its encoding, possibly containing data from 
several encodings in various spots, then perhaps it is best read as 
binary, and processed as binary until those definitions are found.

But an alternative would be to read with surrogate escapes and then,
once the encoding is determined, transcode the data. A previous
proposal was to reverse the surrogate escapes back to the original
bytes and then apply the (now known) appropriate codec. There are no
existing codecs that convert directly from surrogate escapes to the
desired end result. The replacement-vector technique could be used
instead, for single-byte, non-escaped encodings. On the other hand,
writing specialty codecs for the purpose would be more general.
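
For comparison, the reverse-then-decode approach can be sketched with
what the stdlib already provides. The function name redecode is made up
for this example, and it assumes the data was originally decoded as
ASCII with errors='surrogateescape'.

```python
def redecode(s, encoding):
    """Undo the surrogate escapes, then decode with the real codec.

    Assumes s came from bytes.decode('ascii', errors='surrogateescape'),
    so encoding with the same codec and error handler restores the
    original bytes exactly.
    """
    original = s.encode('ascii', errors='surrogateescape')
    return original.decode(encoding)


# Single-byte encoding: same result the vector approach would give.
single = b'caf\xe9'.decode('ascii', errors='surrogateescape')
single_fixed = redecode(single, 'latin-1')

# Multi-byte encoding: three escapes become one character, which a
# one-surrogate-to-one-character vector cannot express.
multi = b'\xe3\x81\x82'.decode('ascii', errors='surrogateescape')
multi_fixed = redecode(multi, 'utf-8')
```

This costs an extra pass through bytes, but it works for any codec,
which is why the vector idea only competes with it for single-byte
encodings.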