[Python-Dev] Bytes path related questions for Guido
python at mrabarnett.plus.com
Thu Aug 28 09:30:39 CEST 2014
On 2014-08-28 05:56, Glenn Linderman wrote:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:
>> > On 8/26/2014 4:31 AM, MRAB wrote:
>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>> > >> Nick Coghlan writes:
>> > > How about:
>> > >
>> > > replace_surrogate_escapes(s, replacement='\uFFFD')
>> > >
>> > > If you want them removed, just pass an empty string as the
>> > > replacement.
>> That seems better to me (I had too much C for breakfast, I think).
>> > And further, replacement could be a vector of 128 characters, to do
>> > immediate transcoding,
>> Using what encoding?
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
>> If you knew that much, why didn't you use
>> (write, if necessary) an appropriate codec? I can't envision this
>> being useful.
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
> But an alternative would be to read with surrogate escapes, and then
> when the encoding is determined, to transcode the data. Previously, a
> proposal was made to reverse the surrogate escapes to the original
> bytes, and then apply the (now known) appropriate codec. There are not
> appropriate codecs that can convert directly from surrogate escapes to
> the desired end result. This technique could be used instead, for
> single-byte, non-escaped encodings. On the other hand, writing specialty
> codecs for the purpose would be more general.
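For concreteness, here is a minimal sketch of the two approaches discussed above. replace_surrogate_escapes is only the name proposed in this thread, not an existing stdlib function, and accepting a 128-entry sequence as the "vector" is an assumption about how that variant might look. The round trip through bytes uses only the existing surrogateescape error handler.

```python
def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Hypothetical helper (name taken from the proposal in this thread).

    `replacement` is either a string substituted for every escaped byte,
    or a sequence of 128 strings indexed by (byte - 0x80), which is an
    assumed reading of the "vector" idea.
    """
    def fix(ch):
        code = ord(ch)
        # surrogateescape maps undecodable bytes 0x80..0xFF
        # to the lone surrogates U+DC80..U+DCFF.
        if 0xDC80 <= code <= 0xDCFF:
            if isinstance(replacement, str):
                return replacement
            return replacement[code - 0xDC80]
        return ch
    return ''.join(fix(ch) for ch in s)


raw = b'caf\xe9'                               # Latin-1 bytes for 'café'
escaped = raw.decode('ascii', 'surrogateescape')

# Replace or remove the escapes.
replaced = replace_surrogate_escapes(escaped)
removed = replace_surrogate_escapes(escaped, '')

# "Vector" variant: a table covering bytes 0x80..0xFF, built from Latin-1.
latin1_high = [bytes([b]).decode('latin-1') for b in range(0x80, 0x100)]
transcoded = replace_surrogate_escapes(escaped, latin1_high)

# Earlier proposal: reverse the escapes to the original bytes, then
# apply the (now known) codec.
round_tripped = escaped.encode('ascii', 'surrogateescape').decode('latin-1')
```

The table variant only makes sense for single-byte encodings, since each escaped byte is mapped on its own; the round trip through bytes works for any codec.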
There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong.
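A small illustration of that point, as a sketch with made-up sample data: UTF-8 bytes decoded as Latin-1 never produce a surrogate escape, because every byte is valid Latin-1, yet the result is still wrong.

```python
original = 'caf\xe9'                     # 'café'
raw = original.encode('utf-8')           # b'caf\xc3\xa9'

# Wrong guess: Latin-1 can decode every byte, so the surrogateescape
# handler is never invoked.
wrong = raw.decode('latin-1', 'surrogateescape')

# No lone surrogates (U+DC80..U+DCFF) appear in the result...
has_escapes = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in wrong)

# ...but the decoded text is not what was originally written.
is_correct = (wrong == original)
```

Here has_escapes is False and is_correct is also False, so the absence of surrogate escapes says nothing about whether the chosen encoding was right.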