[Python-Dev] Bytes path related questions for Guido
Stephen J. Turnbull
stephen at xemacs.org
Thu Aug 28 08:30:44 CEST 2014
Glenn Linderman writes:
> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> > Glenn Linderman writes:
> > > And further, replacement could be a vector of 128 characters, to do
> > > immediate transcoding,
> > Using what encoding?
> The vector would contain the transcoding. Each lone surrogate would map
> to a character in the vector.
Yes, that's obvious. The question is where do you get the vector?
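For concreteness, here is a minimal sketch of what such a vector might look like for one single-byte encoding. The helper names build_vector and transcode_escapes are hypothetical, and KOI8-R is only an illustrative choice. Note that this sketch can only build the vector by asking an existing codec for it.

```python
def build_vector(encoding):
    """Return 128 characters, one for each high byte 0x80..0xFF (hypothetical helper)."""
    return bytes(range(0x80, 0x100)).decode(encoding)

def transcode_escapes(text, vector):
    """Map each lone surrogate U+DC80..U+DCFF to the matching vector entry."""
    table = {0xDC80 + i: ch for i, ch in enumerate(vector)}
    return text.translate(table)

# Assumed input: the Russian word "privet" stored as KOI8-R bytes,
# read with surrogate escapes before the encoding was known.
word = '\u043f\u0440\u0438\u0432\u0435\u0442'
escaped = word.encode('koi8-r').decode('ascii', errors='surrogateescape')
fixed = transcode_escapes(escaped, build_vector('koi8-r'))
```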
> > If you knew that much, why didn't you use (write, if necessary)
> > an appropriate codec? I can't envision this being useful.
> If the data format describes its encoding, possibly containing data from
> several encodings in various spots, then perhaps it is best read as
> binary, and processed as binary until those definitions are found.
Exactly. That's precisely why bytes have a .decode method.
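As a rough illustration of the bytes-first approach, the sketch below assumes a made-up format whose first line is an ASCII declaration such as charset=koi8-r. Both the format and the function name decode_body are hypothetical.

```python
def decode_body(blob):
    """Split off an ASCII 'charset=NAME' header line, then decode the rest."""
    header, _, body = blob.partition(b'\n')
    # The header is assumed to be pure ASCII in this invented format.
    name = header.split(b'=', 1)[1].decode('ascii')
    # Everything stayed bytes until the encoding was known.
    return body.decode(name)

word = '\u043f\u0440\u0438\u0432\u0435\u0442'
blob = b'charset=koi8-r\n' + word.encode('koi8-r')
text = decode_body(blob)
```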
> But an alternative would be to read with surrogate escapes, and
> then when the encoding is determined, to transcode the data.
Not every one-line expression needs to be in the stdlib:
    data = data[:start] + data[start:end].encode('utf-8', errors='surrogateescape').decode('DTRT-now') + data[end:]
Note that you *do* need to know start and end, because of the
possibility of "several encodings", where once you apply this
technique to the whole text, you can't recover the surrogates when you
get the encoding wrong.
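The slice-wise re-decoding described above can be sketched as follows. The helper redecode_slice, the sample bytes, and the hard-coded offsets are assumptions made for illustration; in real data start and end would come from the format itself.

```python
def redecode_slice(data, start, end, encoding):
    """Re-decode data[start:end] with the now-known encoding (hypothetical helper)."""
    # Turn the escaped surrogates in the slice back into the original bytes.
    raw = data[start:end].encode('utf-8', errors='surrogateescape')
    # str is immutable, so build a new string around the repaired slice.
    return data[:start] + raw.decode(encoding) + data[end:]

# Assumed input: an ASCII label followed by a Latin-1 field.
blob = b'name: caf\xe9\n'
data = blob.decode('utf-8', errors='surrogateescape')
fixed = redecode_slice(data, 6, 10, 'latin-1')
```

Only the slice is touched, so text outside it keeps its surrogates and can still be re-decoded later with a different encoding.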
> Previously, a proposal was made to reverse the surrogate escapes to
> the original bytes, and then apply the (now known) appropriate
Sure. And in fact I do this kind of thing all the time in Emacs,
using the decode(encode(slice)) approach. The only times in 25 years
of working with the insanity of digitized Japanese I've had a use for
anything other than that is when I don't have a round-tripping codec.
In that case I have to preserve the bytes or suffer lossy conversion
anyway, regardless of the method used to reconvert.
But surrogateescape is necessarily round-tripping (maybe with a few
exceptions in Chinese and a very small number in other languages, but
those failures are due to Unicode, not to surrogateescape).
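A quick check of the round-trip property at the byte level, as a sketch; the function name round_trips is hypothetical.

```python
def round_trips(raw, codec):
    """True if decoding and re-encoding with surrogateescape restores raw."""
    text = raw.decode(codec, errors='surrogateescape')
    return text.encode(codec, errors='surrogateescape') == raw

# Every possible byte value, including ones that are invalid in both codecs.
all_bytes = bytes(range(256))
results = {codec: round_trips(all_bytes, codec) for codec in ('ascii', 'utf-8')}
```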
> There are not appropriate codecs that can convert directly from
> surrogate escapes to the desired end result.
And there currently cannot be. Codecs are bytes<->str, not str->str.
> This technique could be used instead, for single-byte, non-escaped
That's pure theory, not a use case. We have codecs for all the
encodings with significant numbers of users, and writing a new one
simply isn't that hard.