On 8/28/2014 12:30 AM, MRAB wrote:
On 2014-08-28 05:56, Glenn Linderman wrote:
On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
Glenn Linderman writes:
> On 8/26/2014 4:31 AM, MRAB wrote:
> > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >> Nick Coghlan writes:
> > How about:
> >
> > replace_surrogate_escapes(s, replacement='\uFFFD')
> >
> > If you want them removed, just pass an empty string as the
> > replacement.
That seems better to me (I had too much C for breakfast, I think).
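
For reference, a minimal sketch of what the proposed function might look
like. The name and signature come from the proposal above; the body is an
assumption, not an existing stdlib API. It relies only on the fact that the
'surrogateescape' error handler maps undecodable bytes 0x80..0xFF to the lone
surrogates U+DC80..U+DCFF.

```python
import re

# Lone surrogates produced by the 'surrogateescape' error handler
# always fall in the range U+DC80..U+DCFF (one per undecodable byte).
_SURROGATE_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\uFFFD'):
    """Hypothetical helper: swap each escaped byte for `replacement`."""
    # A callable is used so that `replacement` is inserted literally,
    # without re's backslash processing of replacement strings.
    return _SURROGATE_ESCAPES.sub(lambda m: replacement, s)


raw = b'caf\xe9 ok'
text = raw.decode('ascii', errors='surrogateescape')

replaced = replace_surrogate_escapes(text)      # escape becomes U+FFFD
removed = replace_surrogate_escapes(text, '')   # escape is dropped
```

Passing an empty string removes the escapes, as suggested above.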
> And further, replacement could be a vector of 128 characters, to do
> immediate transcoding,
Using what encoding?
The vector would contain the transcoding. Each lone surrogate would map
to a character in the vector.
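
A rough sketch of the vector idea, assuming the vector is a 128-character
string whose entry i stands for byte 0x80 + i. The function name
`transcode_surrogate_escapes` and the table layout are hypothetical, chosen
only for illustration.

```python
def transcode_surrogate_escapes(s, table):
    """Hypothetical helper: map each lone surrogate through `table`.

    `table` is a 128-character string; table[i] is the character for
    byte 0x80 + i, which surrogateescape stored as U+DC80 + i.
    """
    if len(table) != 128:
        raise ValueError('table must contain exactly 128 characters')
    # str.translate accepts a dict keyed by code point.
    mapping = {0xDC80 + i: ch for i, ch in enumerate(table)}
    return s.translate(mapping)


# Example table for Latin-1: bytes 0x80..0xFF map to U+0080..U+00FF.
latin1_table = bytes(range(0x80, 0x100)).decode('latin-1')

text = b'caf\xe9'.decode('ascii', errors='surrogateescape')
result = transcode_surrogate_escapes(text, latin1_table)
```

Only the escaped characters are touched; anything that decoded successfully
is passed through unchanged.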
If you knew that much, why didn't you use (write, if necessary) an
appropriate codec? I can't envision this being useful.
If the data format describes its encoding, possibly containing data from
several encodings in various spots, then perhaps it is best read as
binary, and processed as binary until those definitions are found.
But an alternative would be to read with surrogate escapes, and then,
when the encoding is determined, to transcode the data. Previously, a
proposal was made to reverse the surrogate escapes to the original
bytes, and then apply the (now known) appropriate codec. There are no
appropriate codecs that can convert directly from surrogate escapes to
the desired end result. This technique could be used instead, for
single-byte, non-escaped encodings. On the other hand, writing specialty
codecs for the purpose would be more general.
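
The earlier proposal (reverse the escapes to bytes, then apply the real
codec) can be sketched with the existing `surrogateescape` error handler.
The helper name `redecode` is hypothetical; this assumes the whole string
was originally decoded with `assumed` and `errors='surrogateescape'`.

```python
def redecode(s, assumed, actual):
    """Hypothetical helper: undo a surrogateescape decode, then decode
    the recovered bytes with the codec that is now known to be right."""
    # Encoding with the same codec and error handler restores the
    # original byte sequence, escapes included.
    original = s.encode(assumed, errors='surrogateescape')
    return original.decode(actual)


raw = b'na\xefve \x80'
text = raw.decode('ascii', errors='surrogateescape')

roundtrip = text.encode('ascii', errors='surrogateescape')
fixed = redecode(text, 'ascii', 'cp1252')
```

Because the round trip goes through the original bytes, this also works for
multi-byte encodings, which a per-character vector cannot handle.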
There'll be a surrogate escape if a byte couldn't be decoded, but just
because a byte could be decoded, it doesn't mean that it's correct.
If you picked the wrong encoding, the other codepoints could be wrong
too.
Aha! Thanks for pointing out the flaw in my reasoning. But that means
it is also pretty useless to "replace_surrogate_escapes" at all, because
it only cleans out the non-decodable characters, not the incorrectly
decoded characters.
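
A small illustration of that flaw, as a sketch with made-up sample data:
UTF-8 bytes decoded as Latin-1 produce no surrogate escapes at all, because
every byte is decodable in Latin-1, yet the result is still wrong.

```python
def has_surrogate_escapes(s):
    """Return True if `s` contains any U+DC80..U+DCFF lone surrogate."""
    return any('\udc80' <= ch <= '\udcff' for ch in s)


raw = 'caf\u00e9'.encode('utf-8')   # b'caf\xc3\xa9'

# Wrong codec, but every byte decodes, so no escapes are produced.
wrong = raw.decode('latin-1', errors='surrogateescape')

# A stricter wrong codec does leave escapes behind for the same bytes.
escaped = raw.decode('ascii', errors='surrogateescape')

wrong_has_escapes = has_surrogate_escapes(wrong)
escaped_has_escapes = has_surrogate_escapes(escaped)
```

In the first case there is nothing for a surrogate-replacing function to
find, even though the text is mojibake.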