[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

Fri Aug 29 06:55:39 CEST 2014

On 29 August 2014 10:32, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>
>  > The current proposal on the issue tracker is to instead take advantage of
>  > the existing error handlers:
>  >
>  >     def convert_surrogateescape(data, errors='replace'):
>  >         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
>  >
>  > That code is short, but semantically dense
>
> And it doesn't implement your original suggestion of replacement with
> '?' (and another possibility for history buffs is 0x1A, ASCII SUB).  At
> least, AFAICT from the docs there's no way to specify the replacement
> character; decoding always uses U+FFFD.  (If I knew how to do that, I
> would have suggested this.)

If that actually matters in a given context, I can do an ordinary
string replacement later. I couldn't think of a case where it actually
mattered though - if "must be ASCII" was a requirement, then
backslashreplace was a suitable alternative that lost less information
(hence the RFE to make that also usable on input).

>  > (Added bonus: once you're alerted to the possibility, it's trivial
>  > to write your own version for existing Python 3 versions.
>
> I'm not sure that's true.  At least, to me that code was obvious -- I
> got the exact definition (except for the function name) on the first
> try -- but I ruled it out because it didn't implement your suggestion
> of replacement with '?', even as an option.

Yeah, part of the tracker discussion involved me realising that part
wasn't a necessary requirement - the key is being able to get rid of
the surrogates, or replace them with something readily identifiable,
and less about being able to control exactly what they get replaced
by.

> OTOH, I think a lot of the resistance to codec-based solutions is the
> misconception that en/decoding streams is expensive, or the
> misconception that Python's internal representation of text as an
> array of code points (rather than an array of "characters" or
> "grapheme clusters") is somehow insufficient for text processing.

We don't actually have any technical deep dives into how Python 3's
text handling works readily available online, so there's a lot of
speculation and misinformation floating around. My recent article
gives the high level context, but it really needs to be paired up with
a piece (or pieces) that go deep into the details of codec
optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE
Windows APIs, how the internal storage structure is determined at
allocation time, how it maintains compatibility with the legacy C
extension APIs, etc. The only current widely distributed articles on
those topics are written from a perspective that assumes we don't know
anything about Unicode, and are just making things unnecessarily
complicated (rather than solving hard cross platform compatibility and
text processing performance problems). That perspective is incorrect,
but "trust me, they're wrong" doesn't work very well with people that
are already angry.

Text manipulation is one of the most sophisticated subsystems in the
interpreter, though, so it's hard to know where to start on such a
series (and easy to get intimidated by the sheer magnitude of the work
involved in doing it right).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia