On 29 August 2014 10:32, Stephen J. Turnbull <email@example.com> wrote:
Nick Coghlan writes:
The current proposal on the issue tracker is to instead take advantage of the existing error handlers:
    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
That code is short, but semantically dense.
And it doesn't implement your original suggestion of replacement with '?' (and another possibility for history buffs is 0x1A, ASCII SUB). At least, AFAICT from the docs there's no way to specify the replacement character; decoding always uses U+FFFD. (If I knew how to do that, I would have suggested this.)
If that actually matters in a given context, I can do an ordinary string replacement later. I couldn't think of a case where it actually mattered, though - if "must be ASCII" were a requirement, then backslashreplace would be a suitable alternative that loses less information (hence the RFE to make that handler also usable on input).
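A sketch of what that looks like in practice (the sample bytes here are made up for illustration, and using backslashreplace on decoding assumes the RFE mentioned above has landed, as it has in modern Python 3):

```python
def convert_surrogateescape(data, errors='replace'):
    # Re-encode the smuggled bytes back out, then decode again
    # with the chosen error handler
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

# Simulate input decoded with errors='surrogateescape': the invalid
# 0xFF byte is smuggled through as the lone surrogate U+DCFF
raw = b'caf\xc3\xa9 \xff'
data = raw.decode('utf-8', 'surrogateescape')   # 'caf\xe9 \udcff'

# errors='replace' turns the surrogate into U+FFFD...
cleaned = convert_surrogateescape(data)         # 'caf\xe9 \ufffd'

# ...which an ordinary string replacement can then swap for '?'
print(cleaned.replace('\ufffd', '?'))           # caf? -> "café ?"

# backslashreplace loses less information: the byte value survives
print(convert_surrogateescape(data, 'backslashreplace'))   # "café \xff"
```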
(Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions.)
I'm not sure that's true. At least, to me that code was obvious -- I got the exact definition (except for the function name) on the first try -- but I ruled it out because it didn't implement your suggestion of replacement with '?', even as an option.
Yeah, part of the tracker discussion involved me realising that this part wasn't a necessary requirement - the key is being able to get rid of the surrogates, or replace them with something readily identifiable; it's less about being able to control exactly what they get replaced with.
OTOH, I think a lot of the resistance to codec-based solutions is the misconception that en/decoding streams is expensive, or the misconception that Python's internal representation of text as an array of code points (rather than an array of "characters" or "grapheme clusters") is somehow insufficient for text processing.
We don't actually have any technical deep dives into how Python 3's text handling works readily available online, so there's a lot of speculation and misinformation floating around. My recent article gives the high-level context, but it really needs to be paired with a piece (or pieces) that goes deep into the details: codec optimisation, the UTF-8 caching, how it integrates with the UTF-16-LE Windows APIs, how the internal storage structure is determined at allocation time, how it maintains compatibility with the legacy C extension APIs, and so on.

The only widely distributed articles currently covering those topics are written from a perspective that assumes we don't know anything about Unicode and are just making things unnecessarily complicated (rather than solving hard cross-platform compatibility and text processing performance problems). That perspective is incorrect, but "trust me, they're wrong" doesn't work very well with people who are already angry.
Text manipulation is one of the most sophisticated subsystems in the interpreter, though, so it's hard to know where to start on such a series (and easy to get intimidated by the sheer magnitude of the work involved in doing it right).