Nick Coghlan writes:
> The current proposal on the issue tracker is to instead take advantage of the existing error handlers:
>
>     def convert_surrogateescape(data, errors='replace'):
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
>
> That code is short, but semantically dense
And it doesn't implement your original suggestion of replacement with '?' (another possibility, for history buffs, is 0x1A, ASCII SUB). At least, AFAICT from the docs, there's no way to specify the replacement character; when decoding, the 'replace' handler always uses U+FFFD. (If I knew how to do that, I would have suggested this.)
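To make the difference concrete, here is a minimal sketch. The first function is the one quoted above; the second, `replace_escaped_bytes`, is a hypothetical helper (the name and the regex approach are assumptions for illustration, not part of the tracker proposal) that substitutes a caller-chosen character for each surrogate-escaped byte instead of round-tripping through the codec.

```python
import re

# Lone surrogates U+DC80..U+DCFF are what the 'surrogateescape' error
# handler produces for undecodable bytes 0x80..0xFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')


def convert_surrogateescape(data, errors='replace'):
    # The function from the quoted proposal: turn the escapes back into
    # the original bytes, then decode again with a different handler.
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)


def replace_escaped_bytes(data, replacement='?'):
    # Hypothetical alternative (an assumption, not the proposed API):
    # replace each escaped byte with a character of the caller's choice.
    return _ESCAPED_BYTE.sub(replacement, data)


# A Latin-1 encoded name, decoded as UTF-8 with surrogateescape.
raw = b'caf\xe9'.decode('utf-8', 'surrogateescape')

print(ascii(convert_surrogateescape(raw)))        # 'caf\ufffd'
print(ascii(replace_escaped_bytes(raw)))          # 'caf?'
print(ascii(replace_escaped_bytes(raw, '\x1a')))  # 'caf\x1a'
```

Note that the two approaches are not guaranteed to agree on how many replacement characters they emit: the regex version replaces every escaped byte individually, while the codec decides for itself how to group invalid bytes.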
> (Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions.
I'm not sure that's true. At least, to me that code was obvious -- I got the exact definition (except for the function name) on the first try -- but I ruled it out because it didn't implement your suggestion of replacement with '?', even as an option.

OTOH, I think a lot of the resistance to codec-based solutions is the misconception that en/decoding streams is expensive, or the misconception that Python's internal representation of text as an array of code points (rather than an array of "characters" or "grapheme clusters") is somehow insufficient for text processing.

Steve