[Python-Dev] Cleaning up surrogate escaped strings (was Bytes path related questions for Guido)

Thu Aug 28 14:26:16 CEST 2014

On 26 Aug 2014 21:34, "MRAB" <python at mrabarnett.plus.com> wrote:
>
> On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>
>> Nick Coghlan writes:
>>
>>   > "purge_surrogate_escapes" was the other term that occurred to me.
>>
>> "purge" suggests removal, not replacement.  That may be useful too.
>>
>> neutralize_surrogate_escapes(s, remove=False, replacement='\uFFFD')
>>
> How about:
>
>     replace_surrogate_escapes(s, replacement='\uFFFD')
>
> If you want them removed, just pass an empty string as the replacement.

The current proposal on the issue tracker is to instead take advantage of
the existing error handlers:

    def convert_surrogateescape(data, errors='replace'):
        return data.encode('utf-8', 'surrogateescape').decode('utf-8',
errors)

That code is short, but semantically dense - it took a few iterations to
come up with that version. (Added bonus: once you're alerted to the
possibility, it's trivial to write your own version for existing Python 3
versions. The standard name just makes it easier to look up when you come
across it in a piece of code, and provides the option of optimising it
later if it ever seems worth the extra work)

I also filed a separate RFE to make backslashreplace usable on input, since
that allows the option of separating the replacement operation from the
encoding operation.

Cheers,
Nick.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20140828/97f55de8/attachment.html>