[Python-Dev] Bytes path related questions for Guido

Sun Aug 24 15:04:31 CEST 2014

On 24 August 2014 14:44, Nick Coghlan <ncoghlan at gmail.com> wrote:
> 2. Should we add some additional helpers to the string module for
> dealing with surrogate escaped bytes and other techniques for
> smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escaped_surrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another
> specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and
> then decodes it again using the specified encoding (the old encoding
> defaults to 'latin-1' to match the assumptions in WSGI)
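
For illustration only, the constant and the redecode helper from the quoted list might look roughly like the sketch below. Neither name exists in the standard library; the signatures, the "old_encoding" parameter name and the "errors" argument are assumptions made for this example, not part of the proposal text.

```python
# Hypothetical sketch of two of the proposed helpers; neither exists in
# the stdlib, and the parameter names here are assumptions.

# surrogateescape maps undecodable bytes 0x80-0xFF to U+DC80-U+DCFF,
# which gives the 128 escaped code points mentioned in the proposal.
escaped_surrogates = ''.join(chr(cp) for cp in range(0xDC80, 0xDD00))

def redecode(s, encoding, old_encoding='latin-1', errors='strict'):
    """Encode s back to bytes with old_encoding, then decode as encoding."""
    return s.encode(old_encoding, errors).decode(encoding, errors)

# A WSGI-style value: UTF-8 bytes that were decoded as latin-1.
wsgi_value = 'caf\u00e9'.encode('utf-8').decode('latin-1')
fixed = redecode(wsgi_value, 'utf-8')
```

Here wsgi_value is the mojibake form of the original text, and fixed recovers it by reversing the latin-1 decoding step.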

Serhiy & Ezio convinced me to scale this one back to a proposal for
"codecs.clean_surrogate_escapes(s)", which replaces any surrogates that
may have been produced by surrogateescape. (That's what string.clean()
above was supposed to be, but my description was not correct, and the
name was too vague for that error to be obvious to the reader.)

"s != codecs.clean_surrogate_escapes(s)" would then become the check
for "does this string contain any surrogate escaped bytes?"
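
A minimal sketch of how that check could behave, assuming the helper replaces each escaped surrogate with U+FFFD by default. The function below is a stand-in written for this example: codecs.clean_surrogate_escapes is only a proposal here, and the "replacement" parameter and the has_surrogate_escapes() wrapper are assumptions.

```python
import re

# Lone surrogates that the surrogateescape error handler can produce:
# undecodable bytes 0x80-0xFF are mapped to U+DC80-U+DCFF.
_ESCAPED = re.compile('[\udc80-\udcff]')

def clean_surrogate_escapes(s, replacement='\ufffd'):
    """Stand-in for the proposed helper (name and default are assumptions)."""
    return _ESCAPED.sub(replacement, s)

def has_surrogate_escapes(s):
    """The check described above: True if s holds surrogate escaped bytes."""
    return s != clean_surrogate_escapes(s)

# A latin-1 encoded file name decoded as UTF-8 with surrogateescape:
raw = b'caf\xe9.txt'
name = raw.decode('utf-8', 'surrogateescape')
cleaned = clean_surrogate_escapes(name)
```

The original bytes can still be recovered from name with name.encode('utf-8', 'surrogateescape'), whereas cleaned is safe to encode or display but no longer round-trips.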

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia