[Python-Dev] Bytes path related questions for Guido

Sun Aug 24 16:23:52 CEST 2014

On 24/08/2014 09:04, Nick Coghlan wrote:
> On 24 August 2014 14:44, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> 2. Should we add some additional helpers to the string module for
>> dealing with surrogate escaped bytes and other techniques for
>> smuggling arbitrary binary data as text?
>>
>> My proposal [3] is to add:
>>
>> * string.escaped_surrogates (constant with the 128 escaped code points)
>> * string.clean(s): replaces surrogates with '\ufffd' or another
>> specified code point
>> * string.redecode(s, encoding): encodes a string back to bytes and
>> then decodes it again using the specified encoding (the old encoding
>> defaults to 'latin-1' to match the assumptions in WSGI)
>
>
> Serhiy & Ezio convinced me to scale this one back to a proposal for
> "codecs.clean_surrogate_escapes(s)", which replaces surrogates that
> may be produced by surrogateescape (that's what string.clean() above
> was supposed to be, but my description was not correct, and the name
> was too vague for that error to be obvious to the reader)
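
For concreteness, the helpers described in the quoted proposal might look
roughly like the sketch below. None of these names exist in the standard
library; ESCAPED_SURROGATES, clean_surrogate_escapes() and redecode() are
hypothetical, and the use of the "surrogateescape" error handler inside
redecode() is an assumption rather than something the proposal spells out.

```python
# Hypothetical sketch of the proposed helpers; nothing here is a real
# stdlib API, and the names are made up for illustration.

# The surrogateescape error handler maps each undecodable byte
# 0x80..0xFF to the lone surrogate U+DC80..U+DCFF (128 code points).
ESCAPED_SURROGATES = "".join(chr(cp) for cp in range(0xDC80, 0xDD00))


def clean_surrogate_escapes(s, replacement="\ufffd"):
    """Replace every escaped surrogate in s with replacement (lossy)."""
    table = dict.fromkeys(map(ord, ESCAPED_SURROGATES), replacement)
    return s.translate(table)


def redecode(s, encoding, old_encoding="latin-1"):
    """Encode s back to bytes with old_encoding, then decode with encoding.

    Assumption: surrogateescape is used in both directions so that
    smuggled bytes survive the trip.
    """
    raw = s.encode(old_encoding, "surrogateescape")
    return raw.decode(encoding, "surrogateescape")
```

With these definitions, redecode("Ã©", "utf-8") would turn UTF-8 bytes that
were wrongly decoded as latin-1 back into "é".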

"clean" conveys the wrong meaning. The name should use a scary word such as
"trap". "Cleaning" surrogates is unlikely to be the right procedure when
they were produced by undecodable byte sequences.
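
A minimal sketch of why replacement is lossy, using only built-in str and
bytes methods. The file name is made up, and treating a strict encode as a
"trap" is just one possible reading of the word, not something specified in
this thread.

```python
# A byte string that is not valid UTF-8 (0xE9 is "é" in Latin-1).
raw = b"report-\xe9.txt"

# surrogateescape smuggles the bad byte through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
assert name == "report-\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw

# "Cleaning" swaps the surrogate for U+FFFD; the original byte is gone,
# so the result no longer refers to the same byte sequence.
cleaned = name.replace("\udce9", "\ufffd")
assert cleaned.encode("utf-8") != raw

# A strict encode refuses lone surrogates, which makes the problem
# visible instead of silently altering the data.
try:
    name.encode("utf-8")
    trapped = False
except UnicodeEncodeError:
    trapped = True
```

Here the cleaned string encodes without error, but to different bytes than
the ones the name originally came from.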

Regards

Antoine.