[Python-ideas] Processing surrogates in

Thu May 14 10:20:28 CEST 2015

On 14 May 2015 at 15:31, Serhiy Storchaka <storchaka at gmail.com> wrote:
> On 13.05.15 19:22, Nick Coghlan wrote:
>>
>> Three potential expected sources of surrogates have been identified:
>>
>> * escaped surrogates smuggling arbitrary bytes passed through decoding
>> by the "surrogateescape" error handler
>> * surrogates passed through the decoding process by the
>> "surrogatepass" error handler
>> * decomposed surrogate pairs for astral characters
>
>
> * json
> * pickle
> * email
> * nntplib
> * SimpleHTTPRequestHandler
> * wsgiref
> * cgi
> * tarfile
> * filesystem names (os.fsdecode) and other os calls
> * platform and sysconfig
> * other serializers

Right, those are the kinds of boundary APIs that drove the
introduction of Python 3's arbitrary bytes smuggling capabilities in
the first place.
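
A minimal sketch of what that smuggling looks like in practice,
assuming a filename that contains a Latin-1 byte which isn't valid
UTF-8 (the filename and variable names are purely illustrative):

```python
# Bytes that are not valid UTF-8: 0xE9 is Latin-1 for "e acute".
raw = b"caf\xe9.txt"

# "surrogateescape" smuggles the undecodable byte 0xE9 through
# decoding as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is where boundary
# APIs typically trip over smuggled data.
try:
    text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# "surrogatepass" instead lets surrogate code points through the codec
# as if they were ordinary code points.
lone = "\ud800"
passed = lone.encode("utf-8", errors="surrogatepass")
restored = passed.decode("utf-8", errors="surrogatepass")
```

The escaped form only round-trips when the same codec and error
handler are used on the way back out, which is why strings carrying
these surrogates cause trouble once they reach a strict encoder.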

I realised it's potentially worth waiting to see the impact of these
key changes:

* the restoration of printf-style formatting for binary data
* the introduction of bytes.hex()
* the rise of systemd as the preferred init system for Linux (while
that doesn't solve the "bad locale settings" problem for *nix systems,
it tackles a reasonable chunk of it)

The first two should make it easier to just stay in the binary domain
when working with arbitrary binary data, while the last will hopefully
eliminate one of the common sources of declared-vs-actual encoding
mismatches.
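
As a rough sketch of staying in the binary domain with those first
two features (both available from Python 3.5; the header layout and
variable names below are made up for illustration):

```python
# printf-style formatting applied directly to bytes: no decode step.
payload = b"\xde\xad\xbe\xef"
header = b"%s: %d bytes" % (b"payload", len(payload))

# %b interpolates a bytes object as-is, including non-ASCII bytes.
frame = b"[%b]" % payload

# bytes.hex() gives a printable representation without picking a
# text encoding, and bytes.fromhex() reverses it.
hex_text = payload.hex()
recovered = bytes.fromhex(hex_text)
```

Nothing here passes through str until hex() is called, so there is no
point at which arbitrary bytes need to be smuggled via surrogates.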

I *expect* we'll still want these proposed APIs (or a comparable
alternative) by the time 3.6 rolls around, but I also see value in
continuing to be cautious about adding them (since we'll be stuck with
them once we do, although I guess we could also go down the path of
declaring "string.internals" to be a provisional API in PEP 411
terms).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia