[Python-Dev] Patch making the current email package (mostly) support bytes

Tue Oct 5 14:05:33 CEST 2010

On Tue, Oct 5, 2010 at 3:41 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> R. David Murray writes:
>  > Only if the email package contains a coding error would the
>  > surrogates escape and cause problems for user code.
>
> I don't think it is reasonable to internalize surrogates that way;
> some applications *will* want to look at them and do something useful
> with them (delete them or replace them with U+FFFD or ...).  However,
> I argue below that the presence of surrogates already means the user
> code is under fire, and this puts the problem in a canonical form so
> the user code can prepare for it (if that is desirable).

Hang on here, this objection doesn't seem to quite mesh with what RDM
is proposing (and the similar trick I am considering for
urllib.parse).

The basic issue is having an algorithm that is designed to operate on
character data and depends on multiple ASCII constants stored as str
objects.

In Python 2.x, those algorithms could innately operate on str objects
in any ASCII compatible encoding, as well as on unicode objects (due
to the implicit promotion of the ASCII constants to unicode when
unicode input was encountered).

In Py3k, that trick broke. Now those algorithms only operate on str
objects, and bytes input fails, even when it uses an ASCII compatible
encoding.
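A minimal sketch of that breakage, assuming a made-up helper (split_scheme is hypothetical, not a urllib.parse function) that relies on a single str constant:

```python
# Hypothetical helper for illustration only; not part of urllib.parse.
def split_scheme(url):
    # ":" is a str constant, so this only works when url is also a str.
    scheme, _, rest = url.partition(":")
    return scheme, rest


print(split_scheme("http://example.com"))

try:
    # In Python 3 there is no implicit promotion between bytes and str,
    # so mixing bytes input with the str constant raises TypeError.
    split_scheme(b"http://example.com")
except TypeError as exc:
    print("bytes input fails:", exc)
```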

For urllib.parse, the external API will be "str in -> str out, bytes
in -> bytes out". Whether that is internally implemented by
duplicating all the ASCII constants with both bytes and str flavours
(as my current patch does), or implicitly (and temporarily) "decoding"
the bytes values using ascii+surrogateescape or latin-1 (a pair of
alternative approaches I plan to explore soon) should be completely
transparent to the user of the API. If a user can easily tell which of
these I am doing just through the external behaviour of the documented
API, then I'll have made a mistake somewhere.
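As a rough sketch of the "decode temporarily" option (not the actual patch; the names _coerce and split_scheme are hypothetical), the str-only algorithm can be wrapped so that bytes are decoded with ascii+surrogateescape on the way in and encoded the same way on the way out:

```python
# Hypothetical names for illustration; this is not the urllib.parse patch.
def _coerce(value):
    """Return (text, was_bytes), decoding bytes without losing any data."""
    if isinstance(value, bytes):
        # Non-ASCII bytes become lone surrogates (U+DC80..U+DCFF) that
        # encode back to the original bytes unchanged.
        return value.decode("ascii", "surrogateescape"), True
    return value, False


def split_scheme(url):
    text, was_bytes = _coerce(url)
    # The algorithm itself only ever sees str and str constants.
    scheme, _, rest = text.partition(":")
    if was_bytes:
        return (scheme.encode("ascii", "surrogateescape"),
                rest.encode("ascii", "surrogateescape"))
    return scheme, rest


print(split_scheme("http://example.com"))
print(split_scheme(b"http://ex\xe9mple.com"))
```

The caller only ever sees str in -> str out and bytes in -> bytes out; swapping the codec for latin-1 should make no difference that is visible from outside.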

My understanding is that email6 in 3.3 will essentially follow that
same model. What I believe RDM is suggesting is an in-between approach
for the 3.2 email module:

- if you pass in bytes data that isn't 7-bit clean and naively use the
  str APIs to access the headers, then it will complain loudly if it is
  about to return escaped data (but will decode the body in accordance
  with the Content-Transfer-Encoding)
- if you pass in bytes data and know what you are doing, then you can
  access that raw bytes data and do your own decoding
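To make the "complain loudly" idea concrete, here is a small sketch under my own assumptions. The helpers has_escaped_bytes and header_as_str are hypothetical and are not the email package API; they only show how escaped data can be detected before it reaches user code:

```python
# Hypothetical helpers for illustration; not the email package API.
def has_escaped_bytes(text):
    """True if text contains surrogates produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def header_as_str(raw):
    """Return a header as str, refusing to hand back escaped data."""
    text = raw.decode("ascii", "surrogateescape")
    if has_escaped_bytes(text):
        raise ValueError("header is not 7-bit clean; use the raw bytes")
    return text


print(header_as_str(b"Subject: hello"))

try:
    header_as_str(b"Subject: caf\xe9")
except ValueError as exc:
    print("refused:", exc)
```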

I've probably grossly oversimplified what RDM is suggesting, but it
sounds plausible as a useful interim stepping stone to the more
comprehensive type separation in email6.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia