[Python-Dev] Patch making the current email package (mostly) support bytes
R. David Murray
rdmurray at bitdance.com
Wed Oct 6 18:18:03 CEST 2010
On Wed, 06 Oct 2010 12:22:18 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>
> > - if you pass in bytes data and know what you are doing, then you can
> > access that raw bytes data and do your own decoding
>
> At what level, though?
>
> To take an interesting example I used to see frequently:
>
> From: taro at tokyo.jp
> (Taro Yamada in 8-bit Shift JIS)
>
> So I guess you are suggesting that the email module can RFC 822 parse
> that, and
>
> 1. Refuse to return the unwrapped (ie, single line) form of the whole
> field, except as bytes.
> 2. Refuse to return the content of the From field, except as bytes.
> 3. Return the email address parsed from the From field.
> 4. Refuse to return the comment, except as bytes.
5. Return the content, with non-ASCII bytes replaced with ?
characters.
In other words, my proposed patch only makes email5 1/8 to 1/4
broken, instead of half broken as it is now. But not un-broken
enough for Mailman, it sounds like.
> That's fine. But suppose I have a private or newly defined header
> that is structured? Now I have two choices:
>
> 1. Write a version of my private parser for both str (the normal
> case) and bytes (if accessing the value as str raises)
>
> 2. Always get the bytes and convert them to str (probably using the
> same .decode('ascii','surrogate-escape') call that email uses but
> won't let me have the value of!), then use a common str parser.
Yes, this is exactly the dilemma faced by the entire email package.
The current email6 code attempts to do a variation on (1) by having a
common parser that handles both strings and bytes using a dual subclass
approach. This patch is trying out (2). If you have a private header
parser, you would ideally like to be able to use the same mechanism as the
email package to solve the problem. For email6 you'd be able to register
your header parser and get handed the input like the built-in parser and
be able to use the tools provided by the built-in parser to do your work.
In email5 there is no way that I know of for you to register a private
parser, so you need access to the raw input for the header in one form
or another.
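
To make option (2) concrete, here is a minimal sketch, not code from the
patch. It assumes a made-up private header of the form "key=value; key=value"
and a hypothetical str-only parser, parse_private_header. Note that the
Python error handler is spelled 'surrogateescape' (no hyphen). The point is
only that the raw bytes survive the trip through str, so a single str parser
is enough and the caller can still apply the real codec afterwards.

```python
def parse_private_header(value: str) -> dict:
    """Hypothetical str-only parser for a 'key=value; key=value' header."""
    params = {}
    for item in value.split(";"):
        key, sep, val = item.partition("=")
        if sep:
            params[key.strip().lower()] = val.strip()
    return params


# Raw header value containing 8-bit Shift JIS bytes, which are not legal
# in an RFC 822 header but do show up in real mail.
raw = b"id=42; name=" + "山田".encode("shift_jis")

# Decode as ASCII; each non-ASCII byte becomes a lone surrogate
# in the range U+DC80..U+DCFF instead of raising UnicodeDecodeError.
as_str = raw.decode("ascii", "surrogateescape")

# The ordinary str parser handles the value without a bytes variant.
params = parse_private_header(as_str)

# Encoding with the same handler restores the original bytes, which the
# caller (who knows the real charset) can then decode properly.
name_bytes = params["name"].encode("ascii", "surrogateescape")
name = name_bytes.decode("shift_jis")
```

This only works if the application is handed the surrogate-escaped string in
the first place, which is the access being asked for.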
If we go this route (as opposed to only handling headers with 8bit data by
sanitizing them), then we need to think about the email5 header parsers
as well (decode_header and parseaddr). They are of course going to have
the same problems as the rest of the email package with parsing bytes,
and you are suggesting that access to those headers' 8bit bytes is needed.
One option would be to add a keyword to the get and get_all methods
that instructs it to return the string with the surrogate-escaped
bytes, which can then be passed onward to decode_header, parseaddr,
or a custom decoder. Then I need to look at what needs to be added to
those methods to handle the escaped bytes, and from what you say they
too need a keyword telling them to preserve the escaped bytes on output
(a "yes I know what I'm doing" flag...'preserve_escaped_bytes=True'?).
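
A rough sketch of the two behaviours such a flag would choose between. The
function header_as_str is a hypothetical standalone helper, not a method of
the email package, and preserve_escaped_bytes is only the name floated above,
not an existing keyword. By default it sanitizes by replacing each escaped
byte with '?'; with the flag set it returns the surrogate-escaped string
untouched.

```python
import re

# Lone surrogates produced by the 'surrogateescape' handler for bytes
# 0x80..0xFF.
_ESCAPED_BYTES = re.compile("[\udc80-\udcff]")


def header_as_str(raw: bytes, preserve_escaped_bytes: bool = False) -> str:
    """Hypothetical helper: decode a raw header value to str.

    By default, non-ASCII bytes are replaced with '?'. With
    preserve_escaped_bytes=True they are kept as lone surrogates so the
    caller can recover the original bytes.
    """
    value = raw.decode("ascii", "surrogateescape")
    if preserve_escaped_bytes:
        return value
    return _ESCAPED_BYTES.sub("?", value)


sanitized = header_as_str(b"caf\xe9")
preserved = header_as_str(b"caf\xe9", preserve_escaped_bytes=True)
```

The sanitized form is safe to hand to code that does not expect surrogates,
but the original bytes are gone. The preserved form can be turned back into
the original bytes with .encode('ascii', 'surrogateescape'), at the cost of
a str that will raise if it is encoded without that handler.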
> Note that this is more problematic than it looks, since the
> appropriate base codec may require information from higher-level
> structures (eg, qp codec tags or a Content-Type header's charset
> field).
You'll have to give me an example of where this is a problem but is
not already a problem in email4.
> Why should I reproduce email's logic here? I don't care if the
> default or concise API raises on surrogates in the str value. But I'm
> pretty sure that I will want to use str values containing surrogates
> in these contexts (for the same reasons that email module does, for
> example), rather than work with bytes sometimes and strs sometimes.
>
> Please provide a way to return strs-with-surrogates if I ask for them.
Does my proposal make sense? But note, it raises exactly the backward
compatibility concerns you mention in your next email (that I will reply
to next). It is an open question whether it is worth opening that door
in order to be able to do extended handling on non-RFC conforming email
(as opposed to just sanitizing it and soldiering on).
--
R. David Murray www.bitdance.com