[Python-Dev] Patch making the current email package (mostly) support bytes
Stephen J. Turnbull
stephen at xemacs.org
Wed Oct 6 05:22:18 CEST 2010
Nick Coghlan writes:
> - if you pass in bytes data and know what you are doing, then you can
> access that raw bytes data and do your own decoding
At what level, though?
To take an interesting example I used to see frequently:
From: taro@tokyo.jp
    (Taro Yamada in 8-bit Shift JIS)
So I guess you are suggesting that the email module can RFC 822 parse
that, and
1. Refuse to return the unwrapped (i.e., single-line) form of the whole
field, except as bytes.
2. Refuse to return the content of the From field, except as bytes.
3. Return the email address parsed from the From field.
4. Refuse to return the comment, except as bytes.
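For concreteness, here is a rough sketch of those four cases, working on
a str decoded with the surrogateescape error handler. The helper
split_field and the sample name bytes are made up for illustration; this
is not how the email package is implemented, only one way the behavior
could look.

```python
from email.utils import parseaddr

# Hypothetical raw field: ASCII address, folded comment in 8-bit Shift JIS.
NAME_BYTES = "山田太郎".encode("shift_jis")
RAW_FIELD = b"From: taro@tokyo.jp\r\n    (" + NAME_BYTES + b")"


def split_field(raw_field):
    """Return (name, unfolded value) as str.

    Every non-ASCII byte is turned into a lone surrogate, so the result
    can be converted back to the original bytes without loss.
    """
    text = raw_field.decode("ascii", "surrogateescape")
    name, _, value = text.partition(":")
    # Unfold: drop the CRLF that precedes the continuation whitespace.
    return name, value.replace("\r\n", "").strip()


name, value = split_field(RAW_FIELD)

# Case 3: the address is pure ASCII, so it can be returned as plain str.
address = parseaddr(value)[1]

# Cases 1, 2 and 4: the 8-bit parts are only meaningful as bytes.
value_bytes = value.encode("ascii", "surrogateescape")
```

Here value.encode('ascii') would raise UnicodeEncodeError because of the
lone surrogates, which corresponds to the "refuse, except as bytes"
cases, while the address is still available as ordinary str.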
That's fine. But suppose I have a private or newly defined header
that is structured? Now I have two choices:
1. Write a version of my private parser for both str (the normal
case) and bytes (if accessing the value as str raises).
2. Always get the bytes and convert them to str (probably using the
same .decode('ascii', 'surrogateescape') call that email uses but
won't let me have the value of!), then use a common str parser.
Note that this is more problematic than it looks, since the
appropriate base codec may require information from higher-level
structures (e.g., qp codec tags or a Content-Type header's charset
parameter).
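A minimal sketch of the second choice, assuming a made-up private header
of the form "key=value; key=value". The names parse_params and parse_raw
are hypothetical, and the charset argument stands in for whatever the
higher-level structures would supply.

```python
def parse_params(value):
    """Hypothetical str-only parser for a private 'key=value; ...' header."""
    params = {}
    for item in value.split(";"):
        key, sep, val = item.partition("=")
        if sep:
            params[key.strip().lower()] = val.strip()
    return params


def parse_raw(raw_value, charset=None):
    """Decode bytes once, then reuse the common str parser.

    Non-ASCII bytes travel through the parser as lone surrogates. If the
    caller knows the real charset, each parsed value is converted back to
    bytes and decoded properly; otherwise the surrogates are left in place.
    """
    text = raw_value.decode("ascii", "surrogateescape")
    params = parse_params(text)
    if charset is None:
        return params
    return {
        key: val.encode("ascii", "surrogateescape").decode(charset)
        for key, val in params.items()
    }


raw = b"hop=tokyo; note=" + "山田".encode("shift_jis")
parsed = parse_raw(raw, charset="shift_jis")
```

The point is that only one parser is needed, but the caller has to repeat
the decode step and find the charset on its own.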
Why should I reproduce email's logic here? I don't care if the
default or concise API raises on surrogates in the str value. But I'm
pretty sure that I will want to use str values containing surrogates
in these contexts (for the same reasons that the email module does, for
example), rather than work with bytes sometimes and strs sometimes.
Please provide a way to return strs-with-surrogates if I ask for them.
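To illustrate the kind of accessor I mean, here is a purely hypothetical
sketch. RawHeaderValue, as_bytes, as_str and the surrogates flag are
invented names, not part of the email package or of the patch under
discussion.

```python
class RawHeaderValue:
    """Hypothetical wrapper around the undecoded bytes of one header value."""

    def __init__(self, raw):
        self._raw = bytes(raw)

    def as_bytes(self):
        """Return the raw bytes exactly as received."""
        return self._raw

    def as_str(self, surrogates=False):
        """Return the value as str.

        By default 8-bit data raises UnicodeDecodeError. With
        surrogates=True, each non-ASCII byte becomes a lone surrogate,
        so the str can be parsed and later turned back into bytes.
        """
        if surrogates:
            return self._raw.decode("ascii", "surrogateescape")
        return self._raw.decode("ascii")


value = RawHeaderValue(b"caf\xe9 society")
escaped = value.as_str(surrogates=True)
```

The default call stays strict, and the str-with-surrogates form is only
returned when explicitly requested.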