[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Thu Oct 8 23:59:38 CEST 2009
On approximately 10/8/2009 6:00 AM, came the following characters from
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:
>> The application options are to drop the attachment, or pass through
>> the corrupted bytes, and let the next application try to make sense
>> of it.
>
> Exactly, and it's not for the email package to say which is right.
>
> Here's a use case: I've got a Message that was parsed from wire input
> and I want to mangle the Subject heading to add the list prefix. I
> know exactly what charset the prefix is in because that's data I
> control. When I ask for the original Subject value, I'm handed an
> instance that I can use to try to figure out how to add the prefix.
>
> First thing I'll ask it is "are you a single chunk in my prefix
> charset (or compatible)?" If so, I can probably just prepend my
> prefix onto the value. If not, "are you composed of multiple valid
> chunks in different charsets?" If so, I know that I need to encode my
> prefix, but I can still prepend it to the header value (hopefully
> using the same API, and I don't care that the implementation could not
> use string concatenation).
>
> If not, then what? Maybe I don't care if some of the chunk charsets
> aren't known because I can still use the right encode+prepend
> strategy. But if the header is a gobbledegook of 8-bit bytes? I'm
> pretty sure I want to be able to ask the API if that's the case rather
> than get an exception. The thing I'm not so sure about is what
> happens if my application is just naive enough to just ask for the
> header as a unicode and that conversion can't be made. I /think/ it
> should raise an exception in that case. But then when I ask for the
> header value as a mass of bytes, that should succeed and return me the
> raw input.
So for this use case, it is known that all headers are ASCII on the
wire. The operation of prepending a list prefix should therefore not
care whether the Subject: value is valid or not... it can simply prepend
the list prefix, followed by SP, to the existing raw header.
The only remaining issue is line length limits, so sometimes it may have
to use CR LF TAB (a folded line) instead of a plain space.
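A rough sketch of that rule, for illustration only: the function name
prepend_ascii_prefix, its parameters, and the 78-character limit (the
RFC 5322 recommendation) are assumptions, not part of the email package.

```python
MAX_LINE = 78  # assumed limit: RFC 5322 recommended line length, excluding CRLF


def prepend_ascii_prefix(raw_value: bytes, prefix: bytes,
                         name: bytes = b"Subject") -> bytes:
    """Prepend an ASCII prefix to a raw header value without decoding it.

    raw_value is the header value as it came off the wire, without the
    field name, the ": " separator, or the final CRLF. It may contain
    anything, including undecodable 8-bit bytes.
    """
    prefix.decode("ascii")  # raises UnicodeDecodeError if the prefix is not ASCII

    # Only the first physical line matters for the length check.
    first_line = raw_value.split(b"\r\n", 1)[0]
    joined_len = len(name) + len(b": ") + len(prefix) + 1 + len(first_line)

    # Use a plain space if the first line still fits, otherwise fold.
    separator = b" " if joined_len <= MAX_LINE else b"\r\n\t"
    return prefix + separator + raw_value
```

Calling prepend_ascii_prefix(b"\xff\xfe junk", b"[Email-SIG]") succeeds
even though the value is not valid in any charset, because the raw bytes
are never interpreted.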
OK, so if the prefix is not ASCII, it gets encoded separately, with a
trailing SP included in the encoded text, and the result is then
prepended to the value, separated from it by SP or CR LF TAB depending
on the line length limit.
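The trailing SP goes inside the encoded text because RFC 2047 tells
decoders to ignore whitespace between two adjacent encoded-words, so a
separator outside the encoding would vanish if the existing value also
starts with an encoded-word. A minimal sketch, assuming a prefix short
enough for one encoded-word; encode_prefix and prepend_encoded_prefix
are hypothetical names, and only base64 plus the existing
email.header.decode_header / make_header functions are used.

```python
import base64
from email.header import decode_header, make_header


def encode_prefix(prefix: str, charset: str = "utf-8") -> str:
    """Return prefix plus a trailing space as one RFC 2047 "B" encoded-word."""
    payload = base64.b64encode((prefix + " ").encode(charset)).decode("ascii")
    word = "=?%s?b?%s?=" % (charset, payload)
    if len(word) > 75:  # RFC 2047 limit for a single encoded-word
        raise ValueError("prefix too long for a single encoded-word")
    return word


def prepend_encoded_prefix(raw_value: str, prefix: str) -> str:
    """Prepend an encoded prefix to a raw (ASCII wire text) header value.

    This sketch always separates with SP; choosing CR LF TAB when the
    line would be too long is left out to keep the example short.
    """
    return encode_prefix(prefix) + " " + raw_value


wire = prepend_encoded_prefix("=?utf-8?q?hello?=", "[Liste \u00e9]")
# The space between the two encoded-words is dropped on decoding; the
# one inside the encoded prefix survives.
decoded = str(make_header(decode_header(wire)))
```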
So to prepend into a text header, you shouldn't need to decode the
undecodable... there should be a prepend (and possibly also an append)
operation provided by the API, so that applications can tweak headers
without decoding. This allows useful behavior even if new methods of
encoding are invented that are not yet understood by a particular
version of the email library.
Asking for the header value (or whole header) in Unicode should decode
the chunks that are understandable and decodable, and leave the chunks
that are not understandable as
ASCII-converted-to-Unicode-but-still-possibly-weirdly-encoded ... I
think that is what the RFCs encourage.
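One way this lenient decoding could look, as a sketch only: the regular
expression and the names decode_word and to_unicode are assumptions
made for illustration. The standard email.header.decode_header is not
used here because the sketch needs the original text of each
encoded-word in order to pass undecodable ones through unchanged.

```python
import base64
import binascii
import quopri
import re

# Assumed pattern for an RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_word(match):
    """Return the decoded text of one encoded-word, or None if undecodable."""
    charset, encoding, payload = match.group(1), match.group(2).lower(), match.group(3)
    try:
        if encoding == "b":
            raw = base64.b64decode(payload, validate=True)
        else:
            # header=True also maps "_" to a space, as the "Q" encoding requires
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset)
    except (LookupError, UnicodeError, binascii.Error, ValueError):
        return None  # unknown charset, bad payload, or invalid bytes


def to_unicode(header: str) -> str:
    """Decode what can be decoded; leave every other chunk exactly as-is."""
    out = []
    pos = 0
    prev_decoded = False
    for match in ENCODED_WORD.finditer(header):
        gap = header[pos:match.start()]
        text = decode_word(match)
        decoded = text is not None
        # Whitespace between two adjacent decoded encoded-words is dropped.
        if not (prev_decoded and decoded and gap.strip() == ""):
            out.append(gap)
        out.append(text if decoded else match.group(0))
        prev_decoded = decoded
        pos = match.end()
    out.append(header[pos:])
    return "".join(out)
```

A chunk that uses an encoding letter this pattern does not know about
never matches, so it is also left alone as plain ASCII text.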
Asking for a header as bytes should return the wire data, if it is
available, or an encoding of the real data as wire data (like generate
would do). IIUC, there is no Unicode that cannot be encoded to wire
format. Once non-ASCII characters are included, the encoding usually
involves a variety of heuristics that may produce differing results,
all of which should decode back to the original data.
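To illustrate that last point, here is a small sketch in which the same
text is encoded to wire format in more than one way and each result
decodes back to the original. The helpers encode_b, encode_q and decode
are hypothetical names; Header, decode_header and make_header are the
existing email.header functions.

```python
import base64
from email.header import Header, decode_header, make_header


def encode_b(text: str, charset: str = "utf-8") -> str:
    """One RFC 2047 encoded-word using the base64 ("B") encoding."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?b?%s?=" % (charset, payload)


def encode_q(text: str, charset: str = "utf-8") -> str:
    """One RFC 2047 encoded-word using "Q", escaping every byte as =XX."""
    payload = "".join("=%02X" % byte for byte in text.encode(charset))
    return "=?%s?q?%s?=" % (charset, payload)


def decode(wire: str) -> str:
    """Decode wire text back to Unicode with the standard library."""
    return str(make_header(decode_header(wire)))


original = "Gr\u00fc\u00dfe"
candidates = [
    encode_b(original),
    encode_q(original),
    Header(original, "utf-8").encode(),  # whatever the library chooses
]
for wire in candidates:
    wire.encode("ascii")             # every candidate is pure ASCII
    assert decode(wire) == original  # and round-trips to the same text
```

Both helpers emit a single encoded-word, so they only suit short values;
longer text would have to be split across several encoded-words.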
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking