[Email-SIG] fixing the current email module

Glenn Linderman v+python at g.nevcal.com
Thu Oct 8 23:59:38 CEST 2009


On approximately 10/8/2009 6:00 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:
>> The application options are to drop the attachment, or pass through 
>> the corrupted bytes, and let the next application try to make sense 
>> of it.
>
> Exactly, and it's not for the email package to say which is right.
>
> Here's a use case: I've got a Message that was parsed from wire input 
> and I want to mangle the Subject heading to add the list prefix.  I 
> know exactly what charset the prefix is in because that's data I 
> control.  When I ask for the original Subject value, I'm handed an 
> instance that I can use to try to figure out how to add the prefix.
>
> First thing I'll ask it is "are you a single chunk in my prefix 
> charset (or compatible)?"  If so, I can probably just prepend my 
> prefix onto the value.  If not, "are you composed of multiple valid 
> chunks in different charsets?"  If so, I know that I need to encode my 
> prefix, but I can still prepend it to the header value (hopefully 
> using the same API, and I don't care that the implementation could not 
> use string concatenation).
>
> If not, then what?  Maybe I don't care if some of the chunk charsets 
> aren't known because I can still use the right encode+prepend 
> strategy.  But if the header is a gobbledegook of 8-bit bytes?  I'm 
> pretty sure I want to be able to ask the API if that's the case rather 
> than get an exception.  The thing I'm not so sure about is what 
> happens if my application is just naive enough to just ask for the 
> header as a unicode and that conversion can't be made.  I /think/ it 
> should raise an exception in that case.  But then when I ask for the 
> header value as a mass of bytes, that should succeed and return me the 
> raw input. 

So for this use case, it is known that all headers are ASCII on the 
wire.  The operation of prepending a list prefix therefore need not 
care whether the Subject: value is valid or not... it can simply 
prepend the list prefix, followed by SP, to the raw header value that 
already exists.

The only remaining issue is line length limits, so maybe it has to use 
CR LF TAB instead of space, sometimes.

OK, so if the prefix is not ASCII, it gets separately encoded, including 
a trailing SP, and then prepended to the value followed by SP or CR LF 
TAB depending on the line length limit.

So to prepend into a text header, you shouldn't need to decode the 
undecodable... there should be a prepend (and possibly also an append) 
operation provided by the API, so that applications can tweak headers 
without decoding.  This allows useful behavior even if new methods of 
encoding are invented that are not yet understood by a particular 
version of the email library.
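
A minimal sketch of such a prepend, working on the raw bytes only.  The 
names prepend_prefix and MAX_LINE are hypothetical, not part of the 
email package.  It assumes a well-formed "Name: value" line and a prefix 
short enough to fit in one RFC 2047 encoded-word, and it uses the 
78-character limit that RFC 5322 recommends.

```python
import base64

MAX_LINE = 78  # recommended line length from RFC 5322, excluding CRLF


def prepend_prefix(raw_line: bytes, prefix: str, charset: str = "utf-8") -> bytes:
    """Prepend prefix to a raw header line without decoding its value.

    Hypothetical helper: the existing value is never inspected, so it
    may hold undecodable or unknown encodings and still pass through.
    """
    name, _, value = raw_line.partition(b":")
    value = value.lstrip(b" \t")
    try:
        # ASCII prefix: use it as-is, the separator is added below.
        encoded_prefix = prefix.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII prefix: encode it together with a trailing SP,
        # since whitespace between adjacent encoded-words is dropped
        # when they are decoded.
        payload = base64.b64encode((prefix + " ").encode(charset))
        encoded_prefix = b"=?" + charset.encode("ascii") + b"?b?" + payload + b"?="
    head = name + b": " + encoded_prefix
    # Only the first physical line of the old value can overflow.
    first_line = value.split(b"\r\n", 1)[0]
    if len(head) + 1 + len(first_line) > MAX_LINE:
        joiner = b"\r\n\t"  # fold: CR LF TAB
    else:
        joiner = b" "
    return head + joiner + value
```

Called as prepend_prefix(b"Subject: \xff\xfe junk", "[list]"), this 
leaves the junk bytes exactly as they arrived.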

Asking for the header value (or whole header) in Unicode should decode 
the chunks that are understandable and decodable, and leave the chunks 
that are not understandable as 
ASCII-converted-to-Unicode-but-still-possibly-weirdly-encoded ... I 
think that is what the RFCs encourage.
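
A sketch of that lenient decoding, using only the Python standard 
library.  The name decode_known_chunks is hypothetical, not an email 
package API.  For brevity it skips the RFC 2047 rule about dropping 
whitespace between adjacent encoded-words.

```python
import base64
import quopri
import re

# RFC 2047 encoded-word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def _decode_word(match):
    charset, encoding, payload = match.group(1), match.group(2).lower(), match.group(3)
    try:
        if encoding == "b":
            raw = base64.b64decode(payload, validate=True)
        else:
            # header=True makes "_" decode to a space, as RFC 2047 requires
            raw = quopri.decodestring(payload.encode("ascii"), header=True)
        return raw.decode(charset)
    except (LookupError, ValueError):
        # Unknown charset or bad payload: keep the chunk as it arrived.
        return match.group(0)


def decode_known_chunks(header_text: str) -> str:
    """Decode the encoded-words we understand, leave the rest untouched."""
    return ENCODED_WORD.sub(_decode_word, header_text)
```

A chunk with an unknown charset, such as =?x-no-such-charset?q?abc?=, 
comes back unchanged instead of raising.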

Asking for a header as bytes should return the wire data, if it is 
available, or an encoding of real data as wire data (like generate would 
do).  There is no Unicode that cannot be encoded to wire format, IIUC.  
Once non-ASCII characters are included, the encoder usually relies on 
heuristics that may produce a variety of differing results, but all of 
them should decode back to the original data.
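
A sketch of that bytes accessor, assuming a hypothetical HeaderValue 
holder (not an email package class) that remembers the wire bytes when 
the header was parsed.  The fallback uses the existing 
email.header.Header encoder with UTF-8, which is only one of the 
possible encodings.

```python
from email.header import Header, decode_header, make_header


class HeaderValue:
    """Hypothetical holder for a header value, for illustration only."""

    def __init__(self, text=None, wire=None):
        self.text = text  # Unicode value set by the application, if any
        self.wire = wire  # raw bytes as parsed from the wire, if any

    def as_bytes(self) -> bytes:
        if self.wire is not None:
            # Parsed input: hand back exactly what arrived.
            return self.wire
        # Application data: generate an ASCII-only wire form.
        return Header(self.text, "utf-8").encode().encode("ascii")


# Round trip: the generated wire form decodes back to the original text.
generated = HeaderValue(text="Grüße aus Köln").as_bytes()
restored = str(make_header(decode_header(generated.decode("ascii"))))
```

A different encoder could choose quoted-printable or another charset 
and produce different bytes, but decoding should still yield the same 
text.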

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


