[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 20:41:23 CET 2008

Sorry, this one scrolled off the top, and I didn't read it before 
sending my other reply.

On approximately 11/6/2008 9:02 AM, came the following characters from 
the keyboard of Barry Warsaw:
> On Nov 5, 2008, at 6:39 PM, Glenn Linderman wrote:
> 
>> This is an interesting perspective... "stuff em" does come to mind :)
>>
>> But I'm not at all clear on what you mean by a round-trip through the 
>> email module.  Let me see... if you are creating an email, you (1) 
>> should encode it properly (2) a round-trip is mostly meaningless, 
>> unless you send it to yourself.  So you probably mean email that is 
>> received, and that you want to send on.  In this case, there is 
>> already a composed/encoded form of the email in hand; it could simply 
>> be sent as is without decoding or re-encoding.  That would be quite a 
>> clean round-trip!
> 
> There are two ways to create an email DOM.  One is out of whole cloth 
> (i.e. creating Message objects and their subclasses, then attaching them 
> into a tree).  Note that it is a "generator" whose job it is to take the 
> DOM and produce an RFC-compliant flat textual representation.

I grok this one, but think that for the generator, keeping things in 
Unicode until the last minute could be useful.  Maybe not as useful as 
converting immediately to bytes, though, to reduce the amount of 
duplicated code.
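
For concreteness, a minimal sketch of that first path, assuming the
email package as it ships in current Python 3 (the addresses and text
are made up for illustration).  Everything stays str until the
generator flattens the tree to bytes at the very end:

```python
from io import BytesIO
from email.generator import BytesGenerator
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build the tree "out of whole cloth": a multipart root with one text part.
root = MIMEMultipart()
root["From"] = "sender@example.com"       # illustrative address
root["To"] = "recipient@example.com"      # illustrative address
root["Subject"] = "hello"
root.attach(
    MIMEText("Gr\u00fc\u00dfe, kept as str until flattening", "plain", "utf-8")
)

# The generator is the edge where the tree becomes flat bytes.
buf = BytesIO()
BytesGenerator(buf).flatten(root)
flat = buf.getvalue()
print(flat.decode("ascii"))   # the non-ASCII body was transfer-encoded
```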

> The other way to get a DOM is to parse some flat textual 
> representation.  In this case, it is a core design requirement that the 
> parser never throws an exception, and that there is a way to record and 
> retrieve the defects in a message.
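
A small sketch of that requirement, assuming the current Python 3 email
package (its `defects` attribute may differ from what was available
when this was written).  The input is deliberately broken, yet parsing
returns a message and the problems are recorded on it:

```python
import email

# Claims to be multipart but never defines a boundary parameter.
broken = (
    b"From: someone@example.com\n"
    b"Content-Type: multipart/mixed\n"
    b"\n"
    b"no boundary anywhere in here\n"
)

msg = email.message_from_bytes(broken)   # does not raise
for defect in msg.defects:
    # Each defect is an object describing one problem the parser found.
    print(type(defect).__name__)
```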

Sure, this makes sense.  My other message suggested keeping the message 
flat, and using cached pointers and lengths.  Of course, editing with 
such a technique could be a problem, because the pointers would have to 
be updated.  A MIME-mimicking tree of flat subchunks comes to mind...
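
To make the pointer problem concrete, here is a rough sketch of the
flat-buffer idea.  The `index_headers` helper is hypothetical, not
part of any library, and it ignores folded headers entirely:

```python
raw = b"From: a@example.com\nSubject: hi\n\nbody\n"

def index_headers(buf):
    """Map lowercased header name -> (start, length) within buf."""
    spans = {}
    pos = 0
    while pos < len(buf):
        end = buf.index(b"\n", pos)
        if end == pos:               # blank line: end of the header block
            break
        name = buf[pos:end].split(b":", 1)[0].decode("ascii").lower()
        spans[name] = (pos, end - pos)
        pos = end + 1
    return spans

spans = index_headers(raw)
start, length = spans["subject"]
subject_line = raw[start:start + length]

# Editing an earlier header changes the buffer length, so the cached
# span for Subject no longer points at the Subject line.
edited = b"From: someone-else@example.com" + raw[spans["from"][1]:]
stale = edited[start:start + length]
print(subject_line, stale)
```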

> The core model objects of Message (and their MIME subclasses) and Header 
> should treat everything internally as bytes.  The edges are where you 
> want to be able to accept varying types, but always convert to bytes 
> internally.  Edges of this system include the parser, the generator, and 
> various setter and getter methods of Message and Header.
> 
> The current code has a strong desire to be idempotent, so that 
> parser->DOM->generator output is exactly the same as input.  Small 
> changes to the DOM or content in between should have minimal effect.  
> For example, if you delete a header and then add it back, the header 
> will show up at the end of the RFC 2822 header list, but everything else 
> about the message will be unchanged.

Ah, this is your definition of idempotent!  Which is what I expected, 
but wasn't sure.
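
As I understand that definition, it can be sketched like this,
assuming the current Python 3 email package and a made-up, well-formed
message:

```python
import email

original = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: hello\n"
    b"\n"
    b"body text\n"
)

# parser -> DOM -> generator gives back the input unchanged.
msg = email.message_from_bytes(original)
round_tripped = msg.as_bytes()

# Delete a header and add it back: it moves to the end of the header
# list, and nothing else about the message changes.
del msg["To"]
msg["To"] = "b@example.com"
after_edit = msg.as_bytes()
print(after_edit.decode("ascii"))
```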

This is reasonable.  One _could_ even convince the header to show up in 
the original spot, if you keep a NULL header placeholder around for 
deleted headers.... that would vanish only when regenerating.
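
A toy sketch of the placeholder idea.  The `HeaderList` class is
purely hypothetical and is not how the email package stores headers:

```python
class HeaderList:
    """Ordered headers where deletion leaves a placeholder slot."""

    def __init__(self, pairs):
        # Each slot is [name, value]; value None marks a deleted header.
        self._slots = [[name, value] for name, value in pairs]

    def delete(self, name):
        for slot in self._slots:
            if slot[0].lower() == name.lower():
                slot[1] = None          # keep the slot, drop the value

    def set(self, name, value):
        for slot in self._slots:
            if slot[0].lower() == name.lower() and slot[1] is None:
                slot[1] = value         # reuse the original position
                return
        self._slots.append([name, value])

    def flatten(self):
        # Placeholders vanish only here, when regenerating.
        return "".join(
            f"{n}: {v}\n" for n, v in self._slots if v is not None
        )

headers = HeaderList([("From", "a"), ("To", "b"), ("Subject", "c")])
headers.delete("To")
without_to = headers.flatten()
headers.set("To", "b2")
with_to_again = headers.flatten()
print(with_to_again)
```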

> Currently idempotency is broken for defective messages.  The generator 
> is guaranteed to produce RFC-compliant output, repairing defects like 
> missing boundaries and such.

So it seems you are happy with this level of "fixing" things?
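
For reference, a sketch of the kind of repair in question, assuming the
current Python 3 email package.  The input below never closes its
multipart boundary, and the regenerated output does:

```python
import email

# A multipart message whose closing "--XX--" delimiter is missing.
damaged = (
    b'Content-Type: multipart/mixed; boundary="XX"\n'
    b"\n"
    b"--XX\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"part one\n"
)

msg = email.message_from_bytes(damaged)   # parses without raising
repaired = msg.as_bytes()                 # generator writes the close delimiter

print(b"--XX--" in damaged, b"--XX--" in repaired)
```

So the output is no longer byte-for-byte the input, which is exactly
the broken idempotency being described.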

>> I guess I'm not terribly concerned about the readability of improperly 
>> encoded email messages, whether they are spam or ham.  For the 
>> purposes of SpamBayes (which I assume is similar to spamassassin, only 
>> written in Python), it doesn't matter if the data is readable, only 
>> that it is recognizably similar.  So a consistent mis-transliteration 
>> is as good as a correct decoding.
> 
> The key thing is that parse should never ever raise an exception.  We've 
> learned the hard way that this is the most practical thing because at 
> the level most parsing happens, you really cannot handle any errors.

So you don't have a goal to make mangled, multi-character encodings 
suddenly be readable via the email lib?  Only to provide the data in raw 
form, so that Mr. Turnbull can implement that on top, in emacs?
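
A sketch of what "raw form" could look like, assuming the current
Python 3 email package.  The body holds a Latin-1 byte with no charset
declared; the library hands back the bytes, and any guessing about the
encoding is left to the layer on top:

```python
import email

# 0xE9 is not ASCII, and nothing in the headers says what it means.
mangled = (
    b"From: someone@example.com\n"
    b"Subject: test\n"
    b"\n"
    b"caf\xe9\n"
)

msg = email.message_from_bytes(mangled)    # no exception
raw_body = msg.get_payload(decode=True)    # bytes, not guessed text

# The caller's own guess, made outside the email package.
guess = raw_body.decode("latin-1")
print(repr(raw_body), repr(guess))
```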

>> For ham, the correspondent should be informed that there are problems 
>> with their software, so that they can upgrade or reconfigure it.
> 
> That's a practical impossibility in real-world applications, as is 
> simply discarding malformed messages.  Email sucks.

I agree it is impossible to do that automatically.  But if a 
correspondent suddenly gets broken software, I attempt to inform them of 
that... and as long as their email address comes through, I can...

And I don't think I've ever proposed discarding malformed messages; just 
transliterating them in some way that (drum roll) doesn't cause 
exceptions...

Sorry, I wrote a bit before looking at the API, which is more robust 
than I expected from Mr. Turnbull's writings.  I am curious which API 
deficiencies have been identified... is there a list somewhere?

My summary tried to be a start on that, or an augmentation.  Seems I 
tried to get to bug# last night, but the 'net wasn't responsive.  Can't 
find the number now, in a quick look through the messages in this thread.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking