[Email-SIG] API for email threading library?
Barry Warsaw
barry at python.org
Fri Jan 6 02:21:08 CET 2012
On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:
>Folks, I'm working on an implementation of RFC 5256 email threading,
>designed so that it could fit as a submodule in the "email" package, if
>such a thing was ever seen to be useful.
I really like the idea of threading support being included in the email
package. (I admit that I don't have time right now to read the RFC.) My
general thoughts are that the actual messages needn't be included in the
thread collection, but perhaps just Message-IDs. That would allow an
application to store the actual message objects anywhere they want, and would
reduce space requirements of the thread collection.
>I'd like to ask "the wisdom of the crowd" what they think an appropriate
>interface to such a thing would be? The basic operation is that you
>create a collection (type C) of email threads (type T) by passing a set
>of messages (type M) to the constructor.
>
>* Should M be required to be "email.message.Message", or perhaps some
> less restrictive type, say "ThreadableMessageAPI"? All that's
> strictly required is the ability to retrieve the Message-ID, Subject,
> Date, References, and In-Reply-To fields.
I think it would be fine then to allow duck-typing of the input objects. I
don't have a sense of whether it needs a formal (as in Python's ABCs)
interface type.
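As a rough illustration of the duck-typing idea, here is a minimal sketch. The helper name threading_headers and the THREADING_HEADERS tuple are hypothetical, not part of the email package; the only assumption is that the input has a mapping-style get() method, which email.message.Message and a plain dict both provide.

```python
from email.message import Message

# The five headers the original post lists as strictly required.
THREADING_HEADERS = ("Message-ID", "Subject", "Date", "References", "In-Reply-To")


def threading_headers(msg):
    """Collect the threading headers from any object with a get() method.

    Headers that are absent come back as None.
    """
    return {name: msg.get(name) for name in THREADING_HEADERS}


# A real Message instance works...
m = Message()
m["Message-ID"] = "<a@example.com>"
m["Subject"] = "hello"
print(threading_headers(m))

# ...and so does a plain dict, with no formal interface type involved.
print(threading_headers({"Message-ID": "<b@example.com>"}))
```

Whether an ABC is worth adding on top of this is exactly the open question; the sketch only shows that the informal contract is small.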
>* What operations should be possible on C? Some that come to mind:
>
> * retrieve_thread (M or message-id) => T
Message-ID as input.
> * add_message (M) => T
Duck-typed message.
> * add_messages (set of M) => None
> * remove_message (M or message-id) => T (or None) ?
Probably Message-ID as the input. I guess the rule would be that if you need
all the headers you mention above, a duck-typed message would be required.
For operations that only need the Message-ID, just accept that.
And you probably want the full Message-ID header value, i.e. including
the angle brackets.
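To make the split between "needs a message" and "needs only a Message-ID" concrete, here is a hedged sketch of such a collection. The class name ThreadCollection and its internals are hypothetical. The parent lookup (In-Reply-To, else the last References entry) is a deliberate simplification and not the RFC 5256 algorithm, and the sketch assumes the reply links contain no cycles.

```python
class ThreadCollection:
    """Hypothetical sketch: stores only Message-IDs, never message objects."""

    def __init__(self):
        self._parent = {}    # message-id -> parent message-id, or None
        self._children = {}  # message-id -> list of child message-ids

    def add_message(self, msg):
        """Needs several headers, so it takes a duck-typed message."""
        mid = msg.get("Message-ID")
        refs = (msg.get("References") or "").split()
        in_reply_to = (msg.get("In-Reply-To") or "").strip()
        # Simplified parent choice; NOT the full RFC 5256 procedure.
        parent = in_reply_to or (refs[-1] if refs else None)
        self._parent[mid] = parent
        self._children.setdefault(mid, [])
        if parent is not None:
            # A parent we have not seen yet acts as a dummy node.
            self._children.setdefault(parent, []).append(mid)
        return self.retrieve_thread(mid)

    def _root(self, message_id):
        # Assumes no cycles in the reply links.
        while self._parent.get(message_id) is not None:
            message_id = self._parent[message_id]
        return message_id

    def retrieve_thread(self, message_id):
        """Needs only the id. Returns nested (id, (subtrees...)) tuples."""
        def build(mid):
            kids = self._children.get(mid, [])
            return (mid, tuple(build(kid) for kid in kids))
        return build(self._root(message_id))


threads = ThreadCollection()
threads.add_message({"Message-ID": "<1@x>"})
tree = threads.add_message({"Message-ID": "<2@x>", "In-Reply-To": "<1@x>"})
print(tree)
```

The keys keep their angle brackets, so an application can use the same strings to look up its own message objects.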
>* What's the interface for T? It's a tree with possible dummy nodes, so
> a tuple of messages plus nested tuples would do it. What should the
> nodes in the tree be? Normalized (see RFC 5256) Message-IDs?
> email.message.Message instances?
Will the tree get mutated when a message is added in the middle of a thread,
or will you generate a new tree? That would make a difference for
tuple-of-tuples or list-of-lists.
I think the nodes would be Message-IDs, but you'd need a public API for
normalizing them, and my application would have to make sure that my messages
are normalized (or at least the lookup keys are) or I might not be able to
find a message given its normalized id. OTOH, maybe the message parser or
message object itself should provide an API for normalizing ids?
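For the sake of discussion, a public normalizing helper might look something like the sketch below. The function name normalize_message_id is hypothetical, and the rule it applies (take the first angle-bracketed token and drop surrounding whitespace) is an assumed simplification, not the normalization that RFC 5256 specifies.

```python
import re

# One angle-bracketed token with no embedded whitespace or brackets.
_MSGID = re.compile(r"<[^<>\s]+>")


def normalize_message_id(value):
    """Return the first <...> token in a header value, or None if absent.

    Simplified stand-in for real normalization; the angle brackets are kept.
    """
    match = _MSGID.search(value or "")
    return match.group(0) if match else None


print(normalize_message_id("  <abc@example.com>  "))
print(normalize_message_id("no id here"))
```

If the collection and the application both call the same helper before storing or looking up a key, the two sides cannot drift apart.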
Let's think about some use cases.
- given any message, find the entire thread it's a part of
- given a message, find all children
- given a message, find a path to the root of the thread
- find the parts of the thread that fall within a date range
- find the parts of a thread with a matching subject
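The first three use cases can be sketched over nothing more than a map of parent links. Everything here (the sample parent dict and the function names) is hypothetical, and the linear scans are for clarity, not for millions of messages.

```python
# Hypothetical parent links: message-id -> parent message-id (None at the root).
parent = {
    "<1@x>": None,
    "<2@x>": "<1@x>",
    "<3@x>": "<2@x>",
    "<4@x>": "<1@x>",
}


def children(message_id):
    """Given a message, find all of its direct children."""
    return sorted(m for m, p in parent.items() if p == message_id)


def path_to_root(message_id):
    """Given a message, find the path from it up to the thread root."""
    path = [message_id]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def whole_thread(message_id):
    """Given any message, find every message sharing its root."""
    root = path_to_root(message_id)[-1]
    return sorted(m for m in parent if path_to_root(m)[-1] == root)


print(children("<1@x>"))
print(path_to_root("<3@x>"))
print(whole_thread("<4@x>"))
```

The date-range and subject use cases need more than ids, so they would have to consult whatever store holds the headers.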
>* For large sets of threads (millions of messages) a persistence
> mechanism would be useful. Should there be a standard interface to
> such a mechanism, perhaps as class methods on C? If so, what should
> it look like? Should the implementation contain a default persistent
> subclass of C, based on sqlite3? What side-effects would persistence
> requirements have on the other design considerations? For instance,
> would you have to save the entire text of a message for each node?
> Just the headers? Just some of the headers? Just the Message-ID?
Great questions. We've long talked about a persistence mechanism for message
parts (e.g. store the big binary parts on disk instead of in memory). Some
consistency of design would be good here. But I agree that persistence should
definitely be part of the story, and it needs to be pluggable.
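One way to picture "pluggable" is two interchangeable stores behind the same two methods. This is only a sketch: the class names, the set_parent/get_parent methods, and the links table are all hypothetical, and it persists just the Message-ID links, which is one of the options raised in the question above.

```python
import sqlite3


class DictParentStore:
    """In-memory backend: message-id -> parent message-id."""

    def __init__(self):
        self._links = {}

    def set_parent(self, msgid, parent):
        self._links[msgid] = parent

    def get_parent(self, msgid):
        return self._links.get(msgid)


class SqliteParentStore:
    """sqlite3 backend with the same two methods (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS links "
            "(msgid TEXT PRIMARY KEY, parent TEXT)"
        )

    def set_parent(self, msgid, parent):
        # The connection as a context manager commits on success.
        with self._db:
            self._db.execute(
                "INSERT OR REPLACE INTO links (msgid, parent) VALUES (?, ?)",
                (msgid, parent),
            )

    def get_parent(self, msgid):
        row = self._db.execute(
            "SELECT parent FROM links WHERE msgid = ?", (msgid,)
        ).fetchone()
        return row[0] if row else None


# The calling code does not care which backend it was handed.
results = []
for store in (DictParentStore(), SqliteParentStore()):
    store.set_parent("<2@x>", "<1@x>")
    results.append((store.get_parent("<2@x>"), store.get_parent("<9@x>")))
print(results)
```

Passing a file path instead of ":memory:" to the sqlite3 variant would keep the links across runs; storing headers or whole messages would need a wider schema.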
Have to think more about this, but a big +1 for the idea. It would serve as a
very good component for the ideas I have about a next generation email
archiver.
-Barry