[Email-SIG] API for email threading library?

Fri Jan 6 03:49:00 CET 2012

Thanks for the feedback, Barry.

Barry Warsaw <barry at python.org> wrote:

> On Jan 05, 2012, at 09:55 AM, Bill Janssen wrote:
> 
> >Folks, I'm working on an implementation of RFC 5256 email threading,
> >designed so that it could fit as a submodule in the "email" package, if
> >such a think was ever seen to be useful.
> 
> I really like the idea of threading support being included in the email
> package.  (I admit that I don't have time right now to read the RFC.)

It basically defines two kinds of threading for IMAP:  ORDEREDSUBJECT,
which is "poor man's threading" using "Subject" and "Date" headers, and
REFERENCES, which is JWS threading a la Netscape using "References" and
"In-Reply-To" headers.  I intend to support both.

> My general thoughts are that the actual messages needn't be included in the
> thread collection, but perhaps just Message-IDs.  That would allow an
> application to store the actual message objects anywhere they want, and would
> reduce space requirements of the thread collection.

We need "Subject", "Date", and either "References" or "In-Reply-To", in
addition to "Message-ID", in order to add a new message to the thread
DB.  I was planning to use a struct with slots containing hashes of the
value of each of these as the internal node structure in the thread-set
instance.

If the message objects were available on-demand (perhaps via a weakref
or via a Message-ID to message mapping), we could save only a pointer to
the message.  Perhaps a "retrieve-message-by-message-id" callback object
should be a parameter to the constructor.

> >I'd like to ask "the wisdom of the crowd" what they think an appropriate
> >interface to such a thing would be?  The basic operation is that you
> >create a collection (type C) of email threads (type T) by passing a set
> >of messages (type M) to the constructor.
> >
> >* Should M be required to be "email.message.Message", or perhaps some
> >  less restrictive type, say "ThreadableMessageAPI"?  All that's
> >  strictly required is the ability to retrieve the Message-ID, Subject,
> >  Date, References, and In-Reply-To fields.
> 
> I think it would be fine then to allow duck-typing of the input objects.  I
> don't have a sense of whether it needs a formal (as in Python's ABCs)
> interface type.

I prefer an ABC as documentation of the duck-typing requirements.  I'm
thinking a subtype of email.message.Message would be good -- basically
adding the contraint that the "Message-ID", "Subject", "Date", and
"References" (or "In-Reply-To" headers) be set, but not requiring any
payload.

> >* What operations should be possible on C?  Some that come to mind:
> >
> >  * retrieve_thread (M or message-id) => T
> 
> Message-ID as input.
> 
> >  * add_message (M) => T
> 
> Duck-typed message.
> 
> >  * add_messages (set of M) => None
> >  * remove_message (M or message-id) => T (or None) ?
> 
> Probably Message-ID as the input.  I guess the rule would be that if you need
> all the headers you mention above, a duck-typed message would be required.

For "add", but not "remove".

> For operations that only need the Message-ID, just accept that.

Sure.  Either/or.

> And you probably want the full Message-ID header value, e.g. it would include
> the angle brackets.

Easier to get at.

> >* What's the interface for T?  It's a tree with possible dummy nodes, so
> >  a tuple of messages plus nested tuples would do it.  What should the
> >  nodes in the tree be?  Normalized (see RFC 5256) Message-IDs?
> >  email.message.Message instances?
> 
> Will the tree get mutated when a message is added in the middle of a thread,
> or will you generate a new tree?  That would make a difference for
> tuple-of-tuples or list-of-lists.

It would be mutated internally, but the thread given back to callers
would be an immutable copy of the internal tree.  I was thinking that
the returned thread would be a fresh tuple containing (<root>
<children>...), where <root> is a message ID, and each <children> is a
fresh tuple of the same form as <msgs>.

Thus the tree

  A --+-- B
      |
      +-- C --+-- D
              |
              +-- E
              |
              +-- F

would look like this:

  ('A at example.com'
     ('B at example.com')
     ('C at example.com'
        ('D at example.com')
        ('E at example.com')
        ('F at example.com'))))

though perhaps

  ('A at example.com'
     'B at example.com'
     ('C at example.com'
        'D at example.com'
        'E at example.com'
        'F at example.com'))

would be more efficient -- each child is either a singleton represented
by a string message-id, or a tuple of a reply plus its children.

> I think the nodes would be Message-IDs, but you'd need a public API for
> normalizing them, and my application would have to make sure that my messages
> are normalized (or at least the lookup keys are) or I might not be able to
> find a message given its normalized id.  OTOH, maybe the message parser or
> message object itself should provide an API for normalizing ids?

The normalization of the Message-ID in RFC 5256 refers to the optional
quoting allowed in RFC 2822, in which '<"01KF8JCEOCBS0045PS"@xxx.yyy.com>'
and '<01KF8JCEOCBS0045PS at xxx.yyy.com>' and
and '<"01KF8JCEOCBS0045PS"@[xxx.yyy.com]>' and
'<01KF8JCEOCBS0045PS@[xxx.yyy.com]>' are all the same message ID, the
normalized form of which is '01KF8JCEOCBS0045PS at xxx.yyy.com'.

Might be useful to have a method or property on email.message.Message to
retrieve this value.

I'd certainly want to normalize any message-IDs passed in as keys.

> Let's think about some use cases.
> 
> - given any message, find the entire thread it's a part of
> - given a message, find all children
> - given a message, find a path to the root of the thread
> - find the parts of the thread that fall within a date range

Interesting, hadn't thought about that one.  Good idea.

> - find the parts of a thread with a matching subject

Hmmm.  Using ORDEREDSUBJECT, all of the parts of a thread have the same
"base subject" -- which is another thing defined in RFC 5256.  It's
basically the subject of the message with any "Re:" or "Fwd:" or
"[mailman-listname]" stuff trimmed off.  The ORDEREDSUBJECT algorithm
basically collects all messages with the same "base subject" and sorts
them by date.

"Base subject" would be another good thing to add to email.util or
email.message.Message, by the way.

In the REFERENCES algorithm, threads with the same base subject are
merged, but I suppose threads where someone replied to an earlier
message, but with a different subject line, would allow multiple base
subjects per thread.  Perhaps such threads should be split apart?

> >* For large sets of threads (millions of messages) a persistence
> >  mechanism would be useful.  Should there be a standard interface to
> >  such a mechanism, perhaps as class methods on C?  If so, what should
> >  it look like?  Should the implementation contain a default persistent
> >  subclass of C, based on sqlite3?  What side-effects would persistence
> >  requirements have on the other design considerations?  For instance,
> >  would you have to save the entire text of a message for each node?
> >  Just the headers?  Just some of the headers?  Just the Message-ID?
> 
> Great questions.  We've long talked about a persistence mechanism for message
> parts (e.g. store the big binary parts on disk instead of in memory).  Some
> consistency of design would be good here.  But I agree that persistence should
> definitely be part of the story, and it needs to be plugable.
> 
> Have to think more about this, but a big +1 for the idea.  It would serve as a
> very good component for the ideas I have about a next generation email
> archiver.

Yes, I intend to use it for UpLib (http://uplib.parc.com/), which is
what I use to archive my many years of email.  But I thought it would be
more generally useful for others if I wrote it to work with the more
stdlib email package.

Bill