[Email-SIG] API for email threading library?
Bill Janssen
janssen at parc.com
Wed Jan 11 20:00:39 CET 2012
Here's what I've got so far. Comments would be appreciated.
Bill
======================================================================
This module implements email threading per RFC 5256.
It provides four classes: ThreadableObjectStore, MailboxStore,
ReferencesSet, and OrderedSubjectSet.
To use it, you need to provide it with a "mailstore", and a set of
messages to thread. The mailstore must be a subclass of the
abstract class ThreadableObjectStore; an implementation of a
ThreadableObjectStore for mailbox.Mailbox is provided, as the class
MailboxStore. Four methods must be implemented for a new
ThreadableObjectStore subclass:
  tos_get_message_id (msg or message ID) => message ID
    where the message ID is an immutable value that must be unique in
    that ThreadableObjectStore context, and the msg can be whatever
    that ThreadableObjectStore considers a message.
  tos_get_subject (msg or message ID) => subject
    where the subject is the subject of the message, or None.
  tos_get_date (msg or message ID) => timestamp
    where the timestamp is the date and time of the message, expressed
    as a standard Python time.time() value.
  tos_get_references (msg or message ID) => sequence of message ID
    where the references are a sequence of message IDs, arranged in
    order as per RFC 5322. These message IDs must be in the same
    format as the message ID returned by tos_get_message_id().
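As an illustration of the four required methods, here is a minimal sketch of a store backed by plain dicts. The ThreadableObjectStore class below is only a stand-in for the abstract base class described above, and DictStore, its dict keys, and the example.com message IDs are hypothetical names invented for this example.

```python
from email.utils import parsedate_tz, mktime_tz


class ThreadableObjectStore(object):
    """Stand-in for the abstract base class (an assumption, not the real one)."""

    def tos_get_message_id(self, msg_or_id):
        raise NotImplementedError

    def tos_get_subject(self, msg_or_id):
        raise NotImplementedError

    def tos_get_date(self, msg_or_id):
        raise NotImplementedError

    def tos_get_references(self, msg_or_id):
        raise NotImplementedError


class DictStore(ThreadableObjectStore):
    """Hypothetical store whose messages are plain dicts."""

    def __init__(self, messages):
        # Index the messages by ID so that lookups by ID are possible.
        self._by_id = dict((m["id"], m) for m in messages)

    def _lookup(self, msg_or_id):
        # Accept either a message (a dict here) or a message ID.
        if isinstance(msg_or_id, dict):
            return msg_or_id
        return self._by_id[msg_or_id]

    def tos_get_message_id(self, msg_or_id):
        return self._lookup(msg_or_id)["id"]

    def tos_get_subject(self, msg_or_id):
        return self._lookup(msg_or_id).get("subject")

    def tos_get_date(self, msg_or_id):
        # Convert an RFC 5322 date string to a time.time()-style timestamp.
        return mktime_tz(parsedate_tz(self._lookup(msg_or_id)["date"]))

    def tos_get_references(self, msg_or_id):
        return tuple(self._lookup(msg_or_id).get("references", ()))


store = DictStore([
    {"id": "a@example.com", "subject": "Hello",
     "date": "Wed, 11 Jan 2012 19:00:39 +0000", "references": []},
    {"id": "b@example.com", "subject": "Re: Hello",
     "date": "Wed, 11 Jan 2012 20:00:39 +0100",
     "references": ["a@example.com"]},
])
```

Returning the references as a tuple keeps them ordered and immutable, which seems to fit the "sequence of message ID" contract.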
The base ThreadableObjectStore class also provides a class method to
compute the RFC 5256 "base subject":
  ThreadableObjectStore.tos_base_subject (subject text) => \
        subject, is_reply_or_forward
    Takes a standard Subject: header value, and returns the "base
    subject" for it, along with a boolean flag indicating whether the
    supplied subject indicated a reply to or forward of the original
    subject.
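To make the return value concrete, here is a simplified sketch of base subject extraction. It is not the module's implementation and covers only part of the RFC 5256 rules (leading "Re:"/"Fw:"/"Fwd:" markers, leading "[blob]" tags, and a trailing "(fwd)"); the function name base_subject and the regular expressions are assumptions made for this example.

```python
import re

# Leading "Re:", "Fw:" or "Fwd:", optionally with a "[blob]" before the colon.
_REPLY_PREFIX = re.compile(r"^(?:re|fwd?)\s*(?:\[[^\]]*\])?\s*:\s*",
                           re.IGNORECASE)
# Trailing "(fwd)" marker.
_FWD_TRAILER = re.compile(r"\s*\(fwd\)\s*$", re.IGNORECASE)
# Leading "[blob]", such as a mailing list tag.
_BLOB_PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def base_subject(subject):
    """Return (base subject, is_reply_or_forward) for a Subject: value."""
    text = " ".join((subject or "").split())  # collapse runs of whitespace
    is_reply_or_forward = False
    changed = True
    while changed:  # repeat until no rule strips anything more
        changed = False
        m = _FWD_TRAILER.search(text)
        if m:
            text = text[:m.start()]
            is_reply_or_forward = changed = True
        m = _REPLY_PREFIX.match(text)
        if m:
            text = text[m.end():]
            is_reply_or_forward = changed = True
        m = _BLOB_PREFIX.match(text)
        # Only drop a leading blob if some subject text remains after it.
        if m and text[m.end():]:
            text = text[m.end():]
            changed = True
    return text, is_reply_or_forward
```

With this sketch, the subject of a reply to this very post would reduce to "API for email threading library?" with the flag set to True.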
To develop a set of threads, you then instantiate either ReferencesSet
(the JWZ algorithm from Netscape, formalized in RFC 5256), or
OrderedSubjectSet (the "same subjects" algorithm, aka "poor man's
threading"), both subclasses of the abstract class ThreadSet. Each
constructor takes a ThreadableObjectStore instance and optionally a
set of messages to use for the initial threads. If provided, those
messages are analyzed into a set of threads. The threadset is
iterable; the iteration is over the threads it contains.
An instance of ThreadSet provides the following methods:
  add (msg or message ID) => thread
    add another message from the mailstore to the thread set, where
    "thread" is an object which has the attributes "message_id" (a
    string) and "children" (an ordered list of sub-threads), and is
    the root of the thread tree for that msg.
  remove (msg or message ID) => thread
    remove a message from the thread set, where "thread" is as for
    "add()", but may additionally be 'None' if the message was not in
    a thread, or was the only message in the thread.
  thread (msg or message ID) => thread
    obtain the thread containing the specified message, if any,
    where "thread" is as for "add()", or 'None' if no thread for
    that message exists.
  subject_threads (subject regexp) => set of thread
    obtain the threads where the base subject of the thread matches
    the specified regular expression, where "regexp" is a textual or
    compiled regular expression, and the return value is a set of
    threads. Note that subject comparisons are case-insensitive;
    compiled regexps must use the re.IGNORECASE flag.
  date_threads (starting time, ending time, root_only=False) => set of thread
    obtain the set of threads containing any messages between
    the two timestamps. Timestamps are time.time() timestamps;
    the starting time may be given as 'None' to mean the start of
    time, and the ending time as 'None' to mean the distant future.
    If "root_only" is true, only the dates of the root messages of
    each thread are considered; threads with no root message (a
    subject forest) will always fail to match in this case.
  __contains__ (msg or message ID) => boolean
    present to support the "in" operator.
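Since every method above hands back the root of a thread tree, a consumer mostly needs to walk "message_id" and "children". The following is a minimal sketch of such a walk; the Thread class is a stand-in with only the two documented attributes, and walk() and the example.com IDs are hypothetical.

```python
class Thread(object):
    """Stand-in for the thread objects described above (an assumption)."""

    def __init__(self, message_id, children=()):
        self.message_id = message_id
        self.children = list(children)  # ordered list of sub-threads


def walk(thread, depth=0):
    """Yield (depth, message_id) pairs in depth-first order."""
    yield depth, thread.message_id
    for child in thread.children:
        for item in walk(child, depth + 1):
            yield item


root = Thread("a@example.com", [
    Thread("b@example.com", [Thread("d@example.com")]),
    Thread("c@example.com"),
])

# Collect an indented outline of the thread, one entry per message.
outline = list(walk(root))
```

Printing each entry with two spaces of indentation per depth level would give the usual threaded view of a conversation.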
Support for persistence is provided with an instance method
"to_external_form" and a class method "from_external_form" on thread
sets. Calling "to_external_form" on a thread set instance will
generate a set of tree-structured nested tuples, where each tuple
consists of an optional message ID followed by zero or more child
tuples. The class method "from_external_form", available on both
ReferencesSet and OrderedSubjectSet, takes a ThreadableObjectStore
instance and an externalized thread set value, and creates and returns
a new thread set instance initialized to that set of threads.
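The nested-tuple form might look like the sketch below. How an absent message ID is represented is not stated above, so using None in the first position is an assumption, as are the helper message_ids() and the example.com IDs.

```python
import ast


def message_ids(node):
    """Yield every message ID found in one externalized thread tuple."""
    head, children = node[0], node[1:]
    if head is not None:  # assumed marker for "no message at this node"
        yield head
    for child in children:
        for mid in message_ids(child):
            yield mid


# Two threads: one with a root message, one without (assumed layout).
external = (
    ("a@example.com",
     ("b@example.com", ("d@example.com",)),
     ("c@example.com",)),
    (None, ("x@example.com",), ("y@example.com",)),
)

all_ids = [mid for thread in external for mid in message_ids(thread)]

# Tuples of strings and None survive a repr()/literal_eval() round trip,
# which is one simple way to store such a value as text.
restored = ast.literal_eval(repr(external))
```

Because the value consists only of tuples, strings, and None under these assumptions, it could equally be pickled or written out as text.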
MailboxStore is a subclass of ThreadableObjectStore designed to
wrap mailboxes (subclasses of mailbox.Mailbox). For instance,
  >>> import mailbox
  >>> mbox = mailbox.mbox("foo.mbox")
  >>> mboxstore = MailboxStore(mbox)
  >>> threadset = ReferencesSet(mboxstore, mbox.itervalues())
will produce a thread set for all the messages in the mbox-format
mailbox 'foo.mbox', using the REFERENCES threading algorithm.
MailboxStore also provides a static method to compute the normalized
form of a message ID (the message ID stripped of <> angle brackets,
and various quoted parts unquoted):
  MailboxStore.normalize_message_id (message ID) => message ID
    Take a standard RFC 5322 message ID string and return the
    normalized form of it.
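A rough sketch of what such a normalization could do, assuming it only needs to strip surrounding whitespace and angle brackets and unquote a quoted left-hand side. This is not the MailboxStore implementation; the standalone function below is hypothetical, and it relies on email.utils.unquote from the standard library.

```python
from email.utils import unquote


def normalize_message_id(message_id):
    """Return message_id without <> brackets and with the local part unquoted."""
    text = unquote(message_id.strip())  # drops a surrounding <...> pair
    local, sep, domain = text.rpartition("@")
    if sep:
        # Unquote a quoted left-hand side such as "a b"@example.com.
        text = unquote(local) + "@" + domain
    return text
```

Normalizing IDs this way before comparing them means that a Message-ID header and an entry in a References header can match even when they differ in bracketing or quoting.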