[Mailman-Developers] Google Summer of Code: Integration of Search Code

Bill Janssen janssen at parc.com
Thu Mar 29 03:06:31 CEST 2012


Stephen J. Turnbull <stephen at xemacs.org> wrote:

> On Wed, Mar 28, 2012 at 4:21 AM, Terri Oda <terri at zone12.com> wrote:
> 
> >> Looks like archiver for mm3 is still in development stage. As far as I
> >> understand searcher depends on the archiver, right? Not completely but it
> >> somewhat depends on archiver. I am not sure if searcher can be implemented
> >> without archiver. If possible I can implement for mm3 also.
> >
> > Searcher and archiver are interdependent *if* we want to share caches and
> > data stores, which we probably do for any installation with larger archives
> > where storing 2 copies vs 4 of each message would make a difference.  Plus,
> > many archive views may be basically searches: "messages in the last month",
> > "messages which are replies to messageid $foo", etc.
> 
> Actually, as far as I can see, the summary/search/index/retrieval
> functions depend only on the API for the message store.  If you
> want, you can split this into the database layer and a presentation
> layer, of course.  However, the database layer is surely going to
> have its own schema optimized for the kinds of retrieval its
> designer considers important.  If the designer emphasizes
> threads, however, she is *not* going to try to store messages in
> thread order or anything like that.  Rather, any reasonable store
> will be message-ID-addressable.

Right.  UpLib has a 'message-store', which the threading code interacts
with to generate threads as data referring to document IDs.  The
message-store API can take either message-IDs or UpLib document IDs and
resolve them.
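
As a rough illustration of that kind of lookup (this is not UpLib's
actual API; the class name, method names, and ID formats are made up for
the example), a store that accepts either identifier and always returns
a list, so that Message-ID collisions and missing Message-IDs are
survivable, might look like this:

```python
from collections import defaultdict


class MessageStore:
    """Hypothetical message store, not UpLib's real interface."""

    def __init__(self):
        self._by_doc_id = {}                      # doc ID -> stored message
        self._by_message_id = defaultdict(list)   # Message-ID -> [doc IDs]

    def add(self, doc_id, message_id, body):
        """Store a message; message_id may be None for legacy mail."""
        self._by_doc_id[doc_id] = body
        if message_id:
            self._by_message_id[message_id].append(doc_id)

    def resolve(self, key):
        """Return the doc IDs matching key (a doc ID or a Message-ID).

        The result is a list because distinct messages can share a
        Message-ID; an unknown key gives an empty list.
        """
        if key in self._by_doc_id:
            return [key]
        return list(self._by_message_id.get(key, []))
```

Callers then never have to care which kind of ID they were handed, and a
threading layer can store plain doc IDs.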

> The only tricky issue is that we *do* have to worry about
> message-ID collisions of truly different messages and about
> messages without message IDs, especially for converted
> historical archives.  So the API needs to be able to deal
> with these issues, probably by returning a set or sequence
> of messages.

Right.  UpLib takes a message and creates multiple 'documents' (one for
the message, and one for each attachment), each of which has its own
unique 'doc ID', the assigned UpLib ID.  In addition, the email is
assigned a 'mail-guid', which is calculated from some of the header
information and may also include the doc ID.  The metadata of each
attachment refers back to the 'mail-guid' of the message it was part of.

Message-ID, mail-guid, and document ID are all separately indexed for
each document, and any of them can be searched on.
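
A minimal sketch of the idea, with loud caveats: the header set and the
SHA-1 hash below are assumptions chosen for illustration, not the actual
mail-guid recipe, and `DocumentIndex` is a hypothetical name.  It shows
a guid derived from header fields (optionally mixed with the doc ID) and
three separate indexes that can each be searched:

```python
import hashlib
from collections import defaultdict
from email import message_from_string

# Assumed header choice; the real mail-guid recipe is not specified here.
GUID_HEADERS = ("Message-ID", "Date", "From", "Subject")


def mail_guid(msg, doc_id=None):
    """Derive a stable identifier from header fields (and maybe the doc ID)."""
    parts = [str(msg.get(name, "")) for name in GUID_HEADERS]
    if doc_id is not None:
        parts.append(doc_id)
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()


class DocumentIndex:
    """Hypothetical index keyed separately on the three identifiers."""

    FIELDS = ("message-id", "mail-guid", "doc-id")

    def __init__(self):
        self._index = {field: defaultdict(set) for field in self.FIELDS}

    def add(self, doc_id, msg):
        """Index one message document and return its mail-guid."""
        guid = mail_guid(msg)
        self._index["doc-id"][doc_id].add(doc_id)
        self._index["mail-guid"][guid].add(doc_id)
        message_id = msg.get("Message-ID")
        if message_id:  # legacy mail may have no Message-ID at all
            self._index["message-id"][str(message_id)].add(doc_id)
        return guid

    def search(self, field, value):
        """Return the set of doc IDs whose `field` equals `value`."""
        return set(self._index[field].get(value, ()))
```

Two different messages that happen to share a Message-ID still get
different guids here, as long as their other hashed headers differ.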

> Oh, and we probably ought to have a more general notion
> of retrievable "object" rather than just messages, as some
> archive/retrieval backends may store some types of MIME
> part separately.  Hopefully these would be presented to
> us as MIME parts with external bodies and content IDs.

Here's how I do it.  In UpLib, a multipart email is analyzed into a
message plus possible attachments.  The parts that are the 'message' are
unified and presented as a document.  The parts that are attachments are
broken out and processed as independent documents, and iconified links
to them are then put back into the 'message' document.
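
The split can be sketched with Python's standard email package.  This is
only an illustration under stated assumptions, not UpLib's code: the
dict layout, the `split_message` and `example_message` names, and the
use of random UUIDs as doc IDs are all hypothetical, and the "links" are
just a list of attachment doc IDs rather than rendered icons.

```python
import uuid
from email.message import EmailMessage


def split_message(msg, guid):
    """Split an EmailMessage into one message doc plus attachment docs."""
    body_part = msg.get_body(preferencelist=("plain", "html"))
    text = body_part.get_content() if body_part is not None else ""
    message_doc = {
        "doc_id": uuid.uuid4().hex,   # hypothetical doc ID scheme
        "mail_guid": guid,
        "text": text,
        "attachment_links": [],       # doc IDs standing in for icon links
    }
    attachment_docs = []
    for part in msg.iter_attachments():
        doc = {
            "doc_id": uuid.uuid4().hex,
            "mail_guid": guid,        # refers back to the parent message
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            "content": part.get_content(),
        }
        attachment_docs.append(doc)
        message_doc["attachment_links"].append(doc["doc_id"])
    return message_doc, attachment_docs


def example_message():
    """Build a plain-text message with one PDF attachment for testing."""
    msg = EmailMessage()
    msg["Subject"] = "paper attached"
    msg["From"] = "sender@example.org"
    msg["To"] = "list@example.org"
    msg.set_content("See the attached PDF.")
    msg.add_attachment(b"%PDF-1.4 placeholder bytes", maintype="application",
                       subtype="pdf", filename="paper.pdf")
    return msg
```

Mail parsed from disk needs `policy=email.policy.default` passed to the
parser so that it yields `EmailMessage` objects with `get_body()` and
`iter_attachments()`.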

See http://uplib.parc.com/misc/noguchi.png for an example of the UpLib
reader, ReadUp, showing a plain-text email with an attached PDF file.
Most of the things that can be links there (like "Reply" or the email
addresses or my name or the URL or the attachment icon and name) are in
fact links.

> And that's all we want to say about the archiver and the
> associated message-retrieval logic, I think.  (In fact, it occurs to
> me that maybe we should say "RFC 3501" and be done with
> it.  I don't mean that we necessarily implement IMAP protocol
> per se, but some subset of its functionality probably is what we
> need from an archiver.)

Yes, there's an IMAP server that runs in UpLib, and can export any
document via IMAP (including archived email).  It currently doesn't
scale well, though; I need to rewrite it with Tornado, too.

Bill

