[Mailman-Developers] Google Summer of Code: Integration of Search Code

Wed Mar 28 03:29:45 CEST 2012

On Wed, Mar 28, 2012 at 4:21 AM, Terri Oda <terri at zone12.com> wrote:

>> Looks like archiver for mm3 is still in development stage. As far as I
>> understand searcher depends on the archiver, right? Not completely but it
>> somewhat depends on archiver. I am not sure if searcher can be implemented
>> without archiver. If possible I can implement for mm3 also.
>
> Searcher and archiver are interdependent *if* we want to share caches and
> data stores, which we probably do for any installation with larger archives
> where storing 2 copies vs 4 of each message would make a difference.  Plus,
> many archive views may be basically searches "messages in the last month"
> "messages which are replies to messageid $foo" etc.

Actually, as far as I can see, the summary/search/index/retrieval
functions depend only on the API for the message store.  If you
want, you can split this into the database layer and a presentation
layer, of course.  However, the database layer is surely going to
have its own schema optimized for the kinds of retrieval its
designer considers important.  Even if the designer emphasizes
threads, though, she is *not* going to try to store messages in
thread order or anything like that.  Rather, any reasonable store
will be message-ID-addressable.

The only tricky issue is that we *do* have to worry about
message-ID collisions of truly different messages and about
messages without message IDs, especially for converted
historical archives.  So the API needs to be able to deal
with these issues, probably by returning a set or sequence
of messages.
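
As a rough illustration of what such an API could look like, here is a
minimal Python sketch.  The class name MessageStore, its methods, and the
synthetic-key scheme for messages lacking a Message-ID are all hypothetical,
not anything that exists in Mailman.  The one point it shows is that a
lookup returns a list, so collisions and missing IDs need no special case
in the caller.

```python
import hashlib
from collections import defaultdict
from email import message_from_string


class MessageStore:
    """Toy in-memory store (hypothetical); lookups always return a list."""

    def __init__(self):
        self._by_msgid = defaultdict(list)

    def add(self, msg):
        """Store msg and return the key it was filed under."""
        msgid = msg.get("Message-ID")
        if msgid is None:
            # Assumption: messages without an ID get a synthetic key
            # derived from a hash of their content.
            digest = hashlib.sha1(msg.as_bytes()).hexdigest()
            msgid = "<%s@missing-id.invalid>" % digest
        else:
            msgid = msgid.strip()
        self._by_msgid[msgid].append(msg)
        return msgid

    def get(self, msgid):
        """Return every message filed under msgid (possibly none)."""
        return list(self._by_msgid.get(msgid, []))
```

Two different messages carrying the same Message-ID both come back from
get(), and it is left to the caller to decide how to present them.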

Oh, and we probably ought to have a more general notion
of retrievable "object" rather than just messages, as some
archive/retrieval backends may store some types of MIME
part separately.  Hopefully these would be presented to
us as MIME parts with external bodies and content IDs.

I would guess she'll probably store messages in
YY-MM/MSGID, or as git does in "unpacked"
XX/YYYYYYYY... format (where XX are the first two digits
of the hash ID, and YY... are the remaining ones).  But it
could easily be backed by an IMAP store or something
more specialized; we don't really care as long as it's
object-ID-addressable.
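
For concreteness, the XX/YYYY... layout can be sketched in a few lines of
Python.  The function name object_path is hypothetical, and hashing the
raw message bytes with SHA-1 is only an assumption made for illustration;
git itself hashes a header plus the content, so these are not git object
IDs, just the same directory fan-out.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(raw_message: bytes) -> PurePosixPath:
    """Map message bytes to a two-level XX/YYYY... relative path."""
    digest = hashlib.sha1(raw_message).hexdigest()
    # The first two hex digits name the directory, the rest name the file.
    return PurePosixPath(digest[:2]) / digest[2:]
```

Splitting on the first two hex digits spreads the objects over at most 256
directories, which keeps any single directory from growing too large.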

And that's all we want to say about the archiver and the
associated message-retrieval logic, I think.  (In fact, it occurs to
me that maybe we should say "RFC 3501" and be done with
it.  I don't mean that we necessarily implement IMAP protocol
per se, but some subset of its functionality probably is what we
need from an archiver.)

Then the schema-specific stuff will use hash IDs to represent
message objects in a portable but schema-specific way.  As
it's schema-specific, I don't really see how data structures
can be shared by different searchers.

So I would say not to worry about the archiver side at all.  If
large installations want to implement specialized message-
retrieval, bully for them.  But we can go with simple backends,
maildir, mbox, and maybe IMAP, I think.
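
To show why the simple backends are enough to start with, here is a small
sketch using Python's standard mailbox module, whose mbox and Maildir
classes share one interface.  The function name index_by_message_id is
hypothetical; the point is only that a searcher can build its own
message-ID index on top of whichever backend is in use.

```python
import mailbox


def index_by_message_id(box):
    """Map each Message-ID to the list of mailbox keys that carry it.

    Messages with no Message-ID header are grouped under the key None.
    Works for any mailbox.Mailbox subclass (mbox, Maildir, ...).
    """
    index = {}
    for key, msg in box.iteritems():
        msgid = msg.get("Message-ID")
        index.setdefault(msgid, []).append(key)
    return index
```

The same call works unchanged on mailbox.mbox(path) and
mailbox.Maildir(path); an IMAP backend would need a small adapter that
offers the same iteration.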