Re: [Mailman-Developers] Google Summer of Code: Integration of Search Code
On 03/27/2012 03:31 AM, Shayan Md wrote:
I was working on mm3. But systers' indexer/searcher was implemented for mailman2, so it should be easy to integrate it with mm2.
Actually, the systers indexer was designed to work with mboxes (because I had a pile of data in that format that the students could use) but otherwise knows pretty much nothing about mailman 2 or 3. Other than handling mbox instead of maildir, which is only a matter of changing parsers, it shouldn't matter which it's integrated with. This was a design decision at the time, as Mailman 3 was coming but still too incomplete to test with when the code was written.
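To illustrate the "only a matter of changing parsers" point: a minimal sketch using Python's standard `mailbox` module, which reads both mbox and maildir. The helper names `open_archive` and `iter_messages` are made up for this example and are not part of the systers code or of Mailman.

```python
import mailbox
import os


def open_archive(path):
    """Open an archive as Maildir (a directory) or mbox (a single file)."""
    if os.path.isdir(path):
        # create=False: fail instead of silently creating an empty Maildir.
        return mailbox.Maildir(path, create=False)
    return mailbox.mbox(path, create=False)


def iter_messages(path):
    """Yield (message_id, subject) pairs regardless of the on-disk format."""
    box = open_archive(path)
    try:
        for msg in box:
            # Either header may be missing, in which case get() returns None.
            yield msg.get("Message-ID"), msg.get("Subject")
    finally:
        box.close()
```

An indexer written against `iter_messages` would not need to know which format sits underneath, which is roughly the independence described above.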
Looks like the archiver for mm3 is still in the development stage. As far as I understand, the searcher depends on the archiver, right? Not completely, but it somewhat depends on the archiver. I am not sure if the searcher can be implemented without the archiver. If possible, I can implement it for mm3 also.
Searcher and archiver are interdependent *if* we want to share caches and data stores, which we probably do for any installation with larger archives, where storing 2 copies vs 4 of each message would make a difference. Plus, many archive views may be basically searches: "messages in the last month", "messages which are replies to messageid $foo", etc.
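As a rough sketch of "archive views are basically searches", the two example views can be written as plain filters over parsed messages. The function names are hypothetical, and a real searcher would answer these from an index rather than by scanning every message.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def sent_at(msg):
    """Return the Date header as an aware UTC datetime, or None if unusable."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        when = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # A "-0000" zone parses as naive; this sketch treats it as UTC.
        when = when.replace(tzinfo=timezone.utc)
    return when.astimezone(timezone.utc)


def messages_since(messages, cutoff):
    """The 'messages in the last month' view: keep messages dated >= cutoff."""
    result = []
    for msg in messages:
        when = sent_at(msg)
        if when is not None and when >= cutoff:
            result.append(msg)
    return result


def replies_to(messages, message_id):
    """The 'replies to messageid $foo' view: match on In-Reply-To."""
    return [m for m in messages
            if (m.get("In-Reply-To") or "").strip() == message_id]
```

Both views need the same per-message fields (date, message ID, In-Reply-To), which is the kind of overlap that makes a shared data store attractive.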
Ideally, anyone working on search will interact heavily with the archiver and probably usability folk at the beginning so that you can figure out what data structures you need to store and index and what use cases you'll need to make fast.
Terri
On Wed, Mar 28, 2012 at 4:21 AM, Terri Oda <terri@zone12.com> wrote:
Looks like the archiver for mm3 is still in the development stage. As far as I understand, the searcher depends on the archiver, right? Not completely, but it somewhat depends on the archiver. I am not sure if the searcher can be implemented without the archiver. If possible, I can implement it for mm3 also.
Searcher and archiver are interdependent *if* we want to share caches and data stores, which we probably do for any installation with larger archives, where storing 2 copies vs 4 of each message would make a difference. Plus, many archive views may be basically searches: "messages in the last month", "messages which are replies to messageid $foo", etc.
Actually, as far as I can see, the summary/search/index/retrieval functions depend only on the API for the message store. If you want, you can split this into the database layer and a presentation layer, of course. However, the database layer is surely going to have its own schema optimized for the kinds of retrieval its designer considers important. If the designer emphasizes threads, however, she is *not* going to try to store messages in thread order or anything like that. Rather, any reasonable store will be message-ID-addressable.
The only tricky issue is that we *do* have to worry about message-ID collisions of truly different messages and about messages without message IDs, especially for converted historical archives. So the API needs to be able to deal with these issues, probably by returning a set or sequence of messages.
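A minimal sketch of what such an API might look like, assuming a toy in-memory store. The class name `MessageStore`, its methods, and the use of SHA-1 over the raw bytes as the object ID are all assumptions made for illustration, not an existing Mailman interface. Lookup by message ID returns a list, so collisions and misses are both ordinary results, and messages without a Message-ID stay reachable by hash.

```python
import hashlib
from collections import defaultdict
from email import message_from_bytes


class MessageStore:
    """Toy store, addressable by content hash and by Message-ID."""

    def __init__(self):
        self._by_hash = {}                  # content hash -> raw bytes
        self._by_msgid = defaultdict(list)  # Message-ID -> [content hash, ...]

    def add(self, raw):
        """Store raw message bytes and return the content hash."""
        digest = hashlib.sha1(raw).hexdigest()
        if digest not in self._by_hash:
            self._by_hash[digest] = raw
            msgid = message_from_bytes(raw).get("Message-ID")
            if msgid is not None:
                # Truly different messages may share an ID; keep them all.
                self._by_msgid[msgid.strip()].append(digest)
        return digest

    def get_by_hash(self, digest):
        """Return the single message with this hash (KeyError if absent)."""
        return self._by_hash[digest]

    def get_by_message_id(self, msgid):
        """Return every stored message claiming this Message-ID, maybe none."""
        return [self._by_hash[h] for h in self._by_msgid.get(msgid, [])]
```

Callers that expect exactly one message have to decide what to do with zero or several, which pushes the collision policy out of the store and into the searcher or presentation layer.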
Oh, and we probably ought to have a more general notion of retrievable "object" rather than just messages, as some archive/retrieval backends may store some types of MIME part separately. Hopefully these would be presented to us as MIME parts with external bodies and content IDs.
I would guess she'll probably store messages in YY-MM/MSGID (or, as git does, in "unpacked" XX/YYYYYYYY... format, where XX are the first two digits of the hash ID, and YY... are the remaining ones). But it could easily be backed by an IMAP store or something more specialized; we don't really care as long as it's object-ID-addressable.
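For concreteness, a small sketch of the git-style "unpacked" layout. The function name `object_path`, the `archive` root, and hashing the raw message bytes with SHA-1 are assumptions for this example; git itself hashes a type-and-length header together with the content, so these IDs would not match git's.

```python
import hashlib
from pathlib import PurePosixPath


def object_path(raw, root="archive"):
    """Map raw message bytes to root/XX/YYYY..., split after two hex digits."""
    digest = hashlib.sha1(raw).hexdigest()
    # The two-digit prefix spreads objects over at most 256 subdirectories.
    return PurePosixPath(root) / digest[:2] / digest[2:]
```

The two-level split keeps any single directory from growing too large, and the path depends only on the object ID, so the layout stays object-ID-addressable.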
And that's all we want to say about the archiver and the associated message-retrieval logic, I think. (In fact, it occurs to me that maybe we should say "RFC 3501" and be done with it. I don't mean that we necessarily implement the IMAP protocol per se, but some subset of its functionality is probably what we need from an archiver.)
Then the schema-specific stuff will use hash IDs to represent message objects in a portable but schema-specific way. As it's schema-specific, I don't really see how data structures can be shared by different searchers.
So I would say not to worry about the archiver side at all. If large installations want to implement specialized message-retrieval, bully for them. But we can go with simple backends, maildir, mbox, and maybe IMAP, I think.
participants (2)
- Stephen J. Turnbull
- Terri Oda