[Mailman-Developers] Idea
John Viega
viega@list.org
Sat, 13 Jun 1998 15:50:53 -0700
On Sat, Jun 13, 1998 at 05:55:41PM -0400, Corbett J. Klempay wrote:
>
> - how big of an archive should this be scalable to?
I'd say as big as possible; I don't know that I can give a better
answer than that. Target the most heavily trafficked mailing list
you've been on.
> I'm thinking this because some models (like the vector model used for my
> project) get good accuracy, but suck as far as resource usage (like our
> engine dealt with ~2000 text documents from an online database and would
> suck up ~80 MB of RAM per query, and would take almost 20 seconds just to
> load the pre-indexed document clusters from disk; the query itself only
> takes like 1-2 seconds on a K6-233, but the startup time of 15+ seconds
> blows). With large corpora, it might be necessary to implement some
> persistence; it just takes so annoyingly long to load even pre-indexed
> stuff from disk (and our queries were on a K6-233 with 128 MB;
> heheh...think how a P-90 with 32 MB would fare :) (or worse yet!)
Well, you have several options. You could keep a persistent server
up, but I wouldn't make it a requirement; perhaps an option. I think
that if complex search capabilities aren't desired, the grep libraries
would be an OK first pass.
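To make the "grep as a first pass" idea concrete, here is a minimal sketch of a brute-force search over an archive directory. The one-message-per-file layout and the `archive_dir` argument are assumptions for illustration, not Pipermail's actual storage scheme:

```python
import os
import re

def search_archive(archive_dir, pattern):
    """Brute-force search: scan every file under archive_dir for a
    case-insensitive regex pattern and return matching file paths.
    No index, no persistence -- slow on big archives, but zero extra
    disk space and trivially simple."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, errors="replace") as f:
                if regex.search(f.read()):
                    hits.append(path)
    return sorted(hits)
```

This is exactly the "slow as hell on a large archive" option discussed below, but it needs nothing beyond the standard library.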
> - What kind of structure does Pipermail store the archive in?
This is the biggest problem right now. Andrew says you can plug in
any sort of back-end you want for a database, as long as it can handle
a tree-type structure. Unfortunately, the only such backend
implemented is not portable. Everything needs to work out of the box
with a vanilla Python installation. I was thinking that someone could
write a backend that uses the file system for that tree structure...
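A filesystem-backed tree could look something like the sketch below: directories as tree nodes, with message bodies in plain files, so it runs anywhere Python does. The class name, the node-path convention, and the `data` filename are all illustrative assumptions, not Pipermail's real schema:

```python
import os

class FSTree:
    """Sketch of a tree-structured archive backend built on nothing
    but the file system. Each node is a directory; a node's payload
    lives in a file named 'data' inside it."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key_path):
        # key_path is a sequence of node names,
        # e.g. ("1998", "June", "0042")
        return os.path.join(self.root, *key_path)

    def put(self, key_path, text):
        node = self._path(key_path)
        os.makedirs(node, exist_ok=True)
        with open(os.path.join(node, "data"), "w") as f:
            f.write(text)

    def get(self, key_path):
        with open(os.path.join(self._path(key_path), "data")) as f:
            return f.read()

    def children(self, key_path=()):
        node = self._path(key_path)
        return sorted(n for n in os.listdir(node)
                      if os.path.isdir(os.path.join(node, n)))
```

The appeal is portability: no external database, just directories, which any vanilla Python installation can handle.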
> - Did you have any idea about what kind of search interface? Were you
> thinking a text field with Boolean capability, or just letting them throw
> some words in the field and see what sticks?
Well, for a first pass, something simple will do, but the nicer you
can make it, the better off we'll be...
> - Were you thinking of an engine that has an indexing process that runs
> via a cron job or something, or something much simpler? (like one that
> just brute force searches through the text of the archives for each query;
> that would be slow as hell if the archive was large, but would take no
> additional disk space and wouldn't really require persistence).
It'd be nice to have a per-list setting for this one. For most lists,
something simple would do, but I run a few where indexing would
certainly be better...
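For the lists where indexing would pay off, the cron-job approach might be sketched like this: an offline pass builds a word-to-files inverted index and pickles it to disk, and queries just load and probe it. The function names, the pickle format, and the flat-files archive layout are assumptions for illustration only:

```python
import os
import pickle
import re

WORD = re.compile(r"[A-Za-z]+")

def build_index(archive_dir, index_file):
    """Offline pass (e.g. run from cron): map each lowercased word
    to the set of archive files containing it, and pickle the map."""
    index = {}
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, errors="replace") as f:
                for word in WORD.findall(f.read()):
                    index.setdefault(word.lower(), set()).add(path)
    with open(index_file, "wb") as f:
        pickle.dump(index, f)

def query_index(index_file, word):
    """Online query: load the pickled index and look up one word."""
    with open(index_file, "rb") as f:
        index = pickle.load(f)
    return sorted(index.get(word.lower(), set()))
```

This trades extra disk space and a periodic rebuild for fast queries; the brute-force scan trades the other way, which is why a per-list setting makes sense.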
John