patch to make GBayes work with maildir and anydbm

Tim Peters tim_one at email.msn.com
Sat Aug 24 04:06:43 EDT 2002


[Neale Pickett]
> I've been playing around with GBayes.py and classifier.py from python
> CVS.
> ...
> However, GBayes.py used up over 76M of RAM while it was running, so I
> don't think right now it's very practical for a multi-user system.

Heh.  It's not even intended to be practical on a single-user system,
Neale -- it's not production code, it's written the way it is to make it
easy to do algorithm research.  As you've discovered, despite that it's not
yet even bothering to skip encoded binary gibberish, it's *faster* than
"scalable" alternatives, and turnaround time is much more important than
resource consumption while we're in the research phase (which means running
many variations of the algorithm on many data sets, accumulating evidence
(instead of anecdotes) about what does and doesn't work).

I've got the background to do this kind of hard-hearted <wink> triage, but
not much time to give to it, so the code is likely to stay of marginal
practical value for some time to come.  You're certainly welcome to play
with it, but I can't dilute my time on this more than it already is, so if
you want to make a practical spinoff, I'd suggest forking the code and
starting a new project on SourceForge.  Provided you don't go nuts
hyper-optimizing it, you should be able to pick up algorithmic improvements
as they're made.  The algorithm right now has a number of questionable
aspects (which I've already spent too much time typing about on Python-Dev).

BTW, it should require only minor changes to make it use a persistent
OOBTree under ZODB instead of a dict (it was designed with that transition
in mind, but that's premature right now).  Or if you want to save a lot of
work, Eric Raymond is keeping an eye on GBayes and will probably track good
algorithmic changes in his bogofilter project:

    <http://www.tuxedo.org/~esr/bogofilter/>

Eric is very keen to minimize time and space requirements, and bogofilter
already runs circles around most other incarnations of this kind of thing.
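
For a rough idea of what swapping the dict for a persistent mapping can
look like, here is a minimal sketch.  It is not GBayes or classifier.py
code: WordCounts, train, and the (spam, ham) tuple layout are made up for
illustration, and the stdlib shelve module stands in for a ZODB OOBTree
(which would also need a ZODB connection and transaction commits that this
sketch doesn't show).  The only point is that code written against the
plain mapping interface doesn't care which kind of store it is handed.

```python
import os
import shelve
import tempfile


class WordCounts:
    """Toy per-token counter (hypothetical; not the GBayes classifier)."""

    def __init__(self, store):
        # Any mapping with string keys will do: a dict, a shelf, ...
        self.store = store

    def learn(self, tokens, is_spam):
        for tok in tokens:
            spam, ham = self.store.get(tok, (0, 0))
            if is_spam:
                spam += 1
            else:
                ham += 1
            # Reassign the whole value so a persistent store sees the change.
            self.store[tok] = (spam, ham)

    def counts(self, tok):
        return self.store.get(tok, (0, 0))


def train(store):
    """Feed the same two toy messages into whatever store is passed in."""
    wc = WordCounts(store)
    wc.learn(["free", "money"], is_spam=True)
    wc.learn(["free", "lunch"], is_spam=False)
    return wc


# In-memory: fast, but everything lives in RAM and is gone at exit.
mem = train({})

# On disk: same code, backed by a dbm file via shelve.
db_path = os.path.join(tempfile.mkdtemp(), "counts")
with shelve.open(db_path) as db:
    train(db)

# Reopen the file to check that the counts survived the close.
with shelve.open(db_path) as db:
    persisted = WordCounts(db).counts("free")
```

Writing the whole tuple back on every update, instead of mutating a stored
object in place, is what lets the same code run on both a dict and a shelf;
a shelf opened without writeback would not notice in-place mutations.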




