[Spambayes] Numeric python store, hammiefilter extension and mutt
Fri Nov 22 06:10:08 2002
So then, Adam Hupp <firstname.lastname@example.org> is all like:
> training: 220s
> update_prob: 3.2s
> score 1 msg: .45s
> score 6156 msgs: 58s
> training: 14s
> update_prob: 0.10s
> score 1 msg: .59s
> score 6156 msgs: 49s
Holy cow! That's impressive!
I'm no NumPy expert but it looks like you're taking advantage of some
sort of "do this on all elements of an array" function -- what the Cray
guys used to call vectorization. I imagine NumPy can optimize that sort
of loop much better than straight CPython and you'd get speeds close to
that of compiled C.
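As a rough illustration of that guess (not Adam's code), here is the same per-word ratio computed once with an ordinary Python loop and once as a single array expression. It assumes today's numpy package rather than the Numeric module of the time, and the counts are made up.

```python
import numpy as np

counts = [3, 0, 12, 7]   # made-up per-word spam counts
nspam = 20               # made-up number of trained spam messages

# Plain Python: the interpreter runs the division once per element.
ratios_loop = [c / nspam for c in counts]

# NumPy: one expression over the whole array; the per-element loop
# happens inside the library's compiled code.
ratios_vec = np.array(counts, dtype=float) / nspam
```

Both give the same numbers; the difference is only in where the loop runs.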
This is a totally killer idea, except that we just decided to move
probability computation out to individual WordInfo objects! The
thinking was--and testing seems to bear this out--that when most
transactions are small incremental updates and single message scoring
(instead of batches of messages), it's faster to compute individual
word probabilities as they're needed, since it saves a ton of I/O and
perhaps a lot of needless computation.
On the other hand, this could be of tremendous benefit to long-lived
processes like the pop3proxy and the Outlook plugin, which want to keep
the whole database around in memory.
Adam, would it be possible to abstract the Bayesian part of the
algorithm (the part done in update_probabilities) so that it could be
called either with a NumPy vector operation, or in a one-at-a-time
fashion by individual WordInfo objects? If you can think of a way to do
this, we can throw this in. Even if you can't think of a way to do it,
I think it might be worth it to have two implementations of the same
algorithm just for this roughly 15x training speedup (220s down to 14s).
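One possible shape for that abstraction, as a sketch only: a single function that accepts either plain numbers or whole arrays, so a WordInfo object and a batch update can share it. Everything here is an assumption for illustration: the name spam_prob, the stand-in WordInfo class, the constants, and the Robinson-style smoothing formula, which may not match what update_probabilities actually does. It also uses the modern numpy package.

```python
import numpy as np

# Hypothetical smoothing constants, chosen for illustration only.
UNKNOWN_WORD_PROB = 0.5
UNKNOWN_WORD_STRENGTH = 0.45


def spam_prob(spamcount, hamcount, nspam, nham,
              s=UNKNOWN_WORD_STRENGTH, x=UNKNOWN_WORD_PROB):
    """Smoothed spam probability for one word or for an array of words."""
    spamcount = np.asarray(spamcount, dtype=float)
    hamcount = np.asarray(hamcount, dtype=float)
    spamratio = spamcount / max(nspam, 1)
    hamratio = hamcount / max(nham, 1)
    total = spamratio + hamratio
    # Words never seen would give 0/0, so fall back to x for those.
    safe_total = np.where(total > 0, total, 1.0)
    raw = np.where(total > 0, spamratio / safe_total, x)
    n = spamcount + hamcount
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * raw) / (s + n)


class WordInfo:
    """Stand-in for a per-word record; not the real spambayes class."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def prob(self, nspam, nham):
        # One-at-a-time path: same function, scalar inputs.
        return float(spam_prob(self.spamcount, self.hamcount, nspam, nham))


# Batch path: the same function over every word's counts at once.
spam = np.array([10, 0, 3])
ham = np.array([0, 10, 3])
batch = spam_prob(spam, ham, nspam=100, nham=100)
single = [WordInfo(sc, hc).prob(100, 100) for sc, hc in zip(spam, ham)]
```

With this layout the long-lived processes could call the batch path on the in-memory database, while hammiefilter-style single-message runs would only touch the words they need.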
> I also modified hammiefilter to do untraining, retraining, and
> training on filter results. For example:
> hammiefilter.py --filter --train
> The incoming message is scored and filtered. If the result is not
> "Unsure" the classifier will be trained on it.
> hammiefilter.py --reverse --good --train
> The incoming message has previously been incorrectly marked as ham.
> --reverse will untrain the classifier and --train will retrain it on
> the message as spam.
> With these tools it's straightforward to set up macros in mutt to
> manage false negatives/positives and classify "Unsure" messages.
That's good stuff. I'll have to check the list archives because I know
the issue of auto-training has been discussed and probably beaten into
the ground by now. But first I want to get my branch merged in so
everybody else can witness my dementia ;)