[Spambayes] expiration ideas.

Tim Peters <tim.one@comcast.net>
Mon Oct 21 06:57:09 2002


[Anthony Baxter]
> Just thinking again about expiration, and wondering if the following
> would work:
>
>   When training new data (say a new week's worth), train it with a
>   new classifier ("interim"). Once it's trained, merge the interim
>   classifier's wordinfo into your master classifier wordinfo by adding
>   the new spamcounts and hamcounts to the master wordinfo blob, then
>   recalc probabilities.
>
>   Keep the "interim" wordinfo around (gzipped, datestamped) until your
>   expiration time is up - then undo the earlier merge, subtracting
>   the spamcount/hamcounts.
>
> Thoughts? Unless there's a screamingly obvious "don't be stupid" I'll
> play with this tomorrow (ah, leave....)

It's sure the most principled idea I've heard, in that it would always leave
the database corresponding exactly with *some* real-world collection of
msgs.
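
Roughly, the merge/unmerge bookkeeping could look like this -- a sketch
only, assuming wordinfo maps each token to a (spamcount, hamcount) pair;
the function names are invented, not actual classifier code:

    def merge(master, interim):
        # Fold the interim counts into the master wordinfo.
        for word, (spam, ham) in interim.items():
            s, h = master.get(word, (0, 0))
            master[word] = (s + spam, h + ham)

    def unmerge(master, interim):
        # Undo an earlier merge by subtracting the interim counts.
        for word, (spam, ham) in interim.items():
            s, h = master[word]
            s -= spam
            h -= ham
            if s == 0 and h == 0:
                del master[word]    # word no longer in any trained msg
            else:
                master[word] = (s, h)
        # Recalc probabilities afterward, same as after any training.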

OTOH, what's the purpose of expiration?  I can think of two:

1. To reduce database size.

2. To accelerate adaptation to changes in ham and/or spam.

I don't know that #2 is a real problem, and there's some reason to doubt it.
Over the weekend, I tried my c.l.py ham + bruceg spam classifier on newer data
Greg Ward harvested from all non-personal python.org traffic (which turns
out to be partly untrue:  python.org also hosts a few small & unadvertised
"hobby lists" I didn't know about, and they count as "personal email" to
me).

Anyway, the c.l.py classifier had a very high FP rate, especially on the
"hobby list" traffic.  But its FN rate was identical to that of a classifier
trained from scratch on the new data:  1 FN, under chi's rules for FN.

This suggests that everyone is right in believing that spam is much the
same.  So far as changes in ham go, it suggests that a significantly new
source of ham needs to be trained on ASAP, lest it be viewed as spam.

About #1, there are lots of things that haven't been tested properly, the
most obvious being to purge unique words (words seen exactly once across all
the training data) from the database immediately after training.  That should
cut the database size in half with one quick and easy
stroke.  Whether it hurts performance is unknown.
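
For concreteness, the purge could be this simple (same invented
token -> (spamcount, hamcount) mapping as in the sketch above):

    def purge_hapaxes(wordinfo):
        # Drop words seen exactly once across all the training data.
        for word in list(wordinfo):
            spam, ham = wordinfo[word]
            if spam + ham == 1:
                del wordinfo[word]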

At the start, my favorite gimmick was embodied in the atime attr of WordInfo
records:  remember the most recent time a word was used in scoring, and get
rid of words that haven't been used "recently".  If they're not being used,
then getting rid of them can't affect accuracy.  It addresses both #1 and
#2, but #1 on a revolving-door basis, and #2 in only a very weak sense.
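
In sketch form -- the atime attr is real, but the helper and the 120-day
cutoff are invented for illustration, assuming atime holds seconds since
the epoch, stamped whenever a word is used in scoring:

    import time

    def expire_unused(wordinfo, max_age_days=120):
        # Remove records whose atime shows they haven't been used in
        # scoring "recently".  If a word isn't being used, dropping it
        # can't affect accuracy.
        cutoff = time.time() - max_age_days * 24 * 60 * 60
        for word in list(wordinfo):
            if wordinfo[word].atime < cutoff:
                del wordinfo[word]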