[spambayes-dev] More obvious logarithmic expiration data

Matthew Dixon Cowles matt at mondoinfo.com
Sat Jun 7 16:10:32 EDT 2003


I mentioned a while ago that I'd do a little more work based on the
statistics I had collected, which showed that tokens that figured in
scoring were likely to be used for scoring again soon.

I instrumented classifier.py and hammie.py to compute and log several
extra scores each time a message is scored. Alongside the regular
score, SpamBayes also computes scores using only those tokens that
had been used in scoring in the previous 24 hours, the previous week,
the previous two weeks, and the previous 30 days.
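
In outline, a restricted score works something like the sketch below.
This isn't the code I added to classifier.py; the names
(restricted_score, clues, last_used) are made up for the example, and
the straight log-probability combining stands in for SpamBayes'
chi-squared combiner:

import math
import time

def restricted_score(clues, last_used, window_seconds):
    """Score using only tokens used in scoring within the last
    window_seconds.  clues maps token -> spam probability and
    last_used maps token -> time of last use in scoring; both
    mappings are assumptions for this sketch, not real SpamBayes
    data structures."""
    now = time.time()
    recent = [p for tok, p in clues.items()
              if now - last_used.get(tok, 0.0) <= window_seconds]
    if not recent:
        return 0.5                        # no usable evidence
    # Clamp so the logs are always defined; SpamBayes clamps its
    # probabilities anyway, so this is belt and suspenders.
    recent = [min(max(p, 1e-6), 1.0 - 1e-6) for p in recent]
    log_spam = sum(math.log(p) for p in recent)
    log_ham = sum(math.log(1.0 - p) for p in recent)
    diff = min(log_ham - log_spam, 700.0) # keep exp() in range
    return 1.0 / (1.0 + math.exp(diff))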

Here are some results:

2587 sets of scores processed
Number of scores that differ from actual score
by 0.00           6885
by 0.01 or less    633
by 0.10 or less    179
by 0.20 or less     32
by more than 0.20   32

(The repeated 32 isn't a bug, I checked. And the counts add up to
7761 rather than 2587 because each set yields three restricted
scores, one day, one week, and two weeks, each compared against the
30-day score.)

Because of a flaw in the way I set up the log, the "actual" score
isn't quite the actual score. Rather, it's the score computed using
only tokens that had been used in the last 30 days. But I'm convinced
that it's very near the actual score.

Also encouragingly, the score changes don't seem to move scores out
of the standard 0.0-0.2 (ham) and 0.9-1.0 (spam) ranges much:

                         Moved out of spam   Moved out of ham
Restricted to one day                   13                  8
Restricted to one week                   3                  0
Restricted to two weeks                  2                  0

If I were cleverer, I'd have guessed all this from the number of
posts in which people have said that they've trained SpamBayes on
only a couple of hundred emails and that it's already working well
for them. But then I wouldn't have the fabulous collection of
ambiguous and invalid data that I gathered before getting around to
looking at how often tokens are used for scoring <wink>.

Judging from this data, I could relatively painlessly use a database
that contains only those tokens that have figured in scoring in the
last ten days or so. That's about 11% of the 273487 tokens in my
database.
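
A pruning pass could be as simple as the sketch below. It assumes
each token's record grows a last_used timestamp that gets stamped
whenever the token figures in scoring; today's WordInfo has no such
field, so that's the part that would need adding:

import time

TEN_DAYS = 10 * 24 * 60 * 60

def expire_unused(wordinfo, max_age=TEN_DAYS):
    """Drop tokens that haven't figured in scoring for max_age
    seconds.  wordinfo maps token -> record; the last_used
    attribute on each record is assumed for this sketch."""
    cutoff = time.time() - max_age
    stale = [tok for tok, info in wordinfo.items()
             if info.last_used < cutoff]
    for tok in stale:
        del wordinfo[tok]
    return len(stale)

Run now and then (say, whenever the database is saved), that ought
to keep the database at roughly the 11% figure above.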

You'd need to bootstrap the process, presumably by counting a token
as used when it's first trained on. Waiting for a token to be used
before making it eligible for use has a certain theoretical elegance
but results might suffer <wink>.
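
Concretely, bootstrapping just means stamping last_used at training
time, something like the sketch below (TokenRecord is a stand-in for
WordInfo, not the real class):

import time

class TokenRecord:
    """Stand-in for WordInfo, with the extra last_used field."""
    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0
        self.last_used = time.time()   # training counts as a use

def learn(wordinfo, tokens, is_spam):
    """Train on tokens; new tokens get last_used set right away,
    so they aren't expired before they ever get to score."""
    for tok in tokens:
        info = wordinfo.setdefault(tok, TokenRecord())
        if is_spam:
            info.spamcount += 1
        else:
            info.hamcount += 1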

And, of course, it's not really time that counts but rather the
number of emails seen. I seem to get something like 150 emails per
day, so that 10-day period is really 1500 emails scored. Adding an
extra field to the "saved state" entry to record the number of emails
scored, and stamping that count in the WordInfo record whenever its
token is used, seems practical on the face of it.
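
In other words: keep one running count of messages scored in the
saved state, stamp that count into a token's record whenever the
token is used, and expire on the difference rather than on the
clock. Again a sketch with made-up names (msgs_scored, last_scored),
not the real saved-state or WordInfo layout:

class SavedState:
    """Stand-in for the pickled classifier state, plus a counter."""
    def __init__(self):
        self.msgs_scored = 0   # bumped once per message scored
        self.wordinfo = {}     # token -> record with last_scored

def note_scoring(state, tokens_used):
    """Call after scoring a message: bump the counter and stamp
    the tokens that actually figured in the score (so they are
    already present in wordinfo)."""
    state.msgs_scored += 1
    for tok in tokens_used:
        state.wordinfo[tok].last_scored = state.msgs_scored

def expire(state, max_messages=1500):  # ~10 days at 150 msgs/day
    """Drop tokens not used in scoring in the last max_messages
    messages.  Records never stamped count as 0 here; with the
    bootstrapping above they'd be stamped at training time."""
    cutoff = state.msgs_scored - max_messages
    stale = [tok for tok, info in state.wordinfo.items()
             if getattr(info, "last_scored", 0) < cutoff]
    for tok in stale:
        del state.wordinfo[tok]

Counting messages rather than days also has the nice side effect
that nothing expires while the machine is off.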

Ironically, I started collecting these statistics when I was using a
laptop with a tiny hard disk. Now, with 60G at my disposal, the 23M
that my database takes up is pretty trifling.

Regards,
Matt



