[spambayes-dev] Reduced training test results

Thu Dec 25 15:24:04 EST 2003

Training on just those messages whose score isn't 0.00 or 1.00
(rounded) seems to be a huge win over training on everything,
not so much because the accuracy is better (though accuracy
does seem to improve when training skips those messages that the
classifier is already certain about), but because of a hugely
reduced training set (and thus database).  Specifically, training
on everything yielded a database with 70,000 messages, while
training only on the non-extreme messages put only about 3,500
messages into the database, roughly 5% as many.
Unfortunately, I don't have firm numbers on token counts.

Also of significant interest is that the classifier doesn't seem
to decay as badly over time.  With training on everything, the
unsure rate in particular (and fn to a much lesser extent) goes
up significantly after about 200 days' worth of traffic, though
the fp rate stays low.  With just training on those things that
aren't already certain, the unsure rate climbs much more slowly
after 200 days (with the cumulative rate staying relatively flat),
while the fp and fn rates stay at very low values.

Details of my experiment parameters:

I've got about 77000 messages in my dataset, covering a span of
418 days.  Of these, about 21500 are ham, and nearly 56000 are spam.
I include virus/worm messages in my spam, and the "latest windows
update" worm makes its presence felt around day 360.

I divided my dataset into 10 subsets, and ran the incremental.py
harness over these 10 times, excluding 1 set each time, as per normal
cv-ish behaviour.  Thus, each of my measurements is replicated 10
times, with slightly different input data.
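
For anyone who wants to reproduce the split, here is a minimal
sketch of the leave-one-set-out arrangement described above.  It is
not the incremental.py code: the function name is hypothetical, and
the round-robin assignment of messages to sets is an assumption (the
post doesn't say how the 10 subsets were chosen).

```python
def leave_one_set_out(messages, n_sets=10):
    """Yield (excluded_set, kept_messages) for each of n_sets runs.

    Assumption: message i belongs to set i % n_sets (round-robin).
    The kept messages stay in their original (e.g. date) order, so
    they can be fed to an incremental harness as a time-ordered stream.
    """
    for excluded in range(n_sets):
        kept = [m for i, m in enumerate(messages) if i % n_sets != excluded]
        yield excluded, kept


if __name__ == "__main__":
    # Stand-in message ids; a real run would use the actual corpus.
    corpus = list(range(100))
    for excluded, kept in leave_one_set_out(corpus):
        print(excluded, len(kept))
```

Each run sees about 90% of the data, which is consistent with the
roughly 70,000 messages in the train-on-everything database.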

Finally, I did the above-mentioned 10 runs using both the 'perfect'
and 'nonedge' regimes.  The 'perfect' regime trains on every message
using the proper ham/spam classification, while the 'nonedge'
regime trains only on those messages that were not already
classified correctly with a (rounded) score of 0.00 or 1.00.
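
The difference between the two regimes boils down to one decision
per message.  The sketch below is my reading of the definitions
above, not the harness's actual code: should_train is a hypothetical
name, and I'm assuming "rounded" means rounded to two decimal places
and that a confidently *wrong* score (say, a spam at 0.00) still
gets trained under 'nonedge'.

```python
def should_train(regime, score, is_spam):
    """Decide whether to train on a message after it has been scored.

    regime  -- 'perfect' or 'nonedge'
    score   -- spam probability in [0.0, 1.0] from the classifier
    is_spam -- the true (human-assigned) classification
    """
    if regime == "perfect":
        # Train on every message with its proper classification.
        return True
    if regime == "nonedge":
        rounded = round(score, 2)  # assumption: two decimal places
        if is_spam:
            already_certain = rounded == 1.00
        else:
            already_certain = rounded == 0.00
        # Skip only messages that were correct at an extreme score.
        return not already_certain
    raise ValueError("unknown regime: %r" % (regime,))
```

So a ham scoring 0.004 is skipped under 'nonedge', while a ham
scoring 0.02 (or a spam scoring 0.004) is still trained.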

I've plotted both the cumulative and 7-day average values for
error rates (fp, fn, and unsure) and training counts (ham and spam).
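
To make the two kinds of curve concrete, here is a rough sketch of
how they can be computed from per-day counts.  The function name is
hypothetical, and treating the "7-day average" as a trailing 7-day
window with pooled counts is an assumption; the plots may have been
produced differently.

```python
def cumulative_and_windowed(daily_errors, daily_totals, window=7):
    """Return (cumulative_rates, windowed_rates), one entry per day.

    daily_errors -- per-day count of one error type (fp, fn, or unsure)
    daily_totals -- per-day count of messages scored
    """
    cumulative, windowed = [], []
    err_sum = tot_sum = 0
    for day in range(len(daily_totals)):
        # Cumulative rate: everything seen from day 0 through today.
        err_sum += daily_errors[day]
        tot_sum += daily_totals[day]
        cumulative.append(err_sum / tot_sum if tot_sum else 0.0)

        # Windowed rate: only the trailing `window` days, pooled.
        lo = max(0, day - window + 1)
        w_err = sum(daily_errors[lo:day + 1])
        w_tot = sum(daily_totals[lo:day + 1])
        windowed.append(w_err / w_tot if w_tot else 0.0)
    return cumulative, windowed
```

The cumulative curve smooths over late changes because of all the
history behind it, while the windowed curve reacts quickly to a
burst such as the worm traffic around day 360.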

Pictures (and a copy of this writeup) are on my website at:
  http://www.wolfskeep.com/~popiel/spambayes/nonedge

- Alex

PS. Sorry this took so long, but running the perfect regime
    on such a large dataset took a couple days on my machine...
    I need more memory! ;-)