[Spambayes] Re: CRM114 in November breaks 99.9%. :-)
Bill Yerazunis
wsy at merl.com
Tue Dec 3 14:51:59 2002
From: Brian Burton <brian@burton-computer.com>
> Training a particular incarnation of CRM114 usually takes a week or
> two; I read my mail (both categories) and when I find a piece of mail
> misclassified, I train that one piece into the filter.
Training only on errors after a cut-off point is interesting. Why do you
do this? Is there a reason not to increment the good/spam counts for terms
in every email? Is it to avoid overflowing the counts in your hash table
or is this likely to be more accurate since it keeps the message counts
small?
The reason I started doing it is that I used "unsigned char" as the
counters in the big hash tables, to keep them as small as reasonable
(remember, we're doing really _random_ accesses of these files and we
thrash virtual memory and cache like crazy). The bin incrementer is
"smart" in that it won't wrap past 255, but it is losing data at that
point, and losing it on the _most_ significant features.
I did consider "uncorking" the values up to unsigned int16, but I
haven't had a good justification to do that yet. It's a simple change
and if there's a need, it'll happen.
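To make the saturating behaviour concrete, here is a minimal sketch in Python. It is not the CRM114 source (which is C); the function name `bump` and the `max_count` parameter are made up for illustration. It only shows the idea of a counter that stops at its ceiling instead of wrapping, and what widening the bins to 16 bits would buy.

```python
from array import array


def bump(bins, index, max_count=255):
    """Increment bins[index], clamping at max_count instead of wrapping.

    Hypothetical helper: once a bin sits at max_count, further hits
    are silently dropped, which is the data loss described above.
    """
    if bins[index] < max_count:
        bins[index] += 1
    return bins[index]


# Eight 8-bit bins, standing in for "unsigned char" counters.
narrow = bytearray(8)
for _ in range(300):          # more hits than one byte can record
    bump(narrow, 3)

# The same experiment with 16-bit bins ("unsigned int16").
wide = array("H", [0] * 8)
for _ in range(300):
    bump(wide, 3, max_count=65535)

print(narrow[3], wide[3])
```

The 8-bit bin stops counting at its ceiling while the 16-bit bin keeps the full count, at the price of doubling the table size.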
> After a couple of days the errors get very sparse; after two or three
> weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to
a desirable accuracy before beginning to measure your accuracy? Is the
transition from training to performance measurement based on a
predetermined arbitrary cut off (i.e. 1,000 emails, x% of messages in
corpus, or 14 calendar days of training) or based on the accuracy rising to
a certain level?
It's measured intuitively, by when I find I'm just not getting enough
errors to keep my attention in training. This _is_ human-guided
training, mind you. Other influences on when to start are "it's the
start of November, start getting data" and "now that the BCR has that
nasty underflow problem fixed and the data has settled down, let's get
numbers".
The other issue that can't be dodged is that spam is not ergodic; spam
evolves in fits and starts; my spam of 1996 is very different than my
spam of 2002. Any filter that is trained and tested against data
statically is operating "in vitro", a necessary and useful scientific
measure, but it misses the point of how well a spam filter can retrain
on the fly against evolution in action.
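As a rough illustration of train-on-error over a stream of mail, here is a toy sketch in Python. The `ToyFilter` class and its token-count scoring are stand-ins invented for this example, not the CRM114 classifier, and the `truth` label stands in for the human reading the mail and spotting a misclassification.

```python
class ToyFilter:
    """Hypothetical word-count classifier, for illustration only."""

    def __init__(self):
        self.counts = {"spam": {}, "good": {}}

    def classify(self, text):
        tokens = text.lower().split()
        spam = sum(self.counts["spam"].get(t, 0) for t in tokens)
        good = sum(self.counts["good"].get(t, 0) for t in tokens)
        # Ties (including "never seen any of this") go to "good".
        return "spam" if spam > good else "good"

    def train(self, text, label):
        table = self.counts[label]
        for t in text.lower().split():
            table[t] = table.get(t, 0) + 1


def train_on_errors(filt, stream):
    """Train only the messages the filter gets wrong; return the error count."""
    errors = 0
    for text, truth in stream:
        if filt.classify(text) != truth:
            filt.train(text, truth)
            errors += 1
    return errors


mail = [
    ("buy cheap pills now", "spam"),
    ("lunch meeting at noon", "good"),
    ("cheap pills cheap", "spam"),
    ("meeting notes attached", "good"),
]

filt = ToyFilter()
first_pass = train_on_errors(filt, mail)
second_pass = train_on_errors(filt, mail)
print(first_pass, second_pass)
```

Correctly classified mail never touches the counters, so the counts grow only as fast as the error rate, and a new style of spam gets trained in the first time it slips through.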
The training period coincidentally works out to be about 2+ weeks of
training, and co-coincidentally I usually have just a few bins in the
hash table maxing out about then. (right now I've got 7 bins out of a
million maxed out in the spam hashtable, and 5 bins out of a million
maxed out in the nonspam hashtable.) If I were to find that I
was maxing out a significant number of bins (say, hundreds) I'd
rebuild with unsigned int16 bins and accept the performance hit.
(yes, this is a very "engineering" style approach; I'm not a good
mathematician, so I just do experiments and report on what comes back.)
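The "how many bins are maxed out" check can be sketched in a few lines of Python. This is not the tool that produced the listing below; the function names are invented, and the cutoff of 100 saturated bins is only one reading of "say, hundreds".

```python
from collections import Counter


def histogram_lines(bins):
    """Return one 'bin value V found N times' line per distinct value."""
    counts = Counter(bins)
    return [f"bin value {v} found {n} times" for v, n in sorted(counts.items())]


def saturated(bins, max_count=255):
    """Count the bins sitting at the counter ceiling."""
    return sum(1 for v in bins if v == max_count)


def should_widen(bins, max_count=255, threshold=100):
    """Assumed rule of thumb: widen the bins once `threshold` are maxed out."""
    return saturated(bins, max_count) >= threshold


# A tiny made-up table; a real .css file has about a million bins.
demo = bytearray([0, 0, 0, 1, 1, 2, 255, 255])
lines = histogram_lines(demo)
for line in lines:
    print(line)
print("maxed out:", saturated(demo), "widen:", should_widen(demo))
```

With only a handful of saturated bins out of a million, a check like this says to leave the 8-bit counters alone.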
For those of you with exceptionally high boredom thresholds, the
current under-test spectra histograms follow. They do exhibit a
comforting long distribution tail.
-Bill Y.
Sparse spectra file spam.css has 1048577 bins total
total number of hash datums in this file is 398830
now scanning bins- please be patient...
bin value 0 found 786135 times
bin value 1 found 188350 times
bin value 2 found 48948 times
bin value 3 found 11125 times
bin value 4 found 8550 times
bin value 5 found 2511 times
bin value 6 found 992 times
bin value 7 found 464 times
bin value 8 found 470 times
bin value 9 found 240 times
bin value 10 found 140 times
bin value 11 found 104 times
bin value 12 found 77 times
bin value 13 found 65 times
bin value 14 found 46 times
bin value 15 found 47 times
bin value 16 found 32 times
bin value 17 found 36 times
bin value 18 found 19 times
bin value 19 found 17 times
bin value 20 found 30 times
bin value 21 found 11 times
bin value 22 found 14 times
bin value 23 found 8 times
bin value 24 found 7 times
bin value 25 found 7 times
bin value 26 found 6 times
bin value 27 found 10 times
bin value 28 found 9 times
bin value 29 found 7 times
bin value 30 found 6 times
bin value 31 found 6 times
bin value 32 found 5 times
bin value 33 found 2 times
bin value 34 found 5 times
bin value 35 found 2 times
bin value 36 found 6 times
bin value 37 found 5 times
bin value 38 found 2 times
bin value 39 found 2 times
bin value 40 found 4 times
bin value 41 found 2 times
bin value 43 found 3 times
bin value 44 found 1 times
bin value 46 found 3 times
bin value 47 found 1 times
bin value 50 found 2 times
bin value 52 found 3 times
bin value 53 found 3 times
bin value 55 found 1 times
bin value 56 found 3 times
bin value 58 found 1 times
bin value 60 found 1 times
bin value 62 found 1 times
bin value 64 found 1 times
bin value 69 found 1 times
bin value 73 found 1 times
bin value 74 found 1 times
bin value 76 found 1 times
bin value 77 found 1 times
bin value 89 found 1 times
bin value 90 found 2 times
bin value 103 found 1 times
bin value 105 found 2 times
bin value 116 found 1 times
bin value 121 found 1 times
bin value 130 found 1 times
bin value 143 found 1 times
bin value 146 found 1 times
bin value 157 found 1 times
bin value 171 found 1 times
bin value 175 found 2 times
bin value 189 found 1 times
bin value 208 found 1 times
bin value 255 found 7 times
Sparse spectra file nonspam.css has 1048577 bins total
total number of hash datums in this file is 299527
now scanning bins- please be patient...
bin value 0 found 819494 times
bin value 1 found 187269 times
bin value 2 found 31009 times
bin value 3 found 7158 times
bin value 4 found 1776 times
bin value 5 found 614 times
bin value 6 found 371 times
bin value 7 found 165 times
bin value 8 found 100 times
bin value 9 found 76 times
bin value 10 found 74 times
bin value 11 found 46 times
bin value 12 found 46 times
bin value 13 found 29 times
bin value 14 found 46 times
bin value 15 found 53 times
bin value 16 found 38 times
bin value 17 found 16 times
bin value 18 found 24 times
bin value 19 found 9 times
bin value 20 found 5 times
bin value 21 found 11 times
bin value 22 found 7 times
bin value 23 found 13 times
bin value 24 found 5 times
bin value 25 found 6 times
bin value 26 found 6 times
bin value 27 found 5 times
bin value 28 found 3 times
bin value 29 found 3 times
bin value 30 found 10 times
bin value 31 found 5 times
bin value 32 found 4 times
bin value 33 found 4 times
bin value 34 found 3 times
bin value 35 found 3 times
bin value 36 found 5 times
bin value 37 found 2 times
bin value 38 found 3 times
bin value 39 found 3 times
bin value 40 found 2 times
bin value 41 found 2 times
bin value 45 found 1 times
bin value 46 found 2 times
bin value 48 found 3 times
bin value 49 found 3 times
bin value 50 found 1 times
bin value 51 found 1 times
bin value 52 found 2 times
bin value 54 found 1 times
bin value 55 found 1 times
bin value 56 found 1 times
bin value 57 found 1 times
bin value 58 found 1 times
bin value 59 found 1 times
bin value 60 found 1 times
bin value 64 found 1 times
bin value 66 found 1 times
bin value 67 found 1 times
bin value 71 found 2 times
bin value 72 found 1 times
bin value 74 found 1 times
bin value 75 found 1 times
bin value 78 found 1 times
bin value 79 found 1 times
bin value 80 found 2 times
bin value 82 found 2 times
bin value 83 found 1 times
bin value 86 found 1 times
bin value 95 found 1 times
bin value 102 found 1 times
bin value 104 found 1 times
bin value 113 found 1 times
bin value 122 found 1 times
bin value 138 found 1 times
bin value 164 found 1 times
bin value 169 found 1 times
bin value 173 found 1 times
bin value 183 found 1 times
bin value 189 found 1 times
bin value 222 found 1 times
bin value 254 found 1 times
bin value 255 found 5 times
Enter bin value to zeroize, or 0 to exit: