[Spambayes] Re: CRM114 in November breaks 99.9%. :-)

Bill Yerazunis wsy at merl.com
Tue Dec 3 14:51:59 2002


   From: Brian Burton <brian@burton-computer.com>

   > Training a particular incarnation of CRM114 usually takes a week or
   > two; I read my mail (both categories) and when I find a piece of mail
   > misclassified, I train that one piece into the filter.

   Training only on errors after a cut-off point is interesting.  Why do you 
   do this?  Is there a reason not to increment the good/spam counts for terms 
   in every email?  Is it to avoid overflowing the counts in your hash table 
   or is this likely to be more accurate since it keeps the message counts 
   small?

The reason I started doing it is that I used "unsigned char" as the
counters in the big hash tables, to keep them as small as reasonable
(remember, we're doing really _random_ accesses of these files and we
thrash virtual memory and cache like crazy).  The bin incrementer is
"smart" in that it won't wrap past 255, but it is losing data at that
point, and losing it on the _most_ significant features.
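
For anyone who wants to see the "won't wrap past 255" behaviour
concretely, here is a minimal sketch in Python.  It is not the CRM114
source; the names bump_bin and MAX_COUNT are made up for illustration,
and a bytearray stands in for the array of unsigned char bins.

```python
MAX_COUNT = 255  # largest value an 8-bit "unsigned char" bin can hold


def bump_bin(bins: bytearray, index: int) -> bool:
    """Increment one bin, saturating at MAX_COUNT instead of wrapping to 0.

    Returns False when the bin was already full, i.e. this hit was lost.
    """
    if bins[index] >= MAX_COUNT:
        return False
    bins[index] += 1
    return True


bins = bytearray(8)      # tiny stand-in for the real million-bin table
for _ in range(300):     # 300 hits on the same (hypothetical) feature
    bump_bin(bins, 3)
```

After the loop the bin sits at 255 and the last 45 hits are simply
dropped, which is the data loss on the most frequent features
described above.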

I did consider "uncorking" the values up to unsigned int16, but I
haven't had a good justification to do that yet.  It's a simple change
and if there's a need, it'll happen.

   > After a couple of days the errors get very sparse; after two or three
   > weeks, I "go for data" and that's what gets reported in the monthlies.

   Perhaps I misunderstand, but doesn't that mean that you are training up to 
   a desirable accuracy before beginning to measure your accuracy?  Is the 
   transition from training to performance measurement based on a 
   predetermined arbitrary cut-off (e.g. 1,000 emails, x% of messages in 
   corpus, or 14 calendar days of training) or based on the accuracy rising to 
   a certain level?

It's judged intuitively: I switch when I find I'm just not getting
enough errors to keep my attention on training.  This _is_
human-guided training, mind you.  Other influences on when to start
are "it's the start of November, start getting data" and "now that the
BCR has that nasty underflow problem fixed and the data has settled
down, let's get numbers".
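
The train-only-on-errors loop can be sketched roughly as follows.
This is an illustration only: the token-count scorer is a toy stand-in
and not CRM114's classifier, and the function names and sample
messages are invented.

```python
from collections import Counter


def classify(text, spam_counts, good_counts):
    """Toy scorer: the class that has seen more of the tokens wins (ties -> good)."""
    tokens = text.split()
    spam_score = sum(spam_counts[t] for t in tokens)
    good_score = sum(good_counts[t] for t in tokens)
    return "spam" if spam_score > good_score else "good"


def train_on_errors(stream, spam_counts, good_counts):
    """Learn a message only when the filter got it wrong; return the error count."""
    errors = 0
    for text, truth in stream:
        if classify(text, spam_counts, good_counts) != truth:
            errors += 1
            target = spam_counts if truth == "spam" else good_counts
            target.update(text.split())
    return errors


spam_counts, good_counts = Counter(), Counter()
mail = [("buy pills now", "spam"), ("lunch at noon", "good"),
        ("buy cheap pills", "spam"), ("lunch tomorrow", "good")]
first_pass = train_on_errors(mail, spam_counts, good_counts)
second_pass = train_on_errors(mail, spam_counts, good_counts)
```

Correctly classified mail never touches the counters, so the counts
grow only as fast as the error rate, and the errors thin out from one
pass to the next.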

The other issue that can't be dodged is that spam is not ergodic; spam
evolves in fits and starts; my spam of 1996 is very different from my
spam of 2002.  Any filter that is trained and tested against static
data is operating "in vitro": a necessary and useful scientific
measure, but one that misses the point of how well a spam filter can
retrain on the fly against evolution in action.

The training period coincidentally works out to a bit over two weeks,
and co-coincidentally I usually have just a few bins in the hash table
maxing out about then.  (Right now I've got 7 bins out of a million
maxed out in the spam hashtable, and 5 bins out of a million maxed out
in the nonspam hashtable.)  If I were to find that I was maxing out a
significant number of bins (say, hundreds) I'd rebuild with unsigned
int16 bins and accept the performance hit.
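
A check of that kind could look something like the sketch below.  It
assumes the bins are available as a plain sequence of 8-bit values;
the function names, the sample data, and the threshold of 100 (my
reading of "hundreds") are all assumptions, not part of CRM114.

```python
from collections import Counter

MAX_COUNT = 255  # saturation value for 8-bit bins


def bin_histogram(bins: bytes) -> Counter:
    """Map each bin value to the number of bins holding it."""
    return Counter(bins)


def should_widen(bins: bytes, limit: int = 100) -> bool:
    """True once at least `limit` bins are pinned at MAX_COUNT."""
    return bin_histogram(bins)[MAX_COUNT] >= limit


bins = bytes([0] * 10 + [1] * 4 + [255] * 2)   # made-up data, not a real .css file
for value, count in sorted(bin_histogram(bins).items()):
    print(f"bin value {value} found {count} times")
```

With only two saturated bins in the sample, should_widen(bins) stays
False, the same call I'm making by eye with 7 and 5 maxed-out bins.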

(Yes, this is a very "engineering"-style approach; I'm not a good
mathematician, so I just do experiments and report on what comes back.)

For those of you with exceptionally high boredom thresholds, the
current under-test spectra histograms follow.  They do exhibit a
comforting long distribution tail.

	-Bill Y.

    Sparse spectra file spam.css has 1048577 bins total
    total number of hash datums in this file is 398830
    now scanning bins- please be patient...
    bin value 0 found 786135 times
    bin value 1 found 188350 times
    bin value 2 found 48948 times
    bin value 3 found 11125 times
    bin value 4 found 8550 times
    bin value 5 found 2511 times
    bin value 6 found 992 times
    bin value 7 found 464 times
    bin value 8 found 470 times
    bin value 9 found 240 times
    bin value 10 found 140 times
    bin value 11 found 104 times
    bin value 12 found 77 times
    bin value 13 found 65 times
    bin value 14 found 46 times
    bin value 15 found 47 times
    bin value 16 found 32 times
    bin value 17 found 36 times
    bin value 18 found 19 times
    bin value 19 found 17 times
    bin value 20 found 30 times
    bin value 21 found 11 times
    bin value 22 found 14 times
    bin value 23 found 8 times
    bin value 24 found 7 times
    bin value 25 found 7 times
    bin value 26 found 6 times
    bin value 27 found 10 times
    bin value 28 found 9 times
    bin value 29 found 7 times
    bin value 30 found 6 times
    bin value 31 found 6 times
    bin value 32 found 5 times
    bin value 33 found 2 times
    bin value 34 found 5 times
    bin value 35 found 2 times
    bin value 36 found 6 times
    bin value 37 found 5 times
    bin value 38 found 2 times
    bin value 39 found 2 times
    bin value 40 found 4 times
    bin value 41 found 2 times
    bin value 43 found 3 times
    bin value 44 found 1 times
    bin value 46 found 3 times
    bin value 47 found 1 times
    bin value 50 found 2 times
    bin value 52 found 3 times
    bin value 53 found 3 times
    bin value 55 found 1 times
    bin value 56 found 3 times
    bin value 58 found 1 times
    bin value 60 found 1 times
    bin value 62 found 1 times
    bin value 64 found 1 times
    bin value 69 found 1 times
    bin value 73 found 1 times
    bin value 74 found 1 times
    bin value 76 found 1 times
    bin value 77 found 1 times
    bin value 89 found 1 times
    bin value 90 found 2 times
    bin value 103 found 1 times
    bin value 105 found 2 times
    bin value 116 found 1 times
    bin value 121 found 1 times
    bin value 130 found 1 times
    bin value 143 found 1 times
    bin value 146 found 1 times
    bin value 157 found 1 times
    bin value 171 found 1 times
    bin value 175 found 2 times
    bin value 189 found 1 times
    bin value 208 found 1 times
    bin value 255 found 7 times



    Sparse spectra file nonspam.css has 1048577 bins total
    total number of hash datums in this file is 299527
    now scanning bins- please be patient...
    bin value 0 found 819494 times
    bin value 1 found 187269 times
    bin value 2 found 31009 times
    bin value 3 found 7158 times
    bin value 4 found 1776 times
    bin value 5 found 614 times
    bin value 6 found 371 times
    bin value 7 found 165 times
    bin value 8 found 100 times
    bin value 9 found 76 times
    bin value 10 found 74 times
    bin value 11 found 46 times
    bin value 12 found 46 times
    bin value 13 found 29 times
    bin value 14 found 46 times
    bin value 15 found 53 times
    bin value 16 found 38 times
    bin value 17 found 16 times
    bin value 18 found 24 times
    bin value 19 found 9 times
    bin value 20 found 5 times
    bin value 21 found 11 times
    bin value 22 found 7 times
    bin value 23 found 13 times
    bin value 24 found 5 times
    bin value 25 found 6 times
    bin value 26 found 6 times
    bin value 27 found 5 times
    bin value 28 found 3 times
    bin value 29 found 3 times
    bin value 30 found 10 times
    bin value 31 found 5 times
    bin value 32 found 4 times
    bin value 33 found 4 times
    bin value 34 found 3 times
    bin value 35 found 3 times
    bin value 36 found 5 times
    bin value 37 found 2 times
    bin value 38 found 3 times
    bin value 39 found 3 times
    bin value 40 found 2 times
    bin value 41 found 2 times
    bin value 45 found 1 times
    bin value 46 found 2 times
    bin value 48 found 3 times
    bin value 49 found 3 times
    bin value 50 found 1 times
    bin value 51 found 1 times
    bin value 52 found 2 times
    bin value 54 found 1 times
    bin value 55 found 1 times
    bin value 56 found 1 times
    bin value 57 found 1 times
    bin value 58 found 1 times
    bin value 59 found 1 times
    bin value 60 found 1 times
    bin value 64 found 1 times
    bin value 66 found 1 times
    bin value 67 found 1 times
    bin value 71 found 2 times
    bin value 72 found 1 times
    bin value 74 found 1 times
    bin value 75 found 1 times
    bin value 78 found 1 times
    bin value 79 found 1 times
    bin value 80 found 2 times
    bin value 82 found 2 times
    bin value 83 found 1 times
    bin value 86 found 1 times
    bin value 95 found 1 times
    bin value 102 found 1 times
    bin value 104 found 1 times
    bin value 113 found 1 times
    bin value 122 found 1 times
    bin value 138 found 1 times
    bin value 164 found 1 times
    bin value 169 found 1 times
    bin value 173 found 1 times
    bin value 183 found 1 times
    bin value 189 found 1 times
    bin value 222 found 1 times
    bin value 254 found 1 times
    bin value 255 found 5 times
   Enter bin value to zeroize, or 0 to exit: 


 





