[Spambayes] Introducing myself

T. Alexander Popiel popiel@wolfskeep.com
Tue Nov 12 02:47:57 2002


In message:  <a05200f3fb9f60c3ccb9e@[192.168.1.103]>
             Robert Woodhead <trebor@animeigo.com> writes:
>
>My hunch, based on things I've done in the past, is that as the total 
>volume of mail increases, the rate of increase in the number of 
>unique tokens will approach a limit (that being, the number of 
>distinct individual words in the language, though foreign unicode 
>gibberish will have an effect).  When I was doing single word 
>analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 
>300,000 distinct tokens (including the aforementioned gibberish).

Rob Hooft recently (yesterday, that is) did a nice analysis and
graph of database growth based on message count.  He found it
scaled almost linearly with the sqrt of the number of messages...
but he only went up to a total of about 22000 messages, which is
likely only about a fifth of a gig.
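
To make the square-root observation concrete: if the token count T
grows like c * N**a in the message count N, the exponent a can be
estimated from any two (N, T) measurements.  The sketch below is only
an illustration; growth_exponent and extrapolate are hypothetical
helpers, and the numbers in the check are synthetic, not Rob's data.

```python
import math

def growth_exponent(n1, t1, n2, t2):
    """Estimate a in T ~ c * N**a from two (messages, tokens) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)

def extrapolate(n1, t1, n2, exponent=0.5):
    """Project the token count at n2 messages from one measurement,
    assuming the power law keeps holding (a big assumption)."""
    return t1 * (n2 / n1) ** exponent

# Synthetic check: counts generated from an exact square-root law.
a = growth_exponent(1000, 100 * math.sqrt(1000),
                    16000, 100 * math.sqrt(16000))
```

Under a square-root law, quadrupling the message count only doubles
the database size, which is the slow growth Rob's graph suggests.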

>It will be interesting to see the results of some data reduction on 
>the accuracy of the recogniser.  My WAG is that even some serious 
>hashing (down to, say, 20 bit tokens) won't have much effect on 
>accuracy because most of the collisions will be between low 
>frequency, insignificant tokens.

Tim Peters did some hashing experiments back on 3 Nov; he posted these
results:

OK, doing a 10-fold cross-validation run across 2000 random ham and 2000
random spam, but the same random sets for "before" and "after":

filename:      before       crm
ham:spam:   2000:2000 2000:2000
fp total:           1      1604
fp %:            0.05     80.20
fn total:           0         0
fn %:            0.00      0.00
unsure t:          20         0
unsure %:        0.50      0.00
real cost:     $14.00 $16040.00
best cost:      $2.00   $228.00
h mean:          0.55     53.54
h sdev:          4.50      5.30
s mean:         99.91     71.40
s sdev:          1.64      6.84
mean diff:      99.36     17.86
k:              16.18      1.47
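
The percentage and "real cost" rows can be reproduced from the raw
counts.  The sketch below assumes cost weights of $10 per false
positive, $1 per false negative and $0.20 per unsure; those weights
are my assumption, not something stated in the table, though they do
reproduce both real-cost figures.  The function name is made up for
illustration.

```python
def summarize(n_ham, n_spam, fp, fn, unsure,
              fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Recompute the rate and cost rows from raw counts.

    The cost weights are assumed values, chosen because they match
    the real-cost figures in the table.
    """
    return {
        "fp %": 100.0 * fp / n_ham,                      # share of ham misfiled
        "fn %": 100.0 * fn / n_spam,                     # share of spam missed
        "unsure %": 100.0 * unsure / (n_ham + n_spam),   # share of all mail
        "real cost": fp * fp_cost + fn * fn_cost + unsure * unsure_cost,
    }

before = summarize(2000, 2000, fp=1, fn=0, unsure=20)
crm = summarize(2000, 2000, fp=1604, fn=0, unsure=0)
```

With those weights the crm cost is entirely false positives
(1604 * $10), which is why the two columns differ by three orders of
magnitude.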

Granted, he was doing more complex word combinations with this, too,
and a different combining technique, but it really doesn't look
promising.
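
For reference, the kind of data reduction Robert describes can be
sketched as folding each token into a 20-bit bucket and counting how
many distinct tokens end up sharing one.  This is not what Tim's
experiment did; it is a minimal illustration, and the use of CRC32
and the names below are my own assumptions.

```python
import zlib
from collections import Counter

BITS = 20
MASK = (1 << BITS) - 1          # 2**20 - 1, i.e. 1,048,576 buckets

def bucket(token, mask=MASK):
    """Map a token string to an integer bucket in [0, mask]."""
    return zlib.crc32(token.encode("utf-8")) & mask

def collision_stats(tokens, mask=MASK):
    """Return (distinct tokens, distinct buckets) for an iterable."""
    distinct = set(tokens)
    buckets = Counter(bucket(t, mask) for t in distinct)
    return len(distinct), len(buckets)
```

Running collision_stats over a real token list would show how many of
the roughly 300,000 distinct tokens Robert mentions survive as
separate buckets; whether the merged ones really are the insignificant
ones is the part that needs testing against the classifier.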

- Alex


