[Spambayes] smart spam

Richard Jowsey richard at jowsey.com
Sat Apr 5 11:48:13 EST 2003

> >You betcha!
> What's that do to the size of your database?

There aren't enough of these mini-spams coming down the pipe to bloat 
the database. Considering there are only, say, 500-1000 distinct words 
on these "spurped" sites, and many of the words are already known to 
the database, it's just the same as an email with a 5-10k text 

Currently, my spam + good + virus databases (including hapaxes) are 
still under 15 MB total size. That's from an initial training corpus 
of about 15k good, 50k spam, and around 15k messages through the beta 
proxy in the past month...

   Good messages:         21,676
   Unique good words:    326,001
   Total good words:  11,446,259
   Datafile size (KB):     2,898

   Spam messages:         59,090
   Unique spam words:    770,800
   Total spam words:  36,973,011
   Datafile size (KB):    11,083
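
As a rough sanity check on those figures, here is a small Python sketch that works out the storage cost per unique word and the combined good + spam size. The numbers are copied from the table above; the dictionary layout and the helper name bytes_per_unique_word are made up for illustration, and it assumes the datafile sizes are kilobytes of 1024 bytes each.

```python
# Figures copied from the statistics above.
# Assumption: datafile sizes are kilobytes of 1024 bytes each.
stats = {
    "good": {"messages": 21_676, "unique_words": 326_001, "size_kb": 2_898},
    "spam": {"messages": 59_090, "unique_words": 770_800, "size_kb": 11_083},
}


def bytes_per_unique_word(entry):
    """Average on-disk bytes per distinct word (hypothetical helper)."""
    return entry["size_kb"] * 1024 / entry["unique_words"]


# Combined size of the two datafiles listed in the table.
total_kb = sum(entry["size_kb"] for entry in stats.values())

for name, entry in stats.items():
    print(f"{name}: {bytes_per_unique_word(entry):.1f} bytes per unique word")
print(f"good + spam: {total_kb:,} KB ({total_kb / 1024:.1f} MB)")
```

On these numbers the good and spam datafiles come to 13,981 KB together, or about 13.7 MB, which fits the "under 15 MB" total once the virus database (size not listed here) is added.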


