[Spambayes] RE: For the bold

Rob Hooft rob@hooft.net
Sun, 06 Oct 2002 19:40:25 +0200


Tim Peters wrote:
> [Rob Hooft]
> 
>>...
>>Tim: It does look like your messages are a bit easier to classify than
>>mine....
> 
> 
> I don't know.  The results I reported were:
> 
> 
>>Here's a use_central_limit2 run with max_discriminators=50, trained
>>on 5000 ham and 5000 spam, then predicting against 7500 of each
> 
> 
> and all runs were on the same set of msgs.
> 
> The last time you mentioned how "big" your tests are was:
> 
> 
>>I focussed for our night on optimizing the max_discriminators for
>>clt2 using 10x(200+200) messages out of my corpses,
> 
> 
> I'm not sure exactly what 10x(200+200) means, but at the plausible extremes
> it means your classifiers were trained on 200 on each, or on 1800 of each.
> So at worst, my classifier was trained on 3x as much data, and at best on
> 25x as much data.  Error rates certainly improve with more training data,
> albeit slowly.

I had 10 sets each of ham and spam, each set containing 200 
messages drawn from a total reservoir of ~17000 ham and 7500 spam. This 
subset was heavy enough for this optimization: it took about 24 hours 
of computation to get that analysis done.
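
A minimal sketch of how 10 disjoint sets of 200 messages could be drawn 
from a larger reservoir. It assumes the sets were picked at random without 
overlap, which the description above does not actually state; the function 
name, the seed and the stand-in reservoir are all hypothetical.

```python
import random

def draw_disjoint_sets(reservoir, n_sets=10, set_size=200, seed=0):
    """Return n_sets non-overlapping random subsets of a list of messages."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    # Sampling without replacement guarantees no message appears twice.
    picked = rng.sample(reservoir, n_sets * set_size)
    return [picked[i * set_size:(i + 1) * set_size] for i in range(n_sets)]

# Stand-in for ~17000 ham messages: integers instead of real mail.
ham_sets = draw_disjoint_sets(list(range(17000)))
print(len(ham_sets), len(ham_sets[0]))
```

The same call with the spam reservoir would give the matching spam sets.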

> OTOH, later you showed output saying
> 
> 
>>Reading climbig12.pk ...
>>Nham= 12800
>>RmsZham= 2.76178782393
>>Nspam= 5600
> 
> 
> so at *some* point you stopped predicting against equal amounts of ham and
> spam, but there's no way to guess how much was trained on for that result.

At that point I had 10 sets: each ham set contained 1600 hams, and each 
spam set 700 spams. I was using 2 sets each to train, and 8 to analyse.
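
As a quick arithmetic check, these set sizes reproduce the Nham and Nspam 
figures quoted above. The variable names are made up for illustration; only 
the numbers come from this thread.

```python
# Set sizes as described: 10 sets, 2 used for training, 8 for analysis.
N_SETS = 10
TRAIN_SETS = 2
HAM_PER_SET = 1600
SPAM_PER_SET = 700

test_sets = N_SETS - TRAIN_SETS
nham_tested = test_sets * HAM_PER_SET     # messages predicted against
nspam_tested = test_sets * SPAM_PER_SET
nham_trained = TRAIN_SETS * HAM_PER_SET   # messages trained on
nspam_trained = TRAIN_SETS * SPAM_PER_SET

print(nham_tested, nspam_tested)    # 12800 5600
print(nham_trained, nspam_trained)  # 3200 1400
```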

Since that time I have cleaned up the spam corpus by looking for 
duplicate "Date:" headers, and removed ~1300 spams that were identical 
(only sent to different addresses). I think this is a useful thing to do, 
because it keeps the same spam from ending up in both the training and 
the test set. The "Message-ID" sort I did in the beginning didn't help 
all that much, because in lots of these spams the message-id was not 
added by the spammer (so identical copies end up with different ids).
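
A minimal sketch of that kind of de-duplication pass, assuming the messages 
are available as raw RFC 822 strings. It treats any two messages with the 
same "Date:" value as copies of one spam, which is a simplification: two 
unrelated messages can share a timestamp, so a real run would want a manual 
look at the matches. The function name is hypothetical and this is not 
necessarily the script that was used here.

```python
from email import message_from_string

def drop_duplicate_dates(raw_messages):
    """Keep only the first message seen for each distinct Date: value."""
    seen = set()
    kept = []
    for raw in raw_messages:
        date = message_from_string(raw).get("Date")
        if date is None:
            kept.append(raw)  # no Date: header, so nothing to compare on
            continue
        key = date.strip()
        if key in seen:
            continue  # same Date: as an earlier message; assume a copy
        seen.add(key)
        kept.append(raw)
    return kept

spams = [
    "Date: Sun, 06 Oct 2002 19:40:25 +0200\nTo: a@example.com\n\nBuy now",
    "Date: Sun, 06 Oct 2002 19:40:25 +0200\nTo: b@example.com\n\nBuy now",
    "Subject: no date header\n\nhello",
]
print(len(drop_duplicate_dates(spams)))
```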

I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I 
now train on sets 1,2,3,4,5 and analyse sets 6,7,8,9,10; a second test 
trains on sets 6,7,8,9,10 and analyses sets 1,2,3,4,5.
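
The two-way swap can be sketched as follows. The helper name is 
hypothetical; only the set numbers and sizes come from the description 
above.

```python
def two_way_splits(n_sets=10):
    """Return (train, test) set numbers for both halves of the swap."""
    ids = list(range(1, n_sets + 1))
    half = n_sets // 2
    first, second = ids[:half], ids[half:]
    return [(first, second), (second, first)]

HAM_PER_SET = 1600
SPAM_PER_SET = 560

for train, test in two_way_splits():
    # Each run trains on 5 sets and analyses the other 5.
    print("train on", train, "analyse", test,
          "| ham trained:", len(train) * HAM_PER_SET,
          "| spam trained:", len(train) * SPAM_PER_SET)
```

Every message is then used exactly once for training and once for 
analysis across the two runs.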

> That said, I expect my ham is easier than most, because newsgroup traffic
> almost never contains personal msgs -- no screaming red HTML birthday wishes
> from 9-year-old nieces, no confirmations of payment received, no opt-in
> marketing newsletters, no chain letters forwarded from naive brothers, etc.

Exactly. I find that my ham is very diverse. Besides all the things you 
mentioned, I had (but removed) communications with postmasters about 
early-day spam that was sent using their machines. I am also using some 
ham from my previous job. There is not a lot of mailing list traffic, 
because I am no longer storing all that. There are lots of customer 
e-mails from people with many different computer backgrounds. I removed 
all the viruses.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/