[Spambayes] RE: For the bold
Sun, 06 Oct 2002 19:40:25 +0200
Tim Peters wrote:
> [Rob Hooft]
>>Tim: It does look like your messages are a bit easier to classify than
> I don't know. The results I reported were:
>>Here's a use_central_limit2 run with max_discriminators=50, trained
>>on 5000 ham and 5000 spam, then predicting against 7500 of each
> and all runs were on the same set of msgs.
> The last time you mentioned how "big" your tests are was:
>>I focussed for our night on optimizing the max_discriminators for
>>clt2 using 10x(200+200) messages out of my corpses,
> I'm not sure exactly what 10x(200+200) means, but at the plausible extremes
> it means your classifiers were trained on 200 of each, or on 1800 of each.
> So at worst, my classifier was trained on 3x as much data, and at best on
> 25x as much data. Error rates certainly improve with more training data,
> albeit slowly.
I did have 10 sets each of ham and spam, each set containing 200
messages out of a total reservoir of ~17000 ham and 7500 spam. This
subset of everything was heavy enough for this optimization: it took
about 24 hours of calculating to get that analysis done....
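
As a rough illustration of the 10x(200+200) layout described above, here is a minimal sketch of drawing 10 disjoint sets of 200 messages from a larger reservoir. It assumes the sets are drawn at random without replacement, which the post does not actually state; draw_sets, the seed, and the integer stand-ins for messages are hypothetical names for this example only.

```python
import random


def draw_sets(reservoir, n_sets=10, set_size=200, seed=0):
    """Draw n_sets disjoint sets of set_size items from reservoir.

    Assumption: random sampling without replacement; the original
    selection method is not given in the post.
    """
    rng = random.Random(seed)
    picked = rng.sample(reservoir, n_sets * set_size)
    # Cut the sampled items into consecutive, non-overlapping chunks.
    return [picked[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


# Integers stand in for messages; sizes follow the reservoir figures above.
ham_sets = draw_sets(list(range(17000)))
spam_sets = draw_sets(list(range(7500)), seed=1)
```

Each call uses only 2000 messages of the reservoir, which is why this was a "subset of everything".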
> OTOH, later you showed output saying
>>Reading climbig12.pk ...
> so at *some* point you stopped predicting against equal amounts of ham and
> spam, but there's no way to guess how much was trained on for that result.
At that point, I had 10 sets, each ham set contained 1600 hams, and each
spam set 700 spams. I was using 2 sets each to train, and 8 to analyse.
Since that time I have cleaned out the spam body by looking for
duplicate "Date:" headers, and removed ~1300 spams that were identical
(only sent to different addresses). I think this is a useful thing to do,
because it keeps the same spam, delivered as two messages, from ending
up in both the training set and the test set. The "Message-ID" sort I
did in the beginning didn't help all that much, because lots of these
spams do not have their Message-ID added by the spammer.
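
The "Date:" check could be sketched roughly as follows, using Python's standard email package. This is a simplification and not the script actually used: it treats any two messages with the same Date: header as duplicates and keeps the first, whereas a careful cleanup would probably also compare the bodies before deleting anything. dedupe_by_date is a hypothetical name.

```python
from email import message_from_string


def dedupe_by_date(raw_messages):
    """Keep the first message seen for each distinct Date: header value.

    Assumption: an identical Date: header marks a duplicate spam.
    """
    seen = set()
    kept = []
    for raw in raw_messages:
        date = message_from_string(raw).get("Date")
        if date is None:
            # No Date: header to compare, so keep the message.
            kept.append(raw)
            continue
        key = date.strip()
        if key not in seen:
            seen.add(key)
            kept.append(raw)
    return kept
```

Keying on Date: rather than Message-ID fits the observation above: the Date: header tends to be identical across copies of one mailing, while the Message-ID often is not.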
I am currently using 10 sets of 1600 ham, and 10 sets of 560 spam. I am
now using 1,2,3,4,5 to train and 6,7,8,9,10 for analysis, and a second
test takes 6,7,8,9,10 to train and 1,2,3,4,5 for analysis.
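
The two runs just described amount to a split in halves with the roles swapped. A minimal sketch, with swapped_halves as a hypothetical helper and the per-set sizes taken from the paragraph above:

```python
def swapped_halves(n_sets=10):
    """Yield (train, test) lists of 1-based set numbers for the two runs."""
    half = n_sets // 2
    first = list(range(1, half + 1))
    second = list(range(half + 1, n_sets + 1))
    yield first, second   # train on 1-5, analyse 6-10
    yield second, first   # train on 6-10, analyse 1-5


HAM_PER_SET, SPAM_PER_SET = 1600, 560

for train, test in swapped_halves():
    n_ham = len(train) * HAM_PER_SET
    n_spam = len(train) * SPAM_PER_SET
    print("train", train, "test", test, "ham", n_ham, "spam", n_spam)
```

With these sizes each run trains on 8000 ham and 2800 spam and is scored on the same number of unseen messages.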
> That said, I expect my ham is easier than most, because newsgroup traffic
> almost never contains personal msgs -- no screaming red HTML birthday wishes
> from 9-year-old nieces, no confirmations of payment received, no opt-in
> marketing newsletters, no chain letters forwarded from naive brothers, etc.
Exactly. I find that my ham is very diverse. Besides all the things you
mentioned, I had (but removed) communications with postmasters over
early-day spam that was sent using their machines. And I am using some
ham from my previous job. There is not a lot of mailing list traffic,
because I am no longer storing all that. There are lots of customer
e-mails from people with many different computer backgrounds. I removed
all the viruses.
Rob W.W. Hooft || firstname.lastname@example.org || http://www.hooft.net/people/rob/