[Spambayes] training WAS: aging information
tim.one at comcast.net
Wed Feb 19 20:19:43 EST 2003
> Thanks for (re)posting this link, certainly interesting reading. Were
> these done before or after the experimental_ham_spam_in_balance code?
[T. Alexander Popiel]
> Before; I like to think that my results were in part responsible for
> getting that option added.
They certainly were. At that time, tests on my 35,000 msg corpora were
already too good to show any improvement (by any means), so all I could say
for sure is that adding the option didn't hurt my main test's results.
Some brief experiments on lopsided subsets suggested it would help.
Sjoerd reported stronger positive results on his real-life test data.
Someone later (Anthony?) reported negative results, but staring at the data
I didn't immediately agree they were significant results, and ran out of
time to argue the issue. So it remained an option.
> Well, as long as the 300 ham chosen are actually representative of
> the types of ham you get, I don't see any harm in only using 300.
> I don't have the math or the experimental results to back that up,
My home email classifier is still trained on fewer than 1,000 msgs total,
about 40/60 ham/spam. Since I get about 600 emails per day, this is less
than two days' traffic. I get a few (2 to 10) Unsures each day, but they're
generally so unusual I don't bother to train on them. At least half the
time, I'm not sure whether they're ham or spam either and just delete them
with a shrug.
Cool: last week I got signed up on some commercial spam mailing list, along
with hundreds of others, and of course this triggered a near-endless cascade
of newbies posting outraged msgs to the list demanding to be taken off, then
other newbies demanding to know why the first batch was accusing them of
sending spam, etc etc etc. I had to train on 3 of those before they
reliably moved from Unsure to Spam (the header clues were great; the msg
bodies were hopeless), and was spared perhaps 600 more of these things.
Note that this stuff wasn't really spam by most meanings of the word: it
was sent by real people, and was not automated. I still love that spambayes
believes whatever I tell it to believe!
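For readers who want to see the train-on-Unsure cycle in miniature, here is a rough Python sketch. It is a toy token-frequency scorer, not the spambayes classifier. The names ToyClassifier and trainings_until_spam, the cutoffs, the averaging rule, the smoothing constants, and the token strings are all assumptions made up for illustration. It only shows how a handful of strong header tokens can pull a message from Unsure to Spam after a couple of trainings, even when the body tokens look like ham.

```python
from collections import Counter

# All constants below are illustrative assumptions, not spambayes defaults.
HAM_CUTOFF = 0.20      # score at or below this is labelled "ham"
SPAM_CUTOFF = 0.70     # score at or above this is labelled "spam"
PRIOR_STRENGTH = 0.45  # how strongly rare tokens are pulled toward the prior
PRIOR_PROB = 0.5       # score given to a token never seen in training


class ToyClassifier:
    """Toy scorer that averages smoothed per-token spam probabilities."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        # Count each token at most once per message.
        if is_spam:
            self.spam_counts.update(set(tokens))
            self.nspam += 1
        else:
            self.ham_counts.update(set(tokens))
            self.nham += 1

    def token_prob(self, token):
        n = self.spam_counts[token] + self.ham_counts[token]
        if n == 0:
            return PRIOR_PROB
        # Use per-class ratios so a lopsided ham/spam mix does not dominate.
        spam_ratio = self.spam_counts[token] / self.nspam if self.nspam else 0.0
        ham_ratio = self.ham_counts[token] / self.nham if self.nham else 0.0
        p = spam_ratio / (spam_ratio + ham_ratio)
        # Shrink toward the prior when the token has been seen only a few times.
        return (PRIOR_STRENGTH * PRIOR_PROB + n * p) / (PRIOR_STRENGTH + n)

    def score(self, tokens):
        probs = [self.token_prob(t) for t in set(tokens)]
        return sum(probs) / len(probs) if probs else PRIOR_PROB

    def label(self, tokens):
        s = self.score(tokens)
        if s >= SPAM_CUTOFF:
            return "spam"
        if s <= HAM_CUTOFF:
            return "ham"
        return "unsure"


def trainings_until_spam(clf, message, limit=10):
    """Train on `message` as spam until it is labelled spam; return the count."""
    count = 0
    while clf.label(message) != "spam" and count < limit:
        clf.train(message, is_spam=True)
        count += 1
    return count


clf = ToyClassifier()
for msg in (["meeting", "list", "please"], ["patch", "list", "thanks"]):
    clf.train(msg, is_spam=False)
for msg in (["buy", "cheap", "pills"], ["cheap", "offer", "now"]):
    clf.train(msg, is_spam=True)

# Hypothetical "take me off this list" message: novel headers, hammy body.
storm = ["hdr:bulk-sender", "hdr:promo-list-id", "please", "remove", "list"]
before = clf.label(storm)
needed = trainings_until_spam(clf, storm)
after = clf.label(storm)
print(before, needed, after)
```

The real classifier combines token evidence quite differently, so the number of trainings needed there will not match this toy.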