[Spambayes] SpamBayes and TREC
tim.peters at gmail.com
Mon Nov 21 02:33:52 CET 2005
>> How did SpamBayes perform in the TREC 2005 testing? Do you have any
> For a start, you can see the information here:
Lots of info, but kinda dry ;-)
> At some point during the registration process TREC latched on to
> "Massey University" (where I was working at the time, but completely
> uninvolved with SpamBayes) as my 'organisation name', so you may see
> that in some of the results. Just substitute "SpamBayes" for "Massey
> University" wherever you see it.
> I'll make my notebook paper available when I have a chance, and (once
> it's done) my proceedings paper.
> In brief, SpamBayes did better than I expected (towards the bottom of
> the top ten) considering that it is designed to classify as ham/
> unsure/spam, not ham/spam, and considering that I didn't make any
> special effort to change options, etc. (in fact, it seems that the
> best variant of SpamBayes was the out-of-the-box one), nor did I put
> any effort into determining what the single cutoff should be.
I'll note that SB is also designed far more to cater to individual
quirks than to any consensus view of what "ham" and "spam" are. That
makes it an excellent choice for urologists and gynecologists <wink>.
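
To make the ham/unsure/spam versus ham/spam distinction concrete, here is a
minimal sketch of the two decision rules applied to a spam score between 0
and 1. The function names are hypothetical (this is not the SpamBayes API),
and the cutoff values are assumptions chosen for illustration; the single
cutoff of 0.5 in particular is arbitrary, which is exactly the value nobody
tuned for the TREC runs.

```python
# Illustrative only: names and cutoff values are assumptions, not SpamBayes code.
HAM_CUTOFF = 0.20   # assumed: scores below this are called ham
SPAM_CUTOFF = 0.90  # assumed: scores at or above this are called spam


def classify_three_way(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Three-way rule: anything between the two cutoffs is 'unsure'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def classify_two_way(score, cutoff=0.5):
    """Two-way rule forced by a ham/spam-only evaluation: one cutoff, no 'unsure'."""
    return "spam" if score >= cutoff else "ham"


if __name__ == "__main__":
    for s in (0.05, 0.50, 0.95):
        print(s, classify_three_way(s), classify_two_way(s))
```

A message scoring 0.50 lands in "unsure" under the first rule but has to be
called one way or the other under the second, so where the single cutoff
sits directly trades false positives against false negatives.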
> What surprised me the most was that the train-on-everything variant
> seems to have performed the best. I'm still looking into this; I
> hope to have more details by the time the proceedings paper is finished.
In the TREC exposure I had in the speech recognition business, they
had a large set of test data and sent out a random sample for training
purposes. The test data was static (didn't change over time), so a
random sample could be expected to more-than-less faithfully represent
the statistics of the population as a whole. Same kind of thing here?
If so, TOE works best on static data (which doesn't exist in real
life <0.5 wink>).
Anyway, congratulations on getting through the process, Tony! The
thrill of being rich and famous fades over time, so enjoy it while
it's fresh ;-)