[Spambayes] Hand tuning the database?

Webb Scales scales at zko.dec.com
Mon Feb 9 18:44:20 EST 2004

A friend of mine, looking at the "evidence" header in one of my mail messages,
asked a "simple" question:  what about the idea of hand-tuning the database?

I shouldn't be criticizing, as SpamBayes has been doing a very good job so far
(on just the initial training, it's had no misclassified ham, only two spam
rated as ham, and only a dozen messages rated as unsure, all of which were
spam), but I thought I'd ask anyway.  ;-)

The evidence header has entries like "'received:ztxmail01.ztx.compaq.com':
0.62".  (I told it to mine the headers.)  Now, I believe that
ztxmail01.ztx.compaq.com handles all my mail.  (OK, it doesn't handle *all* of
my mail -- it's got a couple of brothers and a dozen cousins who share in the
load, but you get the point.)  So, the presence of this token in my mail
message is not indicative of anything (other than the fact that the thing
being looked at is a "mail message"! ;-).

What do you guys think of the idea of being able to mark certain terms in the
database as being "not interesting"?  (Of course, we would need a tool or
tool-set to be able to do this, but....)

The reason I ask is that, if the classifier is only going to consider 100
terms, I'd like it to be considering good ones, as opposed to things that are
in every mail message that I get, spam and ham alike.
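
To make the concern concrete, here is a rough sketch of what I have in mind.
It is not SpamBayes code: it simply assumes the classifier keeps the tokens
whose spam probability is farthest from 0.5, up to some limit. The names
select_clues, max_clues, and ignored are hypothetical, and every probability
below except the 0.62 from my evidence header is made up for illustration.

```python
def select_clues(token_probs, max_clues=100, ignored=frozenset()):
    """Return up to max_clues (token, prob) pairs, strongest first.

    Assumption: "strength" is the distance of the spam probability from 0.5.
    Tokens listed in `ignored` (the hand-marked "not interesting" terms)
    are dropped before the strongest clues are chosen.
    """
    candidates = [(tok, p) for tok, p in token_probs.items()
                  if tok not in ignored]
    candidates.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return candidates[:max_clues]


# Illustrative data only; just the first entry comes from a real header.
probs = {
    "received:ztxmail01.ztx.compaq.com": 0.62,
    "subject:viagra": 0.99,
    "from:friend": 0.02,
    "url:example": 0.60,
}
ignored = {"received:ztxmail01.ztx.compaq.com"}

# With a limit of 3, the mail-server token takes a slot ...
print(select_clues(probs, max_clues=3))
# ... unless it has been marked as not interesting.
print(select_clues(probs, max_clues=3, ignored=ignored))
```

With the ignore set in place, the slot that went to the mail-server token
goes to the next-strongest token instead.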



Webb Scales                                Hewlett-Packard Company
scales at zko.dec.com                         110 Spit Brook Rd, ZKO2-3/N30
Voice: 603.884.2196, FAX: 603.884.0120     Nashua, NH 03062-2711
Someone who thinks logically provides a nice contrast to the real world.
