[Spambayes] What is spam?

Tim Peters tim.one@comcast.net
Tue, 17 Sep 2002 16:11:28 -0400


[Guido]
> Anything for which somebody hit the Send button should be classified
> as ham, no matter how objectionable the contents.

[Neale Pickett]
> Wow, is this the criteria other people are using?  That would radically
> alter my corpa, but I suspect I'd get much better FP and FN rates.

It's the criterion I'm using because this started as a project to filter the
tech mailing lists hosted at python.org.  So, e.g., in my ham set I keep the
long msg that did nothing but add a useless one-line comment to a full quote
of a previous Nigerian scam post.  It looks like that will always be a
"false positive" in my tests, although it's surprising that it's the *only*
message of its kind that still routinely shows up as a false positive.  The
others of that kind were sent by more frequent posters, quote only part of
the spam in question, and try to add some real content (if only a sarcastic
comment).  They get points for all that.

OTOH, when I run on smaller training subsets, sometimes that message gets
into the ham training, and as a consequence direct Nigerian scam messages
become false negatives.  Because HAMBIAS is 2 by default, it takes about 2
Nigerian scam messages in the spam training just to "cancel out" the quoted
Nigerian scam in the ham training set -- and it takes a lot more than 2 to
overcome the 0.90 "it's spam" cutoff at the end.
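
To make the "cancel out" arithmetic concrete, here is a minimal sketch of a Graham-style per-word spam probability with the ham count multiplied by HAMBIAS. Only HAMBIAS = 2 comes from the text above. The function name, the 0.01/0.99 clamping, the neutral 0.5 for unseen words, and the equal corpus sizes of 100 are assumptions for illustration, not the project's actual code. Note that the 0.90 cutoff applies to the combined score for a whole message, not to a single word, so this shows only the per-word half of the story.

```python
HAMBIAS = 2.0                      # from the text: ham counts are doubled by default
MIN_PROB, MAX_PROB = 0.01, 0.99    # assumption: Graham-style clamping


def word_spamprob(hamcount, spamcount, nham, nspam, hambias=HAMBIAS):
    """Hypothetical helper: spam probability of one word.

    hamcount/spamcount are how many ham/spam training messages
    contain the word; nham/nspam are the training set sizes.
    """
    ham_ratio = min(1.0, hambias * hamcount / nham)
    spam_ratio = min(1.0, spamcount / nspam)
    if ham_ratio + spam_ratio == 0.0:
        return 0.5                 # assumption: unseen words are neutral
    prob = spam_ratio / (ham_ratio + spam_ratio)
    return max(MIN_PROB, min(MAX_PROB, prob))


if __name__ == "__main__":
    # One quoted scam in the ham set, k direct scams in the spam set,
    # with equally sized training sets (an assumption).
    for k in (1, 2, 3, 10):
        print(k, round(word_spamprob(1, k, 100, 100), 3))
```

Under these assumptions a word seen in one ham message needs two spam sightings just to get back to a neutral 0.5, and even ten sightings leave it below 0.9.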

There's another problem here, though, which is that real people aren't going
to let Paul Graham (or even Guido <wink>) tell them what spam is.  I always
thought it was hilarious that SpamCop felt it needed to lecture would-be
reporters at length about their definition:

    http://spamcop.net/fom-serve/cache/125.html

For my personal use, all I want is something that shuffles probable spam
into a different folder.  I don't care much about the f-p rate, because I'm
going to review every "probable spam" msg by hand anyway.  The value of any
spam gimmick to me is just in keeping the most likely spam out of my main
inbox folder, so that the spam-deletion ritual doesn't interfere so much
with my normal workflow.  So for my own case I care more about reducing the
f-n rate, and consequently I would indeed file my sisters' forwarding of
junk email in my spam collection, and your grandmother's too.

BTW, this is another reason I suspect it's not going to work well to pool
users:  Greg has almost no tolerance for spam review, and I appear to be on
the opposite end of that scale.  No fixed policy can leave both of us happy,
and the more training data you have the harder it is to get any numbers
other than 0.0 and 1.0 out of this scheme.
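
A rough illustration of why the scores pile up at the extremes, assuming the Graham-style rule for combining per-word probabilities, prod(p) / (prod(p) + prod(1-p)). The function name and the sample clue lists are made up for the example. The link to training-set size is an inference, not something measured here: more training data tends to push individual word probabilities toward the 0.01 and 0.99 clamps, and the combining rule then amplifies them.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Hypothetical helper: Graham-style combination of word probabilities."""
    p = prod(probs)
    q = prod(1.0 - x for x in probs)
    return p / (p + q)


if __name__ == "__main__":
    # 15 clues that lean only mildly toward spam still combine to nearly 1.
    print(combine([0.6] * 15))
    # A near-even split of extreme clues (8 spammy, 7 hammy) is decided
    # entirely by the one unpaired clue.
    print(combine([0.99] * 8 + [0.01] * 7))
    # Only perfectly neutral clues stay in the middle.
    print(combine([0.5] * 15))
```

With a rule like this there is little room for a middle ground that a cautious user and a tolerant user could each set their own cutoff within.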