[Spambayes] What is spam?

Tim Peters tim.one@comcast.net
Mon, 16 Sep 2002 21:34:14 -0400


[Neale Pickett]
> My datasets aren't as pure as I thought :(

Nobody's has been -- it seems to come as a shock <wink>.

> While sorting through my FNs and FPs, I've found some trends:
>
> 1.  When people forward spam to me, it gets tagged as spam.  I have a
>     lot of forwarded spam in my inbox; I've asked people to send me
>     stuff in the past so I can get a feel for what sort of spam is being
>     sent to my users.  I use this feel to blacklist domains in the MAIL
>     FROM: SMTP command.  It works pretty well when I stay on top of it.

When mine-received-headers is true (it's false by default), Neil's code
picks out IP addresses (and their prefixes) and machine names from the
Received headers, and delivers them as tagged tokens to the classifier.  The
classifier will then learn which of these guys are and aren't good spam
indicators.  However, it will also pick up clues about the people who
forwarded this stuff, unless you're careful to strip away forwarding
artifacts.
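
To make the idea concrete, here is a rough sketch of that kind of Received
header mining.  It is not Neil's actual code: the function name, the regular
expressions, and the "received:" tag are all assumptions for illustration.
It emits every prefix of each IP address, plus any machine name that follows
"from" or "by".

```python
import re

# Hypothetical patterns; the real tokenizer's regexes may well differ.
IP_RE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")
HOST_RE = re.compile(
    r"\b(?:from|by)\s+([A-Za-z0-9][A-Za-z0-9.-]*\.[A-Za-z]{2,})"
)


def received_tokens(header_value):
    """Yield tagged tokens for the value of one Received header."""
    for match in IP_RE.finditer(header_value):
        octets = match.groups()
        # Emit each prefix of the address: a, a.b, a.b.c, a.b.c.d
        for n in range(1, 5):
            yield "received:" + ".".join(octets[:n])
    # Lowercase first so the same machine always yields the same token.
    for match in HOST_RE.finditer(header_value.lower()):
        yield "received:" + match.group(1)


if __name__ == "__main__":
    sample = ("from mail.example.com (mail.example.com [192.0.2.7]) "
              "by mx.example.org with SMTP")
    for token in received_tokens(sample):
        print(token)
```

A forwarded spam carries the forwarder's Received lines as well as the
spammer's, so tokens like these end up describing your friends' mail servers
unless the forwarding layer is stripped first.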

> 2.  Non-spam I'd erroneously entered into my spam corpus gets most of
>     the false negatives.  Neato!

I've even found hams in bruceg's spam collections (e.g., one was the output
from a cron job he apparently ran on one of his collection machines).

> 3.  Stupid forwards (mostly urban legends, exhortations to
>     pray/boycott/vote a certain way, jokes, or inspirational stories)
>     are not tagged as spam.  I get a lot of these too, from my
>     grandmother and certain friends who seem to do nothing but relay
>     chain letters.  But I don't get enough to train the filter against
>     them, apparently.

Can't guess whether *you* classified these as ham or spam.  In any case, you
don't have much training data yet (your last report had a thousand of each).

> With only one or two exceptions, that is the extent of my false
> positives and false negatives.
>
> I have to wonder, though, if the forwards (#3) are really false
> negatives.

False negative means that you put them in a spam folder but that the system
said they were ham.  Is that what you meant to say?

> Should I have those in the ham folder, and be using another
> method to weed out garbage of that type?  I'm not sure the current
> classifier is up to sorting out urban legends :)

In an ideal world you'd never see a message classified as spam.  Do you want
to see these messages or not?  Train accordingly; the system will learn over
time.

> In any case, it's becoming clear to me that in the future when we're all
> trying to help our grandmothers install spambayes, there will have to be
> some way of reviewing FPs and FNs the way we're all doing it now.  In my
> case at least, a lot of FPs and FNs aren't really F.

Put the things you think are ham in your ham collection, and the things you
think are spam in your spam collection.  By definition, a false positive is
a ham identified as spam -- if an FP is "not really F", it was *correctly*
identified as spam, and so didn't belong in your ham collection to begin
with.  Likewise, an FN that's "not really F" was correctly identified as
ham, and should have been in your ham collection all along.  The system
learns what you teach it.
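
The definitions can be spelled out in a few lines.  This is only a sketch:
error_kind and the "ham"/"spam" label strings are made up for illustration
and are not part of the spambayes code.  The "true" label is simply the
collection you filed the message in.

```python
def error_kind(filed_as, classified_as):
    """Return "FP", "FN", or None for one message.

    filed_as      -- the collection you put the message in ("ham" or "spam")
    classified_as -- what the classifier called it ("ham" or "spam")
    """
    if filed_as == "ham" and classified_as == "spam":
        return "FP"   # a ham identified as spam
    if filed_as == "spam" and classified_as == "ham":
        return "FN"   # a spam identified as ham
    return None       # the classifier agreed with your filing


if __name__ == "__main__":
    # A chain letter you filed as spam that the classifier called ham
    # counts as an FN.
    print(error_kind("spam", "ham"))
    # Refile the same message as ham and it is no longer an error.
    print(error_kind("ham", "ham"))
```

An error that's "not really F" is a filing mistake, not a classifier
mistake: move the message to the other collection and the FP or FN goes away.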