[Spambayes] test sets?

Fri, 06 Sep 2002 12:45:09 -0400

[Barry A. Warsaw, gives answers and asks questions]

Here's the code that produced the header tokens:

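    x2n = {}  # header name -> number of times it appears in this msg
    for x in msg.keys():  # keys() repeats a name once per header line
        x2n[x] = x2n.get(x, 0) + 1
    for x in x2n.items():  # x is a (name, count) pair
        yield "header:%s:%d" % x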
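
For anyone who wants to run the fragment above on its own, here is a minimal sketch that wraps it in a generator function and feeds it a message parsed with the standard `email` package. The function name `header_count_tokens` and the sample message are made up for illustration; they are not taken from the actual tokenizer.

```python
import email
from collections import Counter

def header_count_tokens(msg):
    """Yield one 'header:<name>:<count>' token per distinct header name."""
    # msg.keys() lists the name of every header line, duplicates included,
    # so counting the names gives the number of instances of each header.
    counts = Counter(msg.keys())
    for name, n in counts.items():
        yield "header:%s:%d" % (name, n)

# Hypothetical sample message, for illustration only.
sample = email.message_from_string(
    "Received: from a.example by b.example\n"
    "Received: from b.example by c.example\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
tokens = sorted(header_count_tokens(sample))
print(tokens)  # ['header:Received:2', 'header:Subject:1']
```

Two 'Received' lines produce the single token 'header:Received:2', which is why different counts of the same header show up as separate entries.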

Some responses:

>  0.01    19  3559 'header:X-Mailman-Version:1'
>  0.01    19  3559 'header:List-Id:1'
>  0.01    19  3557 'header:X-BeenThere:1'
>
> These three are definitely MM artifacts, although the second one
> /could/ be inserted by other list management software (it's described
> in an RFC).

Since all the ham came from Mailman, and only 19 spam had it, these headers
say more about where the test data came from than about ham vs spam, so it's
quite safe to assume that I should ignore them for now.

>  0.01     0  3093 'header:Newsgroups:1'
>  0.01     0  3054 'header:Xref:1'
>  0.01     0  3053 'header:Path:1'
>
> These aren't MM artifacts, but are byproducts of gating a message off
> of an nntp feed.  Some of the other NNTP-* headers are similar, but I
> won't point them out below.

I should ignore these too then.

>  0.01    19  2668 'header:List-Unsubscribe:1'
>  0.01    19  2668 'header:List-Subscribe:1'
>  0.01    19  2668 'header:List-Post:1'
>  0.01    19  2668 'header:List-Help:1'
>  0.01    19  2668 'header:List-Archive:1'
>
> RFC recommended generic listserve headers that MM injects.

Ditto.

> So why do you get two entries for this one?
>
>  0.99   519     0 'header:Received:8'
>  0.99   466     1 'header:Received:7'

Read the code <wink>.  The first line counts msgs that had 8 instances of a
'Received' header, and the second counts msgs that had 7 instances.  I
expect this is a good clue!  The more indirect the mail path, the more of
those thingies we'll see, and if you're posting from a spam trailer park in
Tasmania you may well need to travel thru more machines.

> ...
> Note that header names are case insensitive, so this one's no
> different than "MIME-Version:".  Similarly other headers in your list.

Ignoring case here may or may not help; that's for experiment to decide.
It's plausible that case is significant, if, e.g., a particular spam mailing
package generates unusual case, or a particular clueless spammer
misconfigures his package.
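
One way to set up that experiment is to make case folding a switch and compare results with it on and off. This is only a sketch: the `fold_case` parameter and the function name are assumptions for illustration, not part of the real tokenizer.

```python
import email
from collections import Counter

def header_count_tokens(msg, fold_case=False):
    """Yield 'header:<name>:<count>' tokens, optionally lowercasing names."""
    names = msg.keys()
    if fold_case:
        # Treat 'MIME-Version' and 'Mime-Version' as the same header.
        names = [name.lower() for name in names]
    for name, n in Counter(names).items():
        yield "header:%s:%d" % (name, n)

# Two hypothetical messages that differ only in header-name case.
msg_a = email.message_from_string("MIME-Version: 1.0\n\nbody\n")
msg_b = email.message_from_string("Mime-Version: 1.0\n\nbody\n")

as_is = [list(header_count_tokens(m)) for m in (msg_a, msg_b)]
folded = [list(header_count_tokens(m, fold_case=True)) for m in (msg_a, msg_b)]
print(as_is)   # the two messages yield different tokens
print(folded)  # the two messages yield the same token
```

With folding off, the unusual spelling keeps its own token and its own statistics; with folding on, that evidence is merged away. Which setting scores better is exactly what the test run has to say.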

>  0.02    65  3559 'header:Precedence:1'
>
> Could be Mailman, or not.  This header is supposed to tell other
> automated software that this message was automated.  E.g. a replybot
> should ignore any message with a Precedence: {bulk|junk|list}.

Rule of thumb:  if Mailman inserts a thing, I should ignore it.  Or, better,
I should stop trying to out-think the flaws in the test data and get better
test data instead!
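
Until better test data shows up, the rule of thumb could be sketched as a skip list applied before counting. The names below are only the headers called out in this thread as Mailman or NNTP-gateway artifacts; the list is an assumption for illustration and surely incomplete, and `SKIP_HEADERS` and the function name are hypothetical.

```python
import email
from collections import Counter

# Headers named in this thread as list-manager or news-gateway artifacts.
# Assumption: illustrative only, not a complete or authoritative list.
SKIP_HEADERS = frozenset(name.lower() for name in (
    "X-Mailman-Version", "List-Id", "X-BeenThere",
    "Newsgroups", "Xref", "Path",
    "List-Unsubscribe", "List-Subscribe", "List-Post",
    "List-Help", "List-Archive", "Precedence",
))

def header_count_tokens(msg, skip=SKIP_HEADERS):
    """Yield 'header:<name>:<count>' tokens, leaving out skipped headers."""
    # The skip test ignores case; the emitted token keeps the original case.
    kept = [name for name in msg.keys() if name.lower() not in skip]
    for name, n in Counter(kept).items():
        yield "header:%s:%d" % (name, n)

# Hypothetical sample message, for illustration only.
sample = email.message_from_string(
    "Received: from a.example by b.example\n"
    "List-Id: <some-list.example>\n"
    "X-BeenThere: some-list@example\n"
    "Precedence: bulk\n"
    "\n"
    "body\n"
)
tokens = sorted(header_count_tokens(sample))
print(tokens)  # ['header:Received:1']
```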

>  0.50     4     0 'header:2:1'
>
> !?
> ...
>  0.50     0     2 'header:'
>
> Heh?

I sucked out all the wordinfo keys that began with "header:".  The last line
there was probably due to unrelated instances of the string "header:" in
message bodies.  Harder to guess about the first line.
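
The extraction amounts to a prefix filter over the keys, which is also why a body word that happens to be "header:" gets swept up with the real header tokens. A minimal sketch, assuming `wordinfo` behaves like a dict keyed by token; the value layout here (spam count, ham count) and the "cheap" entry are stand-ins for illustration, not the classifier's real records.

```python
# Hypothetical stand-in for the classifier's wordinfo mapping:
# token -> (spam count, ham count).
wordinfo = {
    "header:Received:8": (519, 0),  # a genuine header-count token
    "header:": (0, 2),              # a body word that collides with the prefix
    "cheap": (100, 3),              # made-up body token, filtered out below
}

# Keep every key that starts with the prefix, whatever produced it.
header_keys = sorted(k for k in wordinfo if k.startswith("header:"))
print(header_keys)  # ['header:', 'header:Received:8']
```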

> ...
> Some headers of course are totally unreliable as to their origin.  I'm
> thinking stuff like MIME-Version, Content-Type, To, From, etc, etc.
> Everyone sticks those in.

The brilliance of Anthony's "just count them" scheme is that it requires no
thought, so can't be fooled <wink>.  Header lines that are evenly
distributed across spam and ham will turn out to be worthless indicators
(prob near 0.5), so do no harm.
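
To see why an evenly distributed header lands near 0.5, here is a deliberately simplified sketch: a plain ratio of per-corpus frequencies, clamped to the 0.01 to 0.99 range seen in the listing above. This is an assumption for illustration and not the project's actual scoring formula; `spam_prob` and the corpus sizes in the examples are made up.

```python
def spam_prob(spamcount, hamcount, nspam, nham):
    """Simplified spam probability for a token (illustrative only)."""
    spam_ratio = spamcount / nspam  # fraction of spam msgs with the token
    ham_ratio = hamcount / nham     # fraction of ham msgs with the token
    if spam_ratio + ham_ratio == 0:
        return 0.5                  # never seen: no evidence either way
    p = spam_ratio / (spam_ratio + ham_ratio)
    return min(0.99, max(0.01, p))  # clamp, as the 0.01/0.99 rows suggest

# Hypothetical corpus sizes: 1000 spam and 1000 ham messages.
even = spam_prob(500, 500, 1000, 1000)     # header common to both corpora
spammy = spam_prob(519, 0, 1000, 1000)     # header seen only in spam
hammy = spam_prob(0, 900, 1000, 1000)      # header seen only in ham
print(even, spammy, hammy)  # 0.5 0.99 0.01
```

A header that everyone sticks in contributes a probability of about 0.5, so it pushes the final score neither way.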