[Spambayes] test sets?

Barry A. Warsaw barry@wooz.org
Fri, 6 Sep 2002 12:16:17 -0400


>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

    TP> Barry, can you please identify for me which of these headers
    TP> are Mailman artifacts so I can avoid counting them?

Sure, with a little off-topic commentary added for no charge.

 0.01    19  3559 'header:X-Mailman-Version:1'
 0.01    19  3559 'header:List-Id:1'
 0.01    19  3557 'header:X-BeenThere:1'

These three are definitely MM artifacts, although the second one
/could/ be inserted by other list management software (List-Id is
described in RFC 2919).

 0.01     0  3093 'header:Newsgroups:1'
 0.01     0  3054 'header:Xref:1'
 0.01     0  3053 'header:Path:1'

These aren't MM artifacts, but are byproducts of gating a message off
an NNTP feed.  Some of the other NNTP-* headers are similar, but I
won't point them out below.

 0.01    19  2668 'header:List-Unsubscribe:1'
 0.01    19  2668 'header:List-Subscribe:1'
 0.01    19  2668 'header:List-Post:1'
 0.01    19  2668 'header:List-Help:1'
 0.01    19  2668 'header:List-Archive:1'

RFC-recommended generic list-server headers (RFC 2369) that MM injects.
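
As a rough illustration of skipping these, here is a minimal sketch
that drops the Mailman and List-* headers named so far before any
counting happens.  The function name countable_headers and the
LIST_ARTIFACTS set are made up for this example; this is not the
Spambayes tokenizer, just one way it could be done with the stdlib
email package.

```python
import email

# Header names taken from the listings above, stored lowercased so the
# comparison is case insensitive.  The set itself is a hypothetical helper.
LIST_ARTIFACTS = {
    "x-mailman-version", "x-beenthere", "list-id",
    "list-unsubscribe", "list-subscribe", "list-post",
    "list-help", "list-archive",
}

def countable_headers(raw_message):
    """Return header names in order, minus the list-software artifacts."""
    msg = email.message_from_string(raw_message)
    return [name for name in msg.keys() if name.lower() not in LIST_ARTIFACTS]
```

Note that List-Id and friends can come from list software other than
Mailman, so dropping them removes list artifacts in general, not just
MM's.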

 0.99   689     0 'header:Delivered-To:4'

This one's often a byproduct of the mail server.  In particular,
Postfix and possibly others put the envelope recipient in this header.

 0.99   522     0 'header:Delivered-To:3'

So why do you get two entries for this one?

 0.99   519     0 'header:Received:8'
 0.99   466     1 'header:Received:7'

And this one?

 0.99   273     0 'header:MiME-Version:1'

Note that header names are case insensitive, so this one's no
different than "MIME-Version:".  Similarly other headers in your list.
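
To make the case point concrete, here is a small sketch that folds
header names to lowercase before building tokens.  It assumes the
trailing number in tokens like 'header:Received:8' is a count of how
many times the header appears, which is a guess on my part and not
something confirmed here; header_count_tokens is a hypothetical name.

```python
from collections import Counter

def header_count_tokens(header_names):
    """Build 'header:<name>:<count>' tokens with names lowercased.

    Assumption: the count is the number of times the header occurs
    in one message.
    """
    counts = Counter(name.lower() for name in header_names)
    return sorted(f"header:{name}:{n}" for name, n in counts.items())
```

With this, "MiME-Version" and "MIME-Version" land on the same token
instead of being counted as two different clues.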

 0.99    27     0 'header:1:1'

Huh?

 0.01     0    27 'header:X-Originally-To:1'

Mailman copies any To: header found in a message gated off NNTP to
the X-Originally-To: header.  Others possible here include
X-Original-To, X-Original-Cc, X-Original-Content-Transfer-Encoding,
and X-Original-Date.

 0.01     0     9 'header:X-No-Archive:1'

Could be MM or not.  This is used to stop the archiving of certain
messages, and MM will inject these into digests and password
reminders, but it's also possible that user agents have added this.

(Aside: in particularly mischievous fashion, the value of this header
is "yes", so you see things like "X-No-Archive: yes".
X-Isn't-Not-Nonsense: no.)

 0.02    65  3559 'header:Precedence:1'

Could be Mailman, or not.  This header is supposed to tell other
automated software that this message was automated.  E.g. a replybot
should ignore any message with a Precedence: {bulk|junk|list}.
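
A replybot's check along those lines might look like the sketch below.
The name should_autoreply is invented for the example, and a real
autoresponder would likely look at more than this one header.

```python
import email

# The three Precedence values mentioned above.
AUTOMATED_PRECEDENCE = {"bulk", "junk", "list"}

def should_autoreply(raw_message):
    """Return False when the message marks itself as automated."""
    msg = email.message_from_string(raw_message)
    # Header lookup is case insensitive; a missing header becomes "".
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence not in AUTOMATED_PRECEDENCE
```

A message with no Precedence: header at all is treated as ordinary
mail here, which is a design choice of the sketch.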

 0.80  1471   273 'header:Delivered-To:1'

Why again?!

 0.50     4     0 'header:2:1'

!?

 0.50     3     0 'header:6:1'
 0.50     3     0 'header:5:1'
 0.50     3     0 'header:4:1'
 0.50     3     0 'header:3:1'
 0.50     2     0 'header:X-:1'

Freaky.

 0.50     0     2 'header:X-BeenThere:3'

X-BeenThere: before :)

 0.50     0     2 'header:'

Heh?

 0.50     0     1 'header:X-Silly:1'

X-Very-Silly: fneh
X-Very-Silly-Indeed: dead parrot

 0.50     0     1 'header:X-Get-A-Real-Newsreader:1'
 0.50     0     1 'header:X-Favorite-Dwarf:1'
 0.50     0     1 'header:X-Eric-Conspiracy:1'
 0.50     0     1 'header:Favorite-Color:1'

Cute. :)

Some headers, of course, are totally unreliable as to their origin.  I'm
thinking of stuff like MIME-Version, Content-Type, To, From, etc.
Everyone sticks those in.

-Barry