[Spambayes] Spamvolution

Guido van Rossum guido@python.org
Fri, 20 Sep 2002 12:09:38 -0400


> What we may be left with is spam like this (a persistent false negative in
> my test data):
> 
> """
> Return-Path: <kba94@earthlink.net>
> Delivered-To: bfsmedia-bluegoose.kennels@bfsmedia.com
> Received: (qmail 16482 invoked from network); 11 Mar 2002 16:01:28 -0000
> Received: from harrier.mail.pas.earthlink.net (HELO
> harrier.prod.itd.earthlink.net) (207.217.120.12)
>   by agamemnon.bfsmedia.com with SMTP; 11 Mar 2002 16:01:28 -0000
> Received: from pool-209-128-140-231.gent.ipa.net ([209.128.140.231]
> helo=13vys)
>         by harrier.prod.itd.earthlink.net with smtp (Exim 3.33 #1)
>         id 16kSN8-0004nj-00; Mon, 11 Mar 2002 08:10:04 -0800
> From: "Keith Allison" <kba94@earthlink.net>
> To: "Amy Kight" <amykight@surfari.net>,
>         "Angel Baker" <piper1026@msn.com>,
>         "Baker, Chris & Tammy" <bakert@ipa.net>,
>         "Baker, Kerry" <kbaker@delta.sesc.k12.ar.us>,
>         "Batagower, Holly" <hbatagower@acnielsen.com>,
>         "Bo" <BoJessica@cs.com>,
>         "Brad Hesgard" <Bheswad@aol.com>,
>         "Brian Gilbert" <owl_45@hotmail.com>,
>         "Brian Shinall" <shinall@1s.net>,
>         "Briggs, A.J." <ATMABRIGGS@JUNO.COM>,
>         "Bryant, Cindy" <cbryant@kraft.com>,
>         "Burris, Jay" <jaybduck@ipa.net>,
>         "Burturm, Jodi" <jodi.burtrum@satellink.net >,
>         "Butler, Scott or Revonne" <butlers@hcil.net>,
>         "Christina Clark" <christinarclark@hotmail.com>,
>         "Clampit, Connie Joe" <clampitc@tyson.com>,
>         "Clifford Toney" <cliffordtoney@cox-internet.com>,
>         "Combs, Karrie" <karrie.combs@us.nestle.com>,
>         "David Haffner" <dhaffner@ipa.net>,
>         "Dennis Ditto" <dittohdditto@cs.com>,
>         "Eddie Young" <ebfarms@seark.net>,
>         "Eliza Bowman" <elbowman@monad.net>,
>         "Froning, Sara" <sfroning@kraft.com>,
>         "Gary McDonald Jr." <garymc@icnet.net>,
>         "Glen Hemminger" <gj.hemminger@home.com>,
>         "Good, Dennis" <Hogman9832@aol.com>,
>         "Janet Thornhill \(E-mail\)" <Janet@thornhillauto.net>,
>         "Jared Woodward" <Jdwood19@cs.com>,
>         "Joel Strickland" <joel@aboweb.com>,
>         "John Calhoun" <Calhoonj9@cs.com>,
>         "John Metz" <JTMETZ1@aol.com>,
>         "Jones, Shelley" <RTSRJ@aol>,
>         "Lacy, Scott" <Slacey@Sofnet.com>,
>         "Leonard, Tammy" <tndleonard@aol.com>,
>         "Linda Cline" <cline@somedayretrievers.com>,
>         "Lynn A. Yandell" <yandell@uark.edu>,
>         "M. D. Baxter" <baxram@bellsouth.net>,
>         "Mark So" <info@dogtra.com>,
>         "McDonald, Gary" <gmcdonald1@mmcable.com>,
>         "Michael Carter" <michaelraycarter@yahoo.com>,
>         "Mike Anthony" <mikeanth@earthlink.net>,
>         "Moon, Billy" <MoonPhase5@aol.com>,
>         "Myers, Chuck" <cemyers@gte.net>,
>         "Nancy Belknap" <nhah103@ipa.net>,
>         "Nancy R Oldfather" <Nancy.Oldfather@kraft.com>,
>         "Nathen C. Neal" <nneal@uark.edu>,
>         "Nunn, Randy" <RandHNunn@aol.com>,
>         "Owen Bybee" <owen@bomontana.com>,
>         "Perry Cox" <perrycox@hotmail.com>,
>         "Philip Hattaway" <phattaway@kc.rr.com>,
>         "Richard Davis" <richard@gofit.net>,
>         "Robert Clay Connor" <c2dg@midsouth.rr.com>,
>         "Roberts, Karen" <karenr@ipa.net>,
>         "Rose, Ernie & Pat" <bluegoose.kennels@bfsmedia.com>,
>         "Shelley. Allison" <shelley.allison@us.pwcglobal.com>,
>         "Stan Lewis" <slewis@iqcisp.com>,
>         "Stephen Wright" <stephen.wright@leggett.com>,
>         "Steve Vanduine" <vanduine@ukcdogs.com>,
>         "Sutton, Wes" <medicus@medicusrg.com>,
>         "Tami Baker" <cte87344@centurytel.net>,
>         "Ted Cullen" <info@etch-marc.com>,
>         "Teresa Fuller" <bsandtb@peoplepc.com>,
>         "Todd Wilken" <tjwilken@premiernet.net>,
>         "Tracy Johnston" <tracyjohnston@service-advantage.com>,
>         "Whelan, Ken" <ken@UBH.com>

Stop right there.  You *are* counting the sheer length of the "To:"
header as a spam indicator, right?  Perhaps alphabetized lists should
be considered an additional spam indicator?  (AFAIK your current
tokenizer doesn't pick up on the alphabetization, so you'll never
know how much information there is in that until you try it.
I've seen many spams with slices of the alphabet in the To: line.)
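
A rough sketch of what trying it might look like, not taken from the
actual tokenizer.  The function name, the token strings, and the
min_run threshold are all made up for illustration; the idea is only
to emit one token for the approximate recipient count and another when
a long run of recipients is in alphabetical order.

```python
from email.utils import getaddresses


def to_header_tokens(to_value, min_run=5):
    """Yield hypothetical tokens describing the recipient list of a To: header.

    The token names and min_run are illustrative assumptions, not anything
    the existing tokenizer produces.
    """
    pairs = getaddresses([to_value])
    # Sort key per recipient: display name if present, else the address.
    names = [(name or addr).strip().lower() for name, addr in pairs if name or addr]
    n = len(names)
    if n == 0:
        return

    # Bucket the count on a log2 scale so similar list sizes share a token.
    yield "to:count:2**%d" % (n.bit_length() - 1)

    # Longest run of consecutive recipients in non-decreasing order.
    longest = run = 1
    for prev, cur in zip(names, names[1:]):
        run = run + 1 if prev <= cur else 1
        longest = max(longest, run)
    if longest >= min_run:
        yield "to:alphabetized"
```

Whether either token carries any weight is exactly the open question;
that can only be settled by adding them and comparing error rates on
the test data.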

(And yes, I still plan to run an experiment on my own mailbox.  None
of the answers on how to use MH folders were sufficient, but I think
it'll be trivial to augment splitndirs.py and others to support the
same kind of argument that hammie.py supports.)
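
For what it's worth, here is a minimal sketch of the MH side, assuming
the standard library's mailbox.MH class.  This is not the real
splitndirs.py; the function name, the output directory naming, and the
seed parameter are hypothetical.  It copies each message in an MH
folder into one of n directories chosen at random.

```python
import mailbox
import os
import random


def split_mh_folder(mh_path, outdirbase, n, seed=None):
    """Copy each message in an MH folder into one of n directories at random.

    Hypothetical helper: directories are named outdirbase1 .. outdirbaseN,
    and each message is written to a file named after its MH key.
    Returns the number of messages placed in each directory.
    """
    rng = random.Random(seed)
    outdirs = ["%s%d" % (outdirbase, i) for i in range(1, n + 1)]
    for d in outdirs:
        os.makedirs(d, exist_ok=True)

    # create=False: fail loudly if the folder does not exist.
    folder = mailbox.MH(mh_path, create=False)
    counts = [0] * n
    try:
        for key in folder.iterkeys():
            i = rng.randrange(n)
            counts[i] += 1
            with open(os.path.join(outdirs[i], str(key)), "wb") as f:
                f.write(folder.get_bytes(key))
    finally:
        folder.close()
    return counts
```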

--Guido van Rossum (home page: http://www.python.org/~guido/)