[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Tue, 27 Aug 2002 22:36:17 -0400

Setting this up has been a bitch.  All early attempts floundered because it
turned out there was *some* systematic difference between the ham and spam
archives that made the job trivial.

The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
4,000 each, at random, from a python-list archive Barry put together,
containing msgs only after SpamAssassin was put into play on python.org.
It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.  As
will be seen below, it's not clean enough.

The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
collection, at <http://www.em.ca/~bruceg/spam/>.  It was broken at random
into 5 sets of 2,750 spams each.
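
The partitioning itself is simple; here's a hypothetical sketch of such a
random 5-way split (the actual script isn't shown in this post):

```python
import random

# Hypothetical helper: shuffle the messages, then deal them
# round-robin into nsets equal-sized lists.
def split_into_sets(msgs, nsets=5, seed=None):
    rng = random.Random(seed)
    msgs = list(msgs)
    rng.shuffle(msgs)
    return [msgs[i::nsets] for i in range(nsets)]

sets = split_into_sets(range(20000), nsets=5, seed=42)
print([len(s) for s in sets])  # [4000, 4000, 4000, 4000, 4000]
```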

Problems included:

+ Mailman added distinctive headers to every message in the ham
  archive, which appear nowhere in the spam archive.  A Bayesian
  classifier picks up on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing
  addresses that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive
  has plain Unix \n.  This turned out to be a killer clue(!) in the simple
  character n-gram attempts.  (Note:  I can't use text mode to read
  msgs, because there are binary characters in the archives that
  Windows treats as EOF in text mode -- indeed, 400MB of the ham
  archive vanishes when read in text mode!)
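
In a modern Python that workaround amounts to this minimal sketch: read in
binary mode, so an embedded Ctrl-Z byte can't truncate the read on Windows,
then normalize line ends by hand:

```python
def read_normalized(path):
    # Binary mode: Windows text mode stops at an embedded Ctrl-Z (0x1a),
    # silently dropping everything after it.
    with open(path, 'rb') as f:
        raw = f.read()
    # Normalize Windows \r\n line ends to plain Unix \n.
    return raw.replace(b'\r\n', b'\n')
```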

What I'm reporting on here is after normalizing all line-ends to \n, and
ignoring the headers *completely*.  There are obviously good clues in the
headers, the problem is that they're killer-good clues for accidental
reasons in this test data.  I don't want to write code to suppress the
clues either, as then I'd be testing some mix of my insights (or lack
thereof) with what a blind classifier would do.  But I don't care how good I
am, I only care about how well the algorithm does.

Since it's ignoring the headers, I think it's safe to view this as a lower
bound on what can be achieved.  There's another way this should be a lower
bound:

def tokenize_split(string):
    for w in string.split():
        yield w

tokenize = tokenize_split

class Msg(object):
    def __init__(self, dir, name):
        path = dir + "/" + name
        self.path = path
        f = file(path, 'rb')
        guts = f.read()
        # Skip the headers.
        i = guts.find('\n\n')
        if i >= 0:
            guts = guts[i+2:]
        self.guts = guts

    def __iter__(self):
        return tokenize(self.guts)
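
For example, whitespace splitting keeps punctuation glued to the words
(illustration only, not output from the test harness):

```python
# What the split tokenizer yields for a tiny body: whitespace-separated
# chunks, punctuation and all.
body = "Python wrote: >>> import this"
print(body.split())  # ['Python', 'wrote:', '>>>', 'import', 'this']
```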

This is about the stupidest tokenizer imaginable, merely splitting the body
on whitespace.  Here's the output from the first run, training against one
pair of spam+ham groups, then seeing how its predictions stack up against
each of the four other pairs of spam+ham groups:

Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams
    testing against Data/Spam/Set2 and Data/Ham/Set2
    tested 4000 hams and 2750 spams
    false positive: 0.00725 (i.e., under 1%)
    false negative: 0.0530909090909 (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
    tested 4000 hams and 2750 spams
    false positive: 0.007
    false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
    tested 4000 hams and 2750 spams
    false positive: 0.0065
    false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
    tested 4000 hams and 2750 spams
    false positive: 0.00675
    false negative: 0.0516363636364

It's a Good Sign that the false positive/negative rates are very close
across the four test runs.  It's possible to quantify just how good a sign
that is, but they're so close by eyeball that there's no point in bothering.
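
For scale, those rates back out to whole-message counts:  the first test
pair's figures correspond to 29 of 4000 hams and 146 of 2750 spams
misclassified.

```python
# The reported rates are just error counts over corpus-set sizes.
print(29 / 4000)   # 0.00725  (false positive rate)
print(146 / 2750)  # ~0.0530909...  (false negative rate)
```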

This is using the new Tester.py in the sandbox, and that class automatically
remembers the false positives and negatives.  Here's the start of the first
false positive from the first run:

It's not really hard!!
Turn $6.00 into $1,000 or more...read this to find out how!! READING
THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board
to try it. A little while back, while chatting on the internet, I came
across an article
similar to this that said you could make thousands of dollars in cash
within weeks
with only an initial investment of $6.00! So I thought, "Yeah right,
this must be a scam", but like most of us, I was curious, so I kept
reading. Anyway,
it said that you send $1.00 to each of the six names and address
statedin the
article. You then place your own name and address in the bottom of the
list at #6, and
post the article in at least 200 newsgroups (There are thousands) or
e-mail them. No

Call me forgiving, but I think it's vaguely possible that this should have
been in the spam corpus instead <wink>.

Here's the start of the second false positive:

Please forward this message to anyone you know who is active in the stock market.

See Below for Press Release

Dear Friends,

I am a normal investor same as you.  I am not a finance professional nor am
I connected to FDNI in any way.

I recently stumbled onto this OTC stock (FDNI) while searching through yahoo
for small float, big potential stocks. At the time, the company had released
a press release which stated they were doing a stock buyback.  Intrigued, I
bought 5,000 shares at $.75 each.  The stock went to $1.50 and I sold my
shares.  I then bought them back at $1.15.  The company then circulated
another press release about a foreign acquisition (see below).  The stock
jumped to $2.75 (I sold @ $2.50 for a massive profit).  I then bought back
in at $1.25 where I am holding until the next major piece of news.

Here's the start of the third:

Grand Treasure Industrial Limited

Contact Information

We are a manufacturer and exporter in Hong Kong for all kinds of plastic bags.
We export to worldwide markets. Recently , we join-ventured with a bag
factory in China produce all kinds of shopping , lady's , traveller's
bags.... visit our page and send us your enquiry by email now.
Contact Address :
Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong.
Telephone : ( 852 ) 2408 9382

That is, all the "false positives" there are blatant spam.  It will take a
long time to sort this all out, but I want to make a point here now:  the
classifier works so well that it can *help* clean the ham corpus!  I haven't
found a non-spam among the "false positives" yet.  Another lesson reinforces
one from my previous life in speech recognition:  rigorous data collection,
cleaning, tagging and maintenance is crucial when working with statistical
approaches, and is damned expensive to do.

Here's the start of the first "false negative" (including the headers):

Return-Path: <911@911.COM>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.) (
  by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <1604543-22002702894513952@smtp.vip.sina.com>
To: "NEW020515" <911@911.COM>
=46rom: "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdata=
base.net =A3=A9" <911@911.COM>
Subject: =D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdata=
base.net =A3=A9
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=3Dgb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977


Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier.  Since it's never seen these "words" before, it has no reason to
believe they're either spam or ham indicators, and favors calling it ham.
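
The effect is easy to demonstrate:  a quoted-printable hex run contains no
whitespace, so an entire encoded line survives the split as one giant,
never-before-seen token:

```python
# One QP-encoded line (like the Subject above): no spaces anywhere,
# so str.split() returns the whole line as a single "word".
qp_line = "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE"
print(qp_line.split())       # the entire line as one token
print(len(qp_line.split()))  # 1
```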

One more mondo cool thing and that's it for now.  The GrahamBayes classifier
keeps track of how many times each word makes it into the list of the 15
strongest indicators.  These are the "killer clues" the classifier gets the
most value from.  The most valuable spam indicator turned out to be
"<br>" -- there's simply almost no HTML mail in the ham archive (but note
that this clue would be missed if you stripped HTML!).  You're never going
to guess what the most valuable non-spam indicator was, but it's quite
plausible after you see it.  Go ahead, guess.  Chicken <wink>.

Here are the 15 most-used killer clues across the runs shown above:  the
repr of the word, followed by the # of times it made it into the 15-best
list, and the estimated probability that a msg is spam if it contains this
word:

    testing against Data/Spam/Set2 and Data/Ham/Set2
    best discrimators:
        'Helvetica,' 243 0.99
        'object' 245 0.01
        'language' 258 0.01
        '<BR>' 292 0.99
        '>' 339 0.179104
        'def' 397 0.01
        'article' 423 0.01
        'module' 436 0.01
        'import' 499 0.01
        '<br>' 652 0.99
        '>>>' 667 0.01
        'wrote' 677 0.01
        'python' 755 0.01
        'Python' 1947 0.01
        'wrote:' 1988 0.01

    testing against Data/Spam/Set3 and Data/Ham/Set3
    best discrimators:
        'string' 494 0.01
        'Helvetica,' 496 0.99
        'language' 524 0.01
        '<BR>' 553 0.99
        '>' 687 0.179104
        'article' 851 0.01
        'module' 857 0.01
        'def' 875 0.01
        'import' 1019 0.01
        '<br>' 1288 0.99
        '>>>' 1344 0.01
        'wrote' 1355 0.01
        'python' 1461 0.01
        'Python' 3858 0.01
        'wrote:' 3984 0.01

    testing against Data/Spam/Set4 and Data/Ham/Set4
    best discrimators:
        'object' 749 0.01
        'Helvetica,' 757 0.99
        'language' 763 0.01
        '<BR>' 877 0.99
        '>' 954 0.179104
        'article' 1240 0.01
        'module' 1260 0.01
        'def' 1364 0.01
        'import' 1517 0.01
        '<br>' 1765 0.99
        '>>>' 1999 0.01
        'wrote' 2071 0.01
        'python' 2160 0.01
        'Python' 5848 0.01
        'wrote:' 6021 0.01

    testing against Data/Spam/Set5 and Data/Ham/Set5
    best discrimators:
        'object' 980 0.01
        'language' 992 0.01
        'Helvetica,' 1005 0.99
        '<BR>' 1139 0.99
        '>' 1257 0.179104
        'article' 1678 0.01
        'module' 1702 0.01
        'def' 1846 0.01
        'import' 2003 0.01
        '<br>' 2387 0.99
        '>>>' 2624 0.01
        'wrote' 2743 0.01
        'python' 2864 0.01
        'Python' 7830 0.01
        'wrote:' 8060 0.01
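
Those pegged 0.01/0.99 endpoints fall out of the Graham-style per-word
probability (per Paul Graham's "A Plan for Spam") that GBayes is modeled
on.  Here's a sketch with made-up counts -- not the GrahamBayes source
itself:

```python
def spamprob(g, b, ngood, nbad):
    # g, b: # of ham/spam msgs containing the word; ngood, nbad: corpus
    # sizes.  Ham counts are doubled to bias against false positives,
    # and the result is clamped into [0.01, 0.99].
    goodfreq = min(1.0, 2.0 * g / ngood)
    badfreq = min(1.0, b / nbad)
    return max(0.01, min(0.99, badfreq / (goodfreq + badfreq)))

# A word seen only in ham pegs at 0.01; only in spam, at 0.99; one
# seen in both corpora (like '>') lands somewhere in between.
print(spamprob(g=500, b=0, ngood=4000, nbad=2750))    # 0.01
print(spamprob(g=0, b=300, ngood=4000, nbad=2750))    # 0.99
print(spamprob(g=800, b=120, ngood=4000, nbad=2750))  # in between
```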

Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list.  I've had this
argument with some of you before <wink>, but the best way to let this kind
of thing be as intelligent as it can be is not to try to help it too much:
it will learn things you'll never dream of, provided only you don't filter
the clues out in an attempt to be clever.

everything's-a-clue-ly y'rs  - tim