[Python-Dev] The first trustworthy <wink> GBayes results

Tim Peters tim.one@comcast.net
Tue, 27 Aug 2002 22:36:17 -0400


Setting this up has been a bitch.  All early attempts floundered
because it turned out there was *some* systematic difference between
the ham and spam archives that made the job trivial.

The ham archive:  I selected 20,000 messages, and broke them into 5
sets of 4,000 each, at random, from a python-list archive Barry put
together, containing msgs only after SpamAssassin was put into play on
python.org.  It's hoped that's pretty clean, but nobody checked all
~= 160,000+ msgs.  As will be seen below, it's not clean enough.

The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
collection, at <http://www.em.ca/~bruceg/spam/>.  It was broken at
random into 5 sets of 2,750 spams each.
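
For concreteness, the split itself needs only a few lines of Python;
this is just a sketch of the idea (the directory layout, function name
and seed here are made up, not the actual script used):

import os
import random

def split_corpus(srcdir, nsets=5, seed=42):
    # Shuffle the file names, then deal them round-robin into nsets buckets.
    names = os.listdir(srcdir)
    random.seed(seed)       # fixed seed so the split is reproducible
    random.shuffle(names)
    sets = [[] for dummy in range(nsets)]
    for i, name in enumerate(names):
        sets[i % nsets].append(os.path.join(srcdir, name))
    return sets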

Problems included:

+ Mailman added distinctive headers to every message in the ham
  archive, which appear nowhere in the spam archive.  A Bayesian
  classifier picks up on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing
  addresses that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive
  has plain Unix \n.  This turned out to be a killer clue(!) in the
  simplest character n-gram attempts.  (Note:  I can't use text mode
  to read msgs, because there are binary characters in the archives
  that Windows treats as EOF in text mode -- indeed, 400MB of the ham
  archive vanishes when read in text mode!)

What I'm reporting on here is after normalizing all line-ends to \n,
and ignoring the headers *completely*.  There are obviously good clues
in the headers; the problem is that they're killer-good clues for
accidental reasons in this test data.  I don't want to write code to
suppress these clues either, as then I'd be testing some mix of my
insights (or lack thereof) with what a blind classifier would do.  But
I don't care how good I am, I only care about how well the algorithm
does.
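
The normalization itself is a one-liner; a sketch (not the exact code,
and note the Msg class below doesn't repeat it):

def normalize_lineends(guts):
    # Turn Windows \r\n (and any stray \r) into plain Unix \n.
    return guts.replace('\r\n', '\n').replace('\r', '\n')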

Since it's ignoring the headers, I think it's safe to view this as a
lower bound on what can be achieved.  There's another way this should
be a lower bound:

def tokenize_split(string):
    for w in string.split():
        yield w

tokenize = tokenize_split

class Msg(object):
    def __init__(self, dir, name):
        path = dir + "/" + name
        self.path = path
        f = file(path, 'rb')
        guts = f.read()
        f.close()
        # Skip the headers.
        i = guts.find('\n\n')
        if i >= 0:
            guts = guts[i+2:]
        self.guts = guts

    def __iter__(self):
        return tokenize(self.guts)

This is about the stupidest tokenizer imaginable, merely splitting the
body on whitespace.  Here's the output from the first run, training
against one pair of spam+ham groups, then seeing how its predictions
stack up against each of the four other pairs of spam+ham groups:

Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams
    testing against Data/Spam/Set2 and Data/Ham/Set2
    tested 4000 hams and 2750 spams
    false positive: 0.00725 (i.e., under 1%)
    false negative: 0.0530909090909 (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
    tested 4000 hams and 2750 spams
    false positive: 0.007
    false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
    tested 4000 hams and 2750 spams
    false positive: 0.0065
    false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
    tested 4000 hams and 2750 spams
    false positive: 0.00675
    false negative: 0.0516363636364

It's a Good Sign that the false positive/negative rates are very close
across the four test runs.  It's possible to quantify just how good a
sign that is, but they're so close by eyeball that there's no point in
bothering.
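
If you did want to bother, it's only a few lines; a sketch, using just
the rates printed above:

def mean_and_sdev(rates):
    # Sample mean and standard deviation of a list of error rates.
    n = len(rates)
    mean = sum(rates) / n
    var = sum((r - mean) ** 2 for r in rates) / (n - 1)
    return mean, var ** 0.5

fp_rates = [0.00725, 0.007, 0.0065, 0.00675]
fn_rates = [0.0530909090909, 0.056, 0.0545454545455, 0.0516363636364]
# mean_and_sdev(fp_rates) -> roughly (0.0069, 0.0003)
# mean_and_sdev(fn_rates) -> roughly (0.0538, 0.0019)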

This is using the new Tester.py in the sandbox, and that class
automatically remembers the false positives and negatives.  Here's the
start of the first false positive from the first run:

"""
It's not really hard!!
Turn $6.00 into $1,000 or more...read this to find out how!! READING
THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board
anddecided
to try it. A little while back, while chatting on the internet, I came
across an article
similar to this that said you could make thousands of dollars in cash
within weeks
with only an initial investment of $6.00! So I thought, "Yeah right,
this must be a scam", but like most of us, I was curious, so I kept
reading. Anyway,
it said that you send $1.00 to each of the six names and address
statedin the
article. You then place your own name and address in the bottom of the
list at #6, and
post the article in at least 200 newsgroups (There are thousands) or
e-mail them. No
"""

Call me forgiving, but I think it's vaguely possible that this should
have been in the spam corpus instead <wink>.

Here's the start of the second false positive:

"""
Please forward this message to anyone you know who is active in the stock
market.

See Below for Press Release
xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX

Dear Friends,

I am a normal investor same as you.  I am not a finance  professional nor am
I connected to FDNI in any way.

I recently stumbled onto this OTC stock (FDNI) while searching through yahoo
for small float, big potential stocks. At the time, the company had released
a press release which stated they were doing a stock buyback.  Intrigued, I
bought 5,000 shares at $.75 each.  The stock went to $1.50 and I sold my
shares.  I then bought them back at $1.15.  The company then circulated
another press release about a foreign acquisition (see below).  The stock
jumped to $2.75 (I sold @ $2.50 for a massive profit).  I then bought back
in at $1.25 where I am holding until the next major piece of news.
"""

Here's the start of the third:

"""
Grand Treasure Industrial Limited

Contact Information

We are a manufacturer and exporter in Hong Kong for all kinds of plastic
products,
We export to worldwide markets. Recently , we join-ventured with a bag
factory in China produce all kinds of shopping , lady's , traveller's
bags.... visit our page and send us your enquiry by email now.
Contact Address :
Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong.
Telephone : ( 852 ) 2408 9382
"""

That is, all the "false positives" there are blatant spam.  It will
take a long time to sort this all out, but I want to make a point here
now:  the classifier works so well that it can *help* clean the ham
corpus!  I haven't found a non-spam among the "false positives" yet.
Another lesson reinforces one from my previous life in speech
recognition:  rigorous data collection, cleaning, tagging and
maintenance is crucial when working with statistical approaches, and
is damned expensive to do.

Here's the start of the first "false negative" (including the headers):

"""
Return-Path: <911@911.COM>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.) (61.48.16.65)
  by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <1604543-22002702894513952@smtp.vip.sina.com>
To: "NEW020515" <911@911.COM>
From: "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdatabase.net =A3=A9" <911@911.COM>
Subject: =D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www.itdatabase.net =A3=A9
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977

=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet
=A3=
=A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3=
=A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA=
=FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3
=B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8=
=B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1=
"""

Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier.  Since it's never seen these "words" before, it has no
reason to believe they're either spam or ham indicators, and favors
calling it ham.
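
Concretely, feeding one of those lines to the tokenizer above:

qp_line = "=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet"
tokens = list(tokenize_split(qp_line))
# -> a single never-before-seen "word":
# ['=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet']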

One more mondo cool thing and that's it for now.  The GrahamBayes
class keeps track of how many times each word makes it into the list
of the 15 strongest indicators.  These are the "killer clues" the
classifier gets the most value from.  The most valuable spam indicator
turned out to be "<br>" -- there's simply almost no HTML mail in the
ham archive (but note that this clue would be missed if you stripped
HTML!).  You're never going to guess what the most valuable non-spam
indicator was, but it's quite plausible after you see it.  Go ahead,
guess.  Chicken <wink>.
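
The bookkeeping behind that is simple; here's a sketch of the flavor
of it (not the actual GrahamBayes code -- it assumes per-word spam
probabilities have already been computed, Graham-style, and clamped
to [0.01, 0.99]):

MAX_DISCRIMINATORS = 15
killer_count = {}   # word -> # of msgs in which it made the 15-best list

def spamprob(wordprobs, msg_words):
    # Pick the 15 known words whose probability is farthest from 0.5,
    # note them as killer clues, and combine their probs Graham-style.
    known = [(abs(wordprobs[w] - 0.5), wordprobs[w], w)
             for w in set(msg_words) if w in wordprobs]
    known.sort()
    best = known[-MAX_DISCRIMINATORS:]
    prod = inverse_prod = 1.0
    for distance, prob, word in best:
        killer_count[word] = killer_count.get(word, 0) + 1
        prod *= prob
        inverse_prod *= 1.0 - prob
    return prod / (prod + inverse_prod)

The real code is fancier, but that's the flavor of where the counts in
the tables below come from.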

Here are the 15 most-used killer clues across the runs shown above:
the repr of the word, followed by the # of times it made it into the
15-best list, and the estimated probability that a msg is spam if it
contains this word:

    testing against Data/Spam/Set2 and Data/Ham/Set2
    best discriminators:
        'Helvetica,' 243 0.99
        'object' 245 0.01
        'language' 258 0.01
        '<BR>' 292 0.99
        '>' 339 0.179104
        'def' 397 0.01
        'article' 423 0.01
        'module' 436 0.01
        'import' 499 0.01
        '<br>' 652 0.99
        '>>>' 667 0.01
        'wrote' 677 0.01
        'python' 755 0.01
        'Python' 1947 0.01
        'wrote:' 1988 0.01

    testing against Data/Spam/Set3 and Data/Ham/Set3
    best discriminators:
        'string' 494 0.01
        'Helvetica,' 496 0.99
        'language' 524 0.01
        '<BR>' 553 0.99
        '>' 687 0.179104
        'article' 851 0.01
        'module' 857 0.01
        'def' 875 0.01
        'import' 1019 0.01
        '<br>' 1288 0.99
        '>>>' 1344 0.01
        'wrote' 1355 0.01
        'python' 1461 0.01
        'Python' 3858 0.01
        'wrote:' 3984 0.01

    testing against Data/Spam/Set4 and Data/Ham/Set4
    best discriminators:
        'object' 749 0.01
        'Helvetica,' 757 0.99
        'language' 763 0.01
        '<BR>' 877 0.99
        '>' 954 0.179104
        'article' 1240 0.01
        'module' 1260 0.01
        'def' 1364 0.01
        'import' 1517 0.01
        '<br>' 1765 0.99
        '>>>' 1999 0.01
        'wrote' 2071 0.01
        'python' 2160 0.01
        'Python' 5848 0.01
        'wrote:' 6021 0.01

    testing against Data/Spam/Set5 and Data/Ham/Set5
    best discriminators:
        'object' 980 0.01
        'language' 992 0.01
        'Helvetica,' 1005 0.99
        '<BR>' 1139 0.99
        '>' 1257 0.179104
        'article' 1678 0.01
        'module' 1702 0.01
        'def' 1846 0.01
        'import' 2003 0.01
        '<br>' 2387 0.99
        '>>>' 2624 0.01
        'wrote' 2743 0.01
        'python' 2864 0.01
        'Python' 7830 0.01
        'wrote:' 8060 0.01

Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list.  I've had
this argument with some of you before <wink>, but the best way to let
this kind of thing be as intelligent as it can be is not to try to
help it too much:  it will learn things you'll never dream of, provided
only you don't filter clues out in an attempt to be clever.

everything's-a-clue-ly y'rs  - tim