[Python-Dev] The first trustworthy <wink> GBayes results
Tim Peters
tim.one@comcast.net
Tue, 27 Aug 2002 22:36:17 -0400
Setting this up has been a bitch. All early attempts floundered because it
turned out there was *some* systematic difference between the ham and spam
archives that made the job trivial.
The ham archive: I selected 20,000 messages, and broke them into 5 sets of
4,000 each, at random, from a python-list archive Barry put together,
containing msgs only after SpamAssassin was put into play on python.org.
It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs. As
will be seen below, it's not clean enough.
The spam archive: This is essentially all of Bruce Guenter's 2002 spam
collection, at <http://www.em.ca/~bruceg/spam/>. It was broken at random
into 5 sets of 2,750 spams each.
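For concreteness, one simple way to make a random 5-way split like that
(this helper and its arguments are illustrative, not the code actually
used) is to shuffle the file names and deal them out:

import os
import random

def split_into_sets(src_dir, n_sets=5):
    # Shuffle the file names, then deal them into n_sets equal groups,
    # e.g. to populate the Set1 ... Set5 directories.
    names = os.listdir(src_dir)
    random.shuffle(names)
    size = len(names) // n_sets
    return [names[i * size:(i + 1) * size] for i in range(n_sets)]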
Problems included:

+ Mailman added distinctive headers to every message in the ham
  archive, which appear nowhere in the spam archive. A Bayesian
  classifier picks up on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing
  addresses that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive has
  plain Unix \n. This turned out to be a killer clue(!) in the simplest
  character n-gram attempts. (Note: I can't use text mode to read
  msgs, because there are binary characters in the archives that
  Windows treats as EOF in text mode -- indeed, 400MB of the ham
  archive vanishes when read in text mode!) A sketch of the binary-mode
  read and the line-end normalization follows this list.
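The path below is invented, and this is only a sketch of that workaround,
not the actual test code:

# Read in binary mode so a stray Ctrl-Z (0x1A) byte can't cut the read
# short on Windows, then normalize Windows line ends so \r\n itself
# stops being a (bogus) spam clue.
f = file("Data/Spam/Set1/msg00001.txt", 'rb')   # illustrative path
guts = f.read()
f.close()
guts = guts.replace('\r\n', '\n')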
What I'm reporting on here is after normalizing all line-ends to \n, and
ignoring the headers *completely*. There are obviously good clues in the
headers; the problem is that they're killer-good clues for accidental
reasons in this test data. I don't want to write code to suppress these
clues either, as then I'd be testing some mix of my insights (or lack
thereof) with what a blind classifier would do. But I don't care how good I
am, I only care about how well the algorithm does.
Since it's ignoring the headers, I think it's safe to view this as a lower
bound on what can be achieved. There's another way this should be a lower
bound:
def tokenize_split(string):
    for w in string.split():
        yield w

tokenize = tokenize_split

class Msg(object):
    def __init__(self, dir, name):
        path = dir + "/" + name
        self.path = path
        f = file(path, 'rb')
        guts = f.read()
        f.close()
        # Skip the headers.
        i = guts.find('\n\n')
        if i >= 0:
            guts = guts[i+2:]
        self.guts = guts

    def __iter__(self):
        return tokenize(self.guts)
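For what it's worth, iterating a Msg is all a test driver needs; e.g. (the
directory and file name here are made up, and this isn't the driver code):

# Count how often each body token appears in one message.
counts = {}
for word in Msg("Data/Ham/Set1", "msg00001.txt"):
    counts[word] = counts.get(word, 0) + 1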
This is about the stupidest tokenizer imaginable, merely splitting the body
on whitespace. Here's the output from the first run, training against one
pair of spam+ham groups, then seeing how its predictions stack up against
each of the four other pairs of spam+ham groups:
Training on Data/Ham/Set1 and Data/Spam/Set1 ... 4000 hams and 2750 spams
testing against Data/Spam/Set2 and Data/Ham/Set2
tested 4000 hams and 2750 spams
false positive: 0.00725 (i.e., under 1%)
false negative: 0.0530909090909 (i.e., over 5%)
testing against Data/Spam/Set3 and Data/Ham/Set3
tested 4000 hams and 2750 spams
false positive: 0.007
false negative: 0.056
testing against Data/Spam/Set4 and Data/Ham/Set4
tested 4000 hams and 2750 spams
false positive: 0.0065
false negative: 0.0545454545455
testing against Data/Spam/Set5 and Data/Ham/Set5
tested 4000 hams and 2750 spams
false positive: 0.00675
false negative: 0.0516363636364
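To put those rates in absolute terms, for the Set2 run that's
0.00725 * 4000 = 29 hams scored as spam, and 0.0530909 * 2750 = 146 spams
scored as ham.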
It's a Good Sign that the false positive/negative rates are very close
across the four test runs. It's possible to quantify just how good a sign
that is, but they're so close by eyeball that there's no point in bothering.

This is using the new Tester.py in the sandbox, and that class automatically
remembers the false positives and negatives. Here's the start of the first
false positive from the first run:
"""
It's not really hard!!
Turn $6.00 into $1,000 or more...read this to find out how!! READING
THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board
anddecided
to try it. A little while back, while chatting on the internet, I came
across an article
similar to this that said you could make thousands of dollars in cash
within weeks
with only an initial investment of $6.00! So I thought, "Yeah right,
this must be a scam", but like most of us, I was curious, so I kept
reading. Anyway,
it said that you send $1.00 to each of the six names and address
statedin the
article. You then place your own name and address in the bottom of the
list at #6, and
post the article in at least 200 newsgroups (There are thousands) or
e-mail them. No
"""
Call me forgiving, but I think it's vaguely possible that this should have
been in the spam corpus instead <wink>.
Here's the start of the second false positive:
"""
Please forward this message to anyone you know who is active in the stock
market.
See Below for Press Release
xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX
Dear Friends,
I am a normal investor same as you. I am not a finance professional nor am
I connected to FDNI in any way.
I recently stumbled onto this OTC stock (FDNI) while searching through yahoo
for small float, big potential stocks. At the time, the company had released
a press release which stated they were doing a stock buyback. Intrigued, I
bought 5,000 shares at $.75 each. The stock went to $1.50 and I sold my
shares. I then bought them back at $1.15. The company then circulated
another press release about a foreign acquisition (see below). The stock
jumped to $2.75 (I sold @ $2.50 for a massive profit). I then bought back
in at $1.25 where I am holding until the next major piece of news.
"""
Here's the start of the third:
"""
Grand Treasure Industrial Limited
Contact Information
We are a manufacturer and exporter in Hong Kong for all kinds of plastic
products,
We export to worldwide markets. Recently , we join-ventured with a bag
factory in China produce all kinds of shopping , lady's , traveller's
bags.... visit our page and send us your enquiry by email now.
Contact Address :
Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong.
Telephone : ( 852 ) 2408 9382
"""
That is, all the "false positives" there are blatant spam. It will take a
long time to sort this all out, but I want to make a point here now: the
classifier works so well that it can *help* clean the ham corpus! I haven't
found a non-spam among the "false positives" yet. Another lesson reinforces
one from my previous life in speech recognition: rigorous data collection,
cleaning, tagging and maintenance is crucial when working with statistical
approaches, and is damned expensive to do.
Here's the start of the first "false negative" (including the headers):
"""
Return-Path: <911@911.COM>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.) (61.48.16.65)
by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <1604543-22002702894513952@smtp.vip.sina.com>
To: "NEW020515" <911@911.COM>
From: "中国IT数据库网站（www.itdatabase.net ）" <911@911.COM>
Subject: 中国IT数据库网站（www.itdatabase.net ）
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977
=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet
=A3=
=A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3=
=A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA=
=FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3
=B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8=
=B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1=
"""
Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier. Since it's never seen these "words" before, it has no reason to
believe they're either spam or ham indicators, and favors calling it ham.
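An obvious refinement (not done in these runs) is to undo the content
transfer encoding before tokenizing, so the tokenizer sees real words
instead of one opaque "word" per encoded line. The standard quopri module
can do the decoding; roughly (the helper and its arguments here are
illustrative):

import quopri

def decoded_body(guts, encoding):
    # Decode a quoted-printable body before handing it to the tokenizer.
    if encoding.lower() == 'quoted-printable':
        return quopri.decodestring(guts)
    return guts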
One more mondo cool thing and that's it for now. The GrahamBayes class
keeps track of how many times each word makes it into the list of the 15
strongest indicators. These are the "killer clues" the classifier gets the
most value from. The most valuable spam indicator turned out to be
"<br>" -- there's simply almost no HTML mail in the ham archive (but note
that this clue would be missed if you stripped HTML!). You're never going
to guess what the most valuable non-spam indicator was, but it's quite
plausible after you see it. Go ahead, guess. Chicken <wink>.
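For reference, here's the Graham-style combining step that the 15-best list
comes from, boiled down to a sketch -- a condensation of the scheme, not the
sandbox's GrahamBayes code, though the 0.01/0.99 clamping and the 15-clue
cutoff match the numbers in the table below:

def spamprob(word_probs, words):
    # word_probs maps word -> P(spam | word), already clamped to
    # [0.01, 0.99] during training; unseen words count as a neutral 0.5.
    probs = [word_probs.get(w, 0.5) for w in words]
    # Keep the 15 probabilities farthest from 0.5 -- the "killer clues" --
    # via decorate-sort-undecorate.
    decorated = [(abs(p - 0.5), p) for p in probs]
    decorated.sort()
    decorated.reverse()
    strongest = [p for distance, p in decorated[:15]]
    prod = inv_prod = 1.0
    for p in strongest:
        prod *= p
        inv_prod *= 1.0 - p
    # Combined probability that the message is spam.
    return prod / (prod + inv_prod)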
Here are the 15 most-used killer clues across the runs shown above: the
repr of the word, followed by the # of times it made it into the 15-best
list, and the estimated probability that a msg is spam if it contains this
word:
testing against Data/Spam/Set2 and Data/Ham/Set2
best discrimators:
'Helvetica,' 243 0.99
'object' 245 0.01
'language' 258 0.01
'<BR>' 292 0.99
'>' 339 0.179104
'def' 397 0.01
'article' 423 0.01
'module' 436 0.01
'import' 499 0.01
'<br>' 652 0.99
'>>>' 667 0.01
'wrote' 677 0.01
'python' 755 0.01
'Python' 1947 0.01
'wrote:' 1988 0.01
testing against Data/Spam/Set3 and Data/Ham/Set3
best discrimators:
'string' 494 0.01
'Helvetica,' 496 0.99
'language' 524 0.01
'<BR>' 553 0.99
'>' 687 0.179104
'article' 851 0.01
'module' 857 0.01
'def' 875 0.01
'import' 1019 0.01
'<br>' 1288 0.99
'>>>' 1344 0.01
'wrote' 1355 0.01
'python' 1461 0.01
'Python' 3858 0.01
'wrote:' 3984 0.01
testing against Data/Spam/Set4 and Data/Ham/Set4
best discrimators:
'object' 749 0.01
'Helvetica,' 757 0.99
'language' 763 0.01
'<BR>' 877 0.99
'>' 954 0.179104
'article' 1240 0.01
'module' 1260 0.01
'def' 1364 0.01
'import' 1517 0.01
'<br>' 1765 0.99
'>>>' 1999 0.01
'wrote' 2071 0.01
'python' 2160 0.01
'Python' 5848 0.01
'wrote:' 6021 0.01
testing against Data/Spam/Set5 and Data/Ham/Set5
best discrimators:
'object' 980 0.01
'language' 992 0.01
'Helvetica,' 1005 0.99
'<BR>' 1139 0.99
'>' 1257 0.179104
'article' 1678 0.01
'module' 1702 0.01
'def' 1846 0.01
'import' 2003 0.01
'<br>' 2387 0.99
'>>>' 2624 0.01
'wrote' 2743 0.01
'python' 2864 0.01
'Python' 7830 0.01
'wrote:' 8060 0.01
Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list. I've had this
argument with some of you before <wink>, but the best way to let this kind
of thing be as intelligent as it can be is not to try to help it too much:
it will learn things you'll never dream of, provided only you don't filter
clues out in an attempt to be clever.
everything's-a-clue-ly y'rs - tim