Setting this up has been a bitch. All early attempts foundered because it turned out there was *some* systematic difference between the ham and spam archives that made the job trivial.
The ham archive: I selected 20,000 messages at random from a python-list archive Barry put together, and broke them into 5 sets of 4,000 each. The archive contains only msgs posted after SpamAssassin was put into play on python.org, so it's hoped to be pretty clean, but nobody has checked all ~160,000+ msgs in it. As will be seen below, it's not clean enough.
The spam archive: This is essentially all of Bruce Guenter's 2002 spam collection, at http://www.em.ca/~bruceg/spam/. It was broken at random into 5 sets of 2,750 spams each.
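For concreteness, the split was nothing fancier than shuffling file names and dealing them out. A minimal sketch, where the directory layout and the one-message-per-file assumption are mine, not how it was actually done:

    import os, random

    def split_at_random(srcdir, destpattern, nsets, setsize):
        # Shuffle the message files, then deal them into nsets directories.
        names = os.listdir(srcdir)
        random.shuffle(names)
        for i in range(nsets):
            dest = destpattern % (i + 1)          # e.g. Data/Ham/Set1
            os.mkdir(dest)
            for name in names[i * setsize : (i + 1) * setsize]:
                os.rename(os.path.join(srcdir, name),
                          os.path.join(dest, name))

    # Hypothetical layout: 20,000 ham files dealt into 5 sets of 4,000.
    split_at_random('Data/Ham/reservoir', 'Data/Ham/Set%d', 5, 4000)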
+ Mailman added distinctive headers to every message in the ham archive, which appear nowhere in the spam archive. A Bayesian classifier picks up on that immediately.
+ Mailman also adds "[name-of-list]" to every Subject line.
+ The spam headers had tons of clues about Bruce Guenter's mailing addresses that appear nowhere in the ham headers.
+ The spam archive has Windows line ends (\r\n), but the ham archive has plain Unix \n. This turned out to be a killer clue(!) in the simplest character n-gram attempts. (Note: I can't use text mode to read msgs, because there are binary characters in the archives that Windows treats as EOF in text mode -- indeed, 400MB of the ham archive vanishes when read in text mode!)
What I'm reporting on here is after normalizing all line-ends to \n, and ignoring the headers *completely*. There are obviously good clues in the headers, the problem is that they're killer-good clues for accidental reasons in this test data. I don't want to write code to suppress these clues either, as then I'd be testing some mix of my insights (or lack thereof) with what a blind classifier would do. But I don't care how good I am, I only care about how well the algorithm does.
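The normalization itself is a two-line job once you commit to binary mode. A sketch of the kind of pass I mean (the in-place rewrite here is just for illustration):

    def normalize_lineends(path):
        # Binary mode is essential: in text mode, Windows treats an
        # embedded Ctrl-Z as EOF and silently truncates the file.
        f = file(path, 'rb')
        guts = f.read()
        f.close()
        f = file(path, 'wb')
        f.write(guts.replace('\r\n', '\n'))
        f.close()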
Since it's ignoring the headers, I think it's safe to view this as a lower bound on what can be achieved. There's another way this should be a lower bound:
    def tokenize_split(string):
        for w in string.split():
            yield w

    tokenize = tokenize_split

    class Msg(object):
        def __init__(self, dir, name):
            path = dir + "/" + name
            self.path = path
            f = file(path, 'rb')
            guts = f.read()
            f.close()
            # Skip the headers.
            i = guts.find('\n\n')
            if i >= 0:
                guts = guts[i+2:]
            self.guts = guts

        def __iter__(self):
            return tokenize(self.guts)
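For the record, that really is the entire analysis: whatever str.split() considers a word is a word. E.g. (made-up input):

    >>> list(tokenize_split("Click here   for\nFREE stuff!"))
    ['Click', 'here', 'for', 'FREE', 'stuff!']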
This is about the stupidest tokenizer imaginable, merely splitting the body on whitespace. Here's the output from the first run, training against one pair of spam+ham groups, then seeing how its predictions stack up against each of the four other pairs of spam+ham groups:
    Training on Data/Ham/Set1 and Data/Spam/Set1 ...
        4000 hams and 2750 spams
    testing against Data/Spam/Set2 and Data/Ham/Set2
        tested 4000 hams and 2750 spams
        false positive: 0.00725  (i.e., under 1%)
        false negative: 0.0530909090909  (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
        tested 4000 hams and 2750 spams
        false positive: 0.007
        false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
        tested 4000 hams and 2750 spams
        false positive: 0.0065
        false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
        tested 4000 hams and 2750 spams
        false positive: 0.00675
        false negative: 0.0516363636364
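In absolute terms, and assuming the tester computes simple ratios (which the digits above are consistent with), the Set2 run comes to 29 false positives and 146 false negatives:

    >>> print 29. / 4000     # 29 of 4000 hams called spam
    0.00725
    >>> print 146. / 2750    # 146 of 2750 spams called ham
    0.0530909090909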
It's a Good Sign that the false positive/negative rates are very close across the four test runs. It's possible to quantify just how good a sign that is, but they're so close by eyeball that there's no point in bothering.
This is using the new Tester.py in the sandbox, and that class automatically remembers the false positives and negatives. Here's the start of the first false positive from the first run:
""" It's not really hard!! Turn $6.00 into $1,000 or more...read this to find out how!! READING THIS COULD CHANGE YOUR LIFE!! I found this on a bulletin board anddecided to try it. A little while back, while chatting on the internet, I came across an article similar to this that said you could make thousands of dollars in cash within weeks with only an initial investment of $6.00! So I thought, "Yeah right, this must be a scam", but like most of us, I was curious, so I kept reading. Anyway, it said that you send $1.00 to each of the six names and address statedin the article. You then place your own name and address in the bottom of the list at #6, and post the article in at least 200 newsgroups (There are thousands) or e-mail them. No """
Call me forgiving, but I think it's vaguely possible that this should have been in the spam corpus instead <wink>.
Here's the start of the second false positive:
""" Please forward this message to anyone you know who is active in the stock market.
See Below for Press Release xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX
I am a normal investor same as you. I am not a finance professional nor am I connected to FDNI in any way.
I recently stumbled onto this OTC stock (FDNI) while searching through yahoo for small float, big potential stocks. At the time, the company had released a press release which stated they were doing a stock buyback. Intrigued, I bought 5,000 shares at $.75 each. The stock went to $1.50 and I sold my shares. I then bought them back at $1.15. The company then circulated another press release about a foreign acquisition (see below). The stock jumped to $2.75 (I sold @ $2.50 for a massive profit). I then bought back in at $1.25 where I am holding until the next major piece of news. """
Here's the start of the third:
""" Grand Treasure Industrial Limited
We are a manufacturer and exporter in Hong Kong for all kinds of plastic products, We export to worldwide markets. Recently , we join-ventured with a bag factory in China produce all kinds of shopping , lady's , traveller's bags.... visit our page and send us your enquiry by email now. Contact Address : Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan , Hong Kong. Telephone : ( 852 ) 2408 9382 """
That is, all the "false positives" there are blatant spam. It will take a long time to sort this all out, but I want to make a point here now: the classifier works so well that it can *help* clean the ham corpus! I haven't found a non-spam among the "false positives" yet. Another lesson reinforces one from my previous life in speech recognition: rigorous data collection, cleaning, tagging and maintenance is crucial when working with statistical approaches, and is damned expensive to do.
Here's the start of the first "false negative" (including the headers):
""" Return-Path: 911@911.COM Delivered-To: email@example.com Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000 Received: from unknown (HELO PC-5.) (188.8.131.52) by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000 x-esmtp: 0 0 1 Message-ID: firstname.lastname@example.org To: "NEW020515" 911@911.COM From: "ÖÐ¹úITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £©" 911@911.COM Subject: ÖÐ¹úITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £© Date: Sun, 28 Jul 2002 17:45:13 +0800 MIME-Version: 1.0 Content-type: text/plain; charset=gb2312 Content-Transfer-Encoding: quoted-printable Content-Length: 977
=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet =A3=
=A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3=
=A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA=
=FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3 =B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8=
=B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1=
"""
Since I'm ignoring the headers, and the tokenizer is just a whitespace split, each line of quoted-printable looks like a single word to the classifier. Since it's never seen these "words" before, it has no reason to believe they're either spam or ham indicators, and favors calling it ham.
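To see why, recall how Graham's scheme combines evidence: every word gets a spam probability clamped to [0.01, 0.99] (which is why the tables below are full of 0.01s and 0.99s), unknown words get a neutral-ish 0.4, and only the 15 probabilities farthest from 0.5 are combined. A message consisting entirely of never-seen "words" is scored from a pile of 0.4s and can't come near the 0.9 spam cutoff. Here's a bare-bones rendition of the combining step -- the constants are from Graham's "A Plan for Spam", but the code is my sketch, not GrahamBayes itself:

    UNKNOWN_WORD_PROB = 0.4   # what a never-before-seen word scores
    SPAM_CUTOFF = 0.9

    def spamprob(wordprobs, words):
        # Keep the 15 probabilities farthest from the neutral 0.5.
        probs = [wordprobs.get(w, UNKNOWN_WORD_PROB) for w in words]
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        del probs[15:]
        # Naive-Bayes combination of the surviving probabilities.
        prod = inverse_prod = 1.0
        for p in probs:
            prod *= p
            inverse_prod *= 1.0 - p
        return prod / (prod + inverse_prod)

    # 15 unknown words: 0.4**15 / (0.4**15 + 0.6**15) ~= 0.0023 -- "ham".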
One more mondo cool thing and that's it for now. The GrahamBayes class keeps track of how many times each word makes it into the list of the 15 strongest indicators. These are the "killer clues" the classifier gets the most value from. The most valuable spam indicator turned out to be "<br>" -- there's simply almost no HTML mail in the ham archive (but note that this clue would be missed if you stripped HTML!). You're never going to guess what the most valuable non-spam indicator was, but it's quite plausible after you see it. Go ahead, guess. Chicken <wink>.
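The bookkeeping for that is nothing more than a per-word counter bumped whenever a word survives into some message's 15-best list; roughly (hypothetical names, not GrahamBayes's actual internals):

    killcount = {}   # word -> # of times it made a 15-best list

    def note_discriminators(best15):
        # best15: the 15 strongest indicators found for one message.
        for word in best15:
            killcount[word] = killcount.get(word, 0) + 1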
Here are the 15 most-used killer clues across the runs shown above: the repr of the word, followed by the # of times it made it into the 15-best list, and the estimated probability that a msg is spam if it contains this word:
    testing against Data/Spam/Set2 and Data/Ham/Set2
    best discriminators:
        'Helvetica,' 243 0.99
        'object' 245 0.01
        'language' 258 0.01
        '<BR>' 292 0.99
        '>' 339 0.179104
        'def' 397 0.01
        'article' 423 0.01
        'module' 436 0.01
        'import' 499 0.01
        '<br>' 652 0.99
        '>>>' 667 0.01
        'wrote' 677 0.01
        'python' 755 0.01
        'Python' 1947 0.01
        'wrote:' 1988 0.01

    testing against Data/Spam/Set3 and Data/Ham/Set3
    best discriminators:
        'string' 494 0.01
        'Helvetica,' 496 0.99
        'language' 524 0.01
        '<BR>' 553 0.99
        '>' 687 0.179104
        'article' 851 0.01
        'module' 857 0.01
        'def' 875 0.01
        'import' 1019 0.01
        '<br>' 1288 0.99
        '>>>' 1344 0.01
        'wrote' 1355 0.01
        'python' 1461 0.01
        'Python' 3858 0.01
        'wrote:' 3984 0.01

    testing against Data/Spam/Set4 and Data/Ham/Set4
    best discriminators:
        'object' 749 0.01
        'Helvetica,' 757 0.99
        'language' 763 0.01
        '<BR>' 877 0.99
        '>' 954 0.179104
        'article' 1240 0.01
        'module' 1260 0.01
        'def' 1364 0.01
        'import' 1517 0.01
        '<br>' 1765 0.99
        '>>>' 1999 0.01
        'wrote' 2071 0.01
        'python' 2160 0.01
        'Python' 5848 0.01
        'wrote:' 6021 0.01

    testing against Data/Spam/Set5 and Data/Ham/Set5
    best discriminators:
        'object' 980 0.01
        'language' 992 0.01
        'Helvetica,' 1005 0.99
        '<BR>' 1139 0.99
        '>' 1257 0.179104
        'article' 1678 0.01
        'module' 1702 0.01
        'def' 1846 0.01
        'import' 2003 0.01
        '<br>' 2387 0.99
        '>>>' 2624 0.01
        'wrote' 2743 0.01
        'python' 2864 0.01
        'Python' 7830 0.01
        'wrote:' 8060 0.01
Note that an "intelligent" tokenizer would likely miss that the Python prompt ('>>>') is a great non-spam indicator on python-list. I've had this argument with some of you before <wink>, but the best way to let this kind of thing be as intelligent as it can be is not to try to help it too much: it will learn things you'll never dream of, provided only that you don't filter clues out in an attempt to be clever.
everything's-a-clue-ly y'rs - tim