The first trustworthy <wink> GBayes results
Setting this up has been a bitch.  All early attempts floundered because it
turned out there was *some* systematic difference between the ham and spam
archives that made the job trivial.

The ham archive:  I selected 20,000 messages, and broke them into 5 sets of
4,000 each, at random, from a python-list archive Barry put together,
containing msgs only after SpamAssassin was put into play on python.org.
It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs.  As
will be seen below, it's not clean enough.

The spam archive:  This is essentially all of Bruce Guenter's 2002 spam
collection, at <http://www.em.ca/~bruceg/spam/>.  It was broken at random
into 5 sets of 2,750 spams each.

Problems included:

+ Mailman added distinctive headers to every message in the ham archive,
  which appear nowhere in the spam archive.  A Bayesian classifier picks up
  on that immediately.

+ Mailman also adds "[name-of-list]" to every Subject line.

+ The spam headers had tons of clues about Bruce Guenter's mailing addresses
  that appear nowhere in the ham headers.

+ The spam archive has Windows line ends (\r\n), but the ham archive plain
  Unix \n.  This turned out to be a killer clue(!) in the simplest character
  n-gram attempts.  (Note:  I can't use text mode to read msgs, because
  there are binary characters in the archives that Windows treats as EOF in
  text mode -- indeed, 400MB of the ham archive vanishes when read in text
  mode!)

What I'm reporting on here is after normalizing all line-ends to \n, and
ignoring the headers *completely*.  There are obviously good clues in the
headers, the problem is that they're killer-good clues for accidental
reasons in this test data.  I don't want to write code to suppress these
clues either, as then I'd be testing some mix of my insights (or lack
thereof) with what a blind classifier would do.  But I don't care how good I
am, I only care about how well the algorithm does.  Since it's ignoring the
headers, I think it's safe to view this as a lower bound on what can be
achieved.  There's another way this should be a lower bound:

    def tokenize_split(string):
        for w in string.split():
            yield w

    tokenize = tokenize_split

    class Msg(object):
        def __init__(self, dir, name):
            path = dir + "/" + name
            self.path = path
            f = file(path, 'rb')
            guts = f.read()
            f.close()
            # Skip the headers.
            i = guts.find('\n\n')
            if i >= 0:
                guts = guts[i+2:]
            self.guts = guts

        def __iter__(self):
            return tokenize(self.guts)

This is about the stupidest tokenizer imaginable, merely splitting the body
on whitespace.  Here's the output from the first run, training against one
pair of spam+ham groups, then seeing how its predictions stack up against
each of the four other pairs of spam+ham groups:

    Training on Data/Ham/Set1 and Data/Spam/Set1 ...
        4000 hams and 2750 spams

    testing against Data/Spam/Set2 and Data/Ham/Set2
    tested 4000 hams and 2750 spams
    false positive: 0.00725  (i.e., under 1%)
    false negative: 0.0530909090909  (i.e., over 5%)

    testing against Data/Spam/Set3 and Data/Ham/Set3
    tested 4000 hams and 2750 spams
    false positive: 0.007
    false negative: 0.056

    testing against Data/Spam/Set4 and Data/Ham/Set4
    tested 4000 hams and 2750 spams
    false positive: 0.0065
    false negative: 0.0545454545455

    testing against Data/Spam/Set5 and Data/Ham/Set5
    tested 4000 hams and 2750 spams
    false positive: 0.00675
    false negative: 0.0516363636364

It's a Good Sign that the false positive/negative rates are very close
across the four test runs.
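For readers without Paul Graham's "A Plan for Spam" at hand, here is a
minimal sketch of the kind of per-word probability estimate the GrahamBayes
class is built around.  The doubled ham count, the 0.01/0.99 clamps and the
0.5 default for never-seen words are assumptions taken from Graham's essay
and from the numbers quoted later in this message; the sandbox code may
differ in detail:

    def word_spamprob(spamcount, hamcount, nspam, nham,
                      unknown=0.5, lo=0.01, hi=0.99):
        # Estimate P(spam | word) from training counts.  The ham count is
        # doubled to bias against false positives, and the result is
        # clamped to [lo, hi]; a never-seen word gets the neutral value.
        if spamcount == 0 and hamcount == 0:
            return unknown
        spamratio = min(1.0, spamcount / float(nspam))
        hamratio = min(1.0, 2.0 * hamcount / float(nham))
        prob = spamratio / (spamratio + hamratio)
        return min(hi, max(lo, prob))

For example, word_spamprob(0, 120, 2750, 4000) comes out at the 0.01 floor,
which is why on-topic words like 'wrote:' and 'Python' show up below with an
estimated probability of exactly 0.01.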
It's possible to quantify just how good a sign that is, but they're so close
by eyeball that there's no point in bothering.  This is using the new
Tester.py in the sandbox, and that class automatically remembers the false
positives and negatives.  Here's the start of the first false positive from
the first run:

"""
It's not really hard!!
Turn $6.00 into $1,000 or more...read this to find out how!!

READING THIS COULD CHANGE YOUR LIFE!!  I found this on a bulletin board
anddecided to try it.  A little while back, while chatting on the internet,
I came across an article similar to this that said you could make thousands
of dollars in cash within weeks with only an initial investment of $6.00!
So I thought, "Yeah right, this must be a scam", but like most of us, I was
curious, so I kept reading.  Anyway, it said that you send $1.00 to each of
the six names and address statedin the article.  You then place your own
name and address in the bottom of the list at #6, and post the article in at
least 200 newsgroups (There are thousands) or e-mail them.  No
"""

Call me forgiving, but I think it's vaguely possible that this should have
been in the spam corpus instead <wink>.  Here's the start of the second
false positive:

"""
Please forward this message to anyone you know who is active in the stock
market.  See Below for Press Release

xXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX

Dear Friends,

I am a normal investor same as you.  I am not a finance professional nor am
I connected to FDNI in any way.  I recently stumbled onto this OTC stock
(FDNI) while searching through yahoo for small float, big potential stocks.
At the time, the company had released a press release which stated they were
doing a stock buyback.  Intrigued, I bought 5,000 shares at $.75 each.  The
stock went to $1.50 and I sold my shares.  I then bought them back at $1.15.
The company then circulated another press release about a foreign
acquisition (see below).  The stock jumped to $2.75 (I sold @ $2.50 for a
massive profit).  I then bought back in at $1.25 where I am holding until
the next major piece of news.
"""

Here's the start of the third:

"""
Grand Treasure Industrial Limited

Contact Information

We are a manufacturer and exporter in Hong Kong for all kinds of plastic
products, We export to worldwide markets.  Recently , we join-ventured with
a bag factory in China produce all kinds of shopping , lady's , traveller's
bags....  visit our page and send us your enquiry by email now.

Contact Address : Rm. 1905, Asian Trade Centre , 79 Lei Muk Rd, Tsuen Wan ,
Hong Kong.
Telephone : ( 852 ) 2408 9382
"""

That is, all the "false positives" there are blatant spam.  It will take a
long time to sort this all out, but I want to make a point here now:  the
classifier works so well that it can *help* clean the ham corpus!  I haven't
found a non-spam among the "false positives" yet.  Another lesson reinforces
one from my previous life in speech recognition:  rigorous data collection,
cleaning, tagging and maintenance is crucial when working with statistical
approaches, and is damned expensive to do.

Here's the start of the first "false negative" (including the headers):

"""
Return-Path: <911@911.COM>
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 24322 invoked from network); 28 Jul 2002 12:51:44 -0000
Received: from unknown (HELO PC-5.)
  (61.48.16.65) by churchill.factcomp.com with SMTP; 28 Jul 2002 12:51:44 -0000
x-esmtp: 0 0 1
Message-ID: <1604543-22002702894513952@smtp.vip.sina.com>
To: "NEW020515" <911@911.COM>
From: "ÖйúITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £©" <911@911.COM>
Subject: ÖйúITÊý¾Ý¿âÍøÕ¾£¨www.itdatabase.net £©
Date: Sun, 28 Jul 2002 17:45:13 +0800
MIME-Version: 1.0
Content-type: text/plain; charset=gb2312
Content-Transfer-Encoding: quoted-printable
Content-Length: 977

=D6=D0=B9=FAIT=CA=FD=BE=DD=BF=E2=CD=F8=D5=BE=A3=A8www=2Eitdatabase=2Enet =A3=
=A9=CC=E1=B9=A9=B4=F3=C1=BF=D3=D0=B9=D8=D6=D0=B9=FAIT/=CD=A8=D0=C5=CA=D0=B3=
=A1=D2=D4=BC=B0=C8=AB=C7=F2IT/=CD=A8=D0=C5=CA=D0=B3=A1=B5=C4=CF=E0=B9=D8=CA=
=FD=BE=DD=BA=CD=B7=D6=CE=F6=A1=A3 =B1=BE=CD=F8=D5=BE=C9=E6=BC=B0=D3=D0=B9=D8=
=B5=E7=D0=C5=D4=CB=D3=AA=CA=D0=B3=A1=A1=A2=B5=E7=D0=C5=D4=CB=D3=AA=C9=CC=A1=
"""

Since I'm ignoring the headers, and the tokenizer is just a whitespace
split, each line of quoted-printable looks like a single word to the
classifier.  Since it's never seen these "words" before, it has no reason to
believe they're either spam or ham indicators, and favors calling it ham.

One more mondo cool thing and that's it for now.  The GrahamBayes class
keeps track of how many times each word makes it into the list of the 15
strongest indicators.  These are the "killer clues" the classifier gets the
most value from.  The most valuable spam indicator turned out to be "<br>"
-- there's simply almost no HTML mail in the ham archive (but note that this
clue would be missed if you stripped HTML!).

You're never going to guess what the most valuable non-spam indicator was,
but it's quite plausible after you see it.  Go ahead, guess.  Chicken
<wink>.

Here are the 15 most-used killer clues across the runs shown above:  the
repr of the word, followed by the # of times it made it into the 15-best
list, and the estimated probability that a msg is spam if it contains this
word:

testing against Data/Spam/Set2 and Data/Ham/Set2
best discrimators:
    'Helvetica,'  243  0.99
    'object'      245  0.01
    'language'    258  0.01
    '<BR>'        292  0.99
    '>'           339  0.179104
    'def'         397  0.01
    'article'     423  0.01
    'module'      436  0.01
    'import'      499  0.01
    '<br>'        652  0.99
    '>>>'         667  0.01
    'wrote'       677  0.01
    'python'      755  0.01
    'Python'     1947  0.01
    'wrote:'     1988  0.01

testing against Data/Spam/Set3 and Data/Ham/Set3
best discrimators:
    'string'      494  0.01
    'Helvetica,'  496  0.99
    'language'    524  0.01
    '<BR>'        553  0.99
    '>'           687  0.179104
    'article'     851  0.01
    'module'      857  0.01
    'def'         875  0.01
    'import'     1019  0.01
    '<br>'       1288  0.99
    '>>>'        1344  0.01
    'wrote'      1355  0.01
    'python'     1461  0.01
    'Python'     3858  0.01
    'wrote:'     3984  0.01

testing against Data/Spam/Set4 and Data/Ham/Set4
best discrimators:
    'object'      749  0.01
    'Helvetica,'  757  0.99
    'language'    763  0.01
    '<BR>'        877  0.99
    '>'           954  0.179104
    'article'    1240  0.01
    'module'     1260  0.01
    'def'        1364  0.01
    'import'     1517  0.01
    '<br>'       1765  0.99
    '>>>'        1999  0.01
    'wrote'      2071  0.01
    'python'     2160  0.01
    'Python'     5848  0.01
    'wrote:'     6021  0.01

testing against Data/Spam/Set5 and Data/Ham/Set5
best discrimators:
    'object'      980  0.01
    'language'    992  0.01
    'Helvetica,' 1005  0.99
    '<BR>'       1139  0.99
    '>'          1257  0.179104
    'article'    1678  0.01
    'module'     1702  0.01
    'def'        1846  0.01
    'import'     2003  0.01
    '<br>'       2387  0.99
    '>>>'        2624  0.01
    'wrote'      2743  0.01
    'python'     2864  0.01
    'Python'     7830  0.01
    'wrote:'     8060  0.01

Note that an "intelligent" tokenizer would likely miss that the Python
prompt ('>>>') is a great non-spam indicator on python-list.
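The quoted-printable problem goes away if the body is decoded before
tokenizing.  Here is a minimal sketch of the idea using today's Python 3
email API (not the 2002-era code above, which deliberately skips MIME
handling altogether); treat it as an illustration, not the sandbox
implementation:

    import email

    def decoded_body(raw_bytes):
        """Return the decoded text of every text/* part, joined together."""
        msg = email.message_from_bytes(raw_bytes)
        chunks = []
        for part in msg.walk():
            if part.get_content_maintype() != 'text':
                continue
            # get_payload(decode=True) undoes quoted-printable and base64.
            payload = part.get_payload(decode=True) or b''
            charset = part.get_content_charset() or 'latin-1'
            chunks.append(payload.decode(charset, errors='replace'))
        return '\n'.join(chunks)

    def tokenize(raw_bytes):
        for w in decoded_body(raw_bytes).split():
            yield w

With the transfer encoding undone, the gb2312 message above would at least
produce individual words the classifier could learn from, instead of one
giant never-seen "word" per line.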
I've had this argument with some of you before <wink>, but the best way to let this kind of thing be as intelligent as it can be is not to try to help it too much: it will learn things you'll never dream of, provided only you don't filter clues out in an attempt to be clever. everything's-a-clue-ly y'rs - tim
On 27 August 2002, Tim Peters said:
Setting this up has been a bitch. All early attempts floundered because it turned out there was *some* systematic difference between the ham and spam archives that made the job trivial.
The ham archive: I selected 20,000 messages, and broke them into 5 sets of 4,000 each, at random, from a python-list archive Barry put together, containing msgs only after SpamAssassin was put into play on python.org. It's hoped that's pretty clean, but nobody checked all ~= 160,000+ msgs. As will be seen below, it's not clean enough.
One of the other perennial-seeming topics on spamassassin-devel (a list that I follow only sporadically) is that careful manual cleaning of your corpus is *essential*. The concern of the main SA developers is that spam in your non-spam folder (and vice-versa) will prejudice the genetic algorithm that evolves SA's scores in the wrong direction. Gut instinct tells me the Bayesian approach ought to be more robust against this sort of thing, but even it must have a breaking point at which misclassified messages throw off the probabilities. But that's entirely consistent with your statement:
Another lesson reinforces one from my previous life in speech recognition: rigorous data collection, cleaning, tagging and maintenance is crucial when working with statistical approaches, and is damned expensive to do.
On corpus collection...
The spam archive: This is essentially all of Bruce Guenter's 2002 spam collection, at <http://www.em.ca/~bruceg/spam/>. It was broken at random into 5 sets of 2,750 spams each.
One possibility occurs to me: we could build our own corpus by collecting
spam on python.org for a few weeks.  Here's a rough breakdown of mail
rejected by mail.python.org over the last 10 days, eyeball-estimated
messages per day:

    bad RCPT                        150 - 300  [1]
    bad sender                       50 - 190  [2]
    relay denied                     20 - 180  [3]
    known spammer addr/domain        15 -  60
    8-bit chars in subject          130 - 200
    8-bit chars in header addrs      10 -  60
    banned charset in subject         5 -  50  [4]
    "ADV" in subject                  0 -   5
    no Message-Id header            100 - 400  [5]
    invalid header address syntax     5 -  50  [6]
    no valid senders in header       10 -  15  [7]
    rejected by SpamAssassin         20 -  50  [8]
    quarantined by SpamAssassin       5 -  50  [8]

[1] this includes mail accidentally sent to eg. giudo@python.org, but based
    on scanning the reject logs, I'd say the vast majority is spam.
    However, such messages are rejected after RCPT TO, so we never see the
    message itself.  Most of the bad recipient addrs are either ancient
    (ipc6@python.org, grail-feedback@python.org) or fictitious
    (success@python.org, info@python.org).

[2] sender verification failed, eg. someone tried to claim an envelope
    sender like foo@bogus.domain.  Usually spam, but innocent bystanders can
    be hit by DNS servers suddenly exploding (hello, comcast.net).  This
    only includes hard failures (DNS "no such domain"), not soft failures
    (DNS timeout).

[3] I'd be leery of accepting mail that's trying to hijack mail.python.org
    as an open relay, even though that would be a goldmine of spam.  (OTOH,
    we could reject after the DATA command, and save the message anyways.)

[4] mail.python.org rejects any message with a properly MIME-encoded subject
    using any of the following charsets: big5, euc-kr, gb2312,
    ks_c_5601-1987

[5] includes viruses as well as spam (and no doubt some innocent false
    positives, although I have added exemptions for the MUA/MTA combinations
    that most commonly result in legit mail reaching mail.python.org without
    a Message-Id header, eg. KMail/qmail)

[6] eg. "To: all my friends" or "From: <>"

[7] no valid sender address in any header line -- eg. someone gives a valid
    MAIL FROM address, but then puts "From: blah@bogus.domain" in the
    headers.  Easily defeated with a "Sender" or "Reply-to" header.

[8] any message scoring >= 10.0 is rejected at SMTP time; any message
    scoring >= 5.0 but < 10 is saved in /var/mail/spam for later review

Executive summary:

  * it's a good thing we do all those easy checks before involving SA, or
    the load on the server would be a lot higher

  * give me 10 days of spam-harvesting, and I can equal Bruce Guenter's spam
    archive for 2002.  (Of course, it'll take a couple of days to set the
    mail server up for the harvesting, and a couple more days to clean
    through the ~2000 caught messages, but you get the idea.)
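To make the "banned charset in subject" check above concrete: the real
checks run inside the MTA, so the Python below is only a hedged sketch of
the same rule, with a hypothetical is_rejectable_subject helper and the
charset list taken from note [4]:

    from email.header import decode_header

    BANNED_CHARSETS = {'big5', 'euc-kr', 'gb2312', 'ks_c_5601-1987'}

    def is_rejectable_subject(subject):
        """True if the Subject is MIME-encoded in a banned charset, or
        contains raw 8-bit bytes."""
        for text, charset in decode_header(subject):
            if charset and charset.lower() in BANNED_CHARSETS:
                return True
            if isinstance(text, bytes) and any(b > 127 for b in text):
                return True
        return False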
+ Mailman added distinctive headers to every message in the ham archive, which appear nowhere in the spam archive. A Bayesian classifier picks up on that immediately.
+ Mailman also adds "[name-of-list]" to every Subject line.
Perhaps that spam-harvesting run should also set aside a random selection of apparently-non-spam messages received at the same time. Then you'd have a corpus of mail sent to the same server, more-or-less to the same addresses, over the same period of time. Oh, any custom corpus should also include the ~300 false positives and ~600 false negatives gathered since SA started running on mail.python.org in April. Greg
[Greg Ward]
One of the other perennial-seeming topics on spamassassin-devel (a list that I follow only sporodically) is that careful manual cleaning of your corpus is *essential*. The concern of the main SA developers is that spam in your non-spam folder (and vice-versa) will prejudice the genetic algorithm that evolves SA's scores in the wrong direction. Gut instinct tells me the Bayesian approach ought to be more robust against this sort of thing, but even it must have a breaking point at which misclassified messages throw off the probabilities.
Like all other questions <wink>, this can be quantified if someone is willing to do the grunt work of setting up, running, and analyzing appropriate experiments. This kind of algorithm is generally quite robust against disaster, but note that even tiny changes in accuracy rates can have a large effect on *you*: say that 99% of the time the system says a thing is spam, it really is. Then say that degrades by a measly 1%: 99% falls to 98%. From *your* POV this is huge, because the error rate has actually doubled (from 1% wrong to 2% wrong: you've got twice as many false positives to deal with). So the scheme has an ongoing need for accurate human training (spam changes, list topics change, list members change, etc; the system needs an ongoing random sample of both new spam and new non-spam to adapt).
... One possibility occurs to me: we could build our own corpus by collecting spam on python.org for a few weeks.
Simpler is better: as you suggested later, capture everything for a while, and without injecting Mailman or SpamAssassin headers. That won't be a particularly good corpus for the lists in general, because over any brief period a small number of topics and posters dominate. But it will be a fair test for how systems do over exactly that brief period <wink>.
Here's a rough breakdown of mail rejected by mail.python.org over the last 10 days, eyeball-estimated messages per day:
    bad RCPT                        150 - 300  [1]
    bad sender                       50 - 190  [2]
    relay denied                     20 - 180  [3]
    known spammer addr/domain        15 -  60
    8-bit chars in subject          130 - 200
    8-bit chars in header addrs      10 -  60
    banned charset in subject         5 -  50  [4]
    "ADV" in subject                  0 -   5
    no Message-Id header            100 - 400  [5]
    invalid header address syntax     5 -  50  [6]
    no valid senders in header       10 -  15  [7]
    rejected by SpamAssassin         20 -  50  [8]
    quarantined by SpamAssassin       5 -  50  [8]
We should start another category, "Messages from Tim rejected for bogus reasons" <wink>.
[1] this includes mail accidentally sent to eg. giudo@python.org, but based on scanning the reject logs, I'd say the vast majority is spam. However, such messages are rejected after RCPT TO, so we never see the message itself. Most of the bad recipient addrs are either ancient (ipc6@python.org, grail-feedback@python.org) or fictitious (success@python.org, info@python.org).
[2] sender verification failed, eg. someone tried to claim an envelope sender like foo@bogus.domain. Usually spam, but innocent bystanders can be hit by DNS servers suddenly exploding (hello, comcast.net). This only includes hard failures (DNS "no such domain"), not soft failures (DNS timeout).
[3] I'd be leery of accepting mail that's trying to hijack mail.python.org as an open relay, even though that would be a goldmine of spam. (OTOH, we could reject after the DATA command, and save the message anyways.)
[4] mail.python.org rejects any message with a properly MIME-encoded subject using any of the following charsets: big5, euc-kr, gb2312, ks_c_5601-1987
[5] includes viruses as well as spam (and no doubt some innocent false positives, although I have added exemptions for the MUA/MTA combinations that most commonly result in legit mail reaching mail.python.org without a Message-Id header, eg. KMail/qmail)
[6] eg. "To: all my friends" or "From: <>"
[7] no valid sender address in any header line -- eg. someone gives a valid MAIL FROM address, but then puts "From: blah@bogus.domain" in the headers. Easily defeated with a "Sender" or "Reply-to" header.
[8] any message scoring >= 10.0 is rejected at SMTP time; any message scoring >= 5.0 but < 10 is saved in /var/mail/spam for later review
Greg, you show signs of enjoying this job too much <wink>.
Executive summary:
* it's a good thing we do all those easy checks before involving SA, or the load on the server would be a lot higher
So long as easy checks don't block legitimate email, I can't complain about that.
* give me 10 days of spam-harvesting, and I can equal Bruce Guenter's spam archive for 2002. (Of course, it'll take a couple of days to set the mail server up for the harvesting, and a couple more days to clean through the ~2000 caught messages, but you get the idea.)
If it would be helpful for me to do research on corpora that include the headers, then the point would be to collect both spam and non-spam messages, so that they can be compared directly to each other. Those should be as close to the bytes coming off the pipe as possible (e.g., before injecting new headers of our own). As is, I've had to throw the headers away in both corpora, so am, in effect, working with a crippled version of the algorithm. Or if someone else is doing research on how best to tokenize and tag headers, I'm not terribly concerned about merging the approaches untested. If the approach is valuable enough to deploy, we'll eventually see exactly how well it works in real life.
... Perhaps that spam-harvesting run should also set aside a random selection of apparently-non-spam messages received at the same time. Then you'd have a corpus of mail sent to the same server, more-or-less to the same addresses, over the same period of time.
Yes, it wants something as close to a slice of real life as possible, in all conceivable respects, including ratio of spam to not spam, arrival times, and so on.
Oh, any custom corpus should also include the ~300 false positives and ~600 false negatives gathered since SA started running on mail.python.org in April.
Definitely not.  That's not a slice of real life, it's a distortion based on
how some *other* system screwed up.  Train it systematically on that, and
you're not training it for real life.  The urge to be clever is strong, but
must be resisted <0.3 wink>.  What would be perfectly reasonable is to run
(not train) the system against those corpora to see how it does.

BTW, Barry said the good-message archives he put together were composed of
msgs archived after SpamAssassin was enabled.  Since about 80% of the 1%
"false positive" rate I first saw turned out to be blatant spam in the ham
corpus, this suggests SpamAssassin let about 160000 * 1% * 80% = 12800 spams
through to the python-list archive alone.  That doesn't jibe with "600 false
negatives" at all.  I don't want to argue about it, it's just fair warning
that I don't believe much that I hear <wink>.  In particular, in *this* case
I don't believe python-list actually got 160000 messages since April, unless
we're talking about April of 2000.
FYI.  After cleaning the blatant spam identified by the classifier out of my
ham corpus, and replacing it with new random msgs from Barry's corpus, the
reported false positive rate fell to about 0.2% (averaging 8 per each batch
of 4000 ham test messages).  This seems remarkable given that it's ignoring
headers, and just splitting the raw text on whitespace in total ignorance of
HTML & MIME etc.  'FREE' (all caps) moved into the ranks of best spam
indicators.  The false negative rate got reduced by a small amount, but I
doubt it's a statistically significant reduction (I'll compute that stuff
later; I'm looking for Big Things now).

Some of these false positives are almost certainly spam, and at least one is
almost certainly a virus:  these are msgs that are 100% base64-encoded, or
maximally obfuscated quoted-printable.  That could almost certainly be fixed
by, e.g., decoding encoded text.

The other false positives seem harder to deal with:

+ Brief HTML msgs from newbies.  I doubt the headers will help these get
  through, as they're generally first-time posters, and aren't replies to
  earlier msgs.  There's little positive content, while all elements of raw
  HTML have high "it's spam" probability.  Example:

"""
--------------=_4D4800B7C99C4331D7B8
Content-Description: filename="text1.txt"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Is there a version of Python with Prolog Extension??
Where can I find it if there is?

Thanks,
Luis.

P.S. Could you please reply to the sender too.

--------------=_4D4800B7C99C4331D7B8
Content-Description: filename="text1.html"
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Prolog Extension</TITLE>
<META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
<META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
<META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
<META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
</HEAD>
<BODY>
<PRE>Is there a version of Python with Prolog Extension??
Where can I find it if there is?

Thanks,
Luis.

P.S. Could you please reply to the sender too.</PRE>
</BODY>
</HTML>
--------------=_4D4800B7C99C4331D7B8--"""
"""

Here's how it got scored:

    prob = 0.999958816093
    prob('<META') = 0.957529
    prob('<META') = 0.957529
    prob('<META') = 0.957529
    prob('<BODY>') = 0.979284
    prob('Prolog') = 0.01
    prob('<HEAD>') = 0.97989
    prob('Thanks,') = 0.0337316
    prob('Prolog') = 0.01
    prob('Python') = 0.01
    prob('NAME=3D"GENERATOR"') = 0.99
    prob('<HTML>') = 0.99
    prob('</HTML>') = 0.989494
    prob('</BODY>') = 0.987429
    prob('Thanks,') = 0.0337316
    prob('Python') = 0.01

Note that '<META' gets penalized 3 times.  More on that later.

+ Msgs talking *about* HTML, and including HTML in examples.  This one may
  be troublesome, but there are mercifully few of them.

+ Brief msgs with obnoxious employer-generated signatures.  Example:

"""
Hi there,

I am looking for you recommendations on training courses available in the
UK on Python.  Can you help?

Thanks,

Vickie Mills
IS Training Analyst

Tel:    0131 245 1127
Fax:    0131 245 1550
E-mail: vickie_mills@standardlife.com

For more information on Standard Life, visit our website
http://www.standardlife.com/

The Standard Life Assurance Company, Standard Life House, 30 Lothian Road,
Edinburgh EH1 2DH, is registered in Scotland (No SZ4) and regulated by the
Personal Investment Authority.  Tel: 0131 225 2552 - calls may be recorded
or monitored.  This confidential e-mail is for the addressee only.
If received in error, do not retain/copy/disclose it without our consent and
please return it to us.  We virus scan all e-mails but are not responsible
for any damage caused by a virus or alteration by a third party after it is
sent.
"""

The scoring:

    prob = 0.98654879055
    prob('our') = 0.928936
    prob('sent.') = 0.939891
    prob('Tel:') = 0.0620155
    prob('Thanks,') = 0.0337316
    prob('received') = 0.940256
    prob('Tel:') = 0.0620155
    prob('Hi') = 0.0533333
    prob('help?') = 0.01
    prob('Personal') = 0.970976
    prob('regulated') = 0.99
    prob('Road,') = 0.01
    prob('Training') = 0.99
    prob('e-mails') = 0.987542
    prob('Python.') = 0.01
    prob('Investment') = 0.99

The brief human-written part is fine, but the longer boilerplate sig is
indistinguishable from spam.

+ The occasional non-Python conference announcement(!).  These are long, so
  I'll skip an example.  In effect, it's automated bulk email trying to sell
  you a conference, so is prone to use the language and artifacts of
  advertising.  Here's typical scoring, for the TOOLS Europe '99 conference
  announcement:

    prob = 0.983583974285
    prob('THE') = 0.983584
    prob('Object') = 0.01
    prob('Bell') = 0.01
    prob('Object-Oriented') = 0.01
    prob('**************************************************************') = 0.99
    prob('Bertrand') = 0.01
    prob('Rational') = 0.01
    prob('object-oriented') = 0.01
    prob('CONTACT') = 0.99
    prob('**************************************************************') = 0.99
    prob('innovative') = 0.99
    prob('**************************************************************') = 0.99
    prob('Olivier') = 0.01
    prob('VISIT') = 0.99
    prob('OUR') = 0.99

Note the repeated penalty for the lines of asterisks.  That segues into the
next one:

+ Artifacts of the fact that the algorithm counts multiple instances of "a
  word" multiple times.  These are baffling at first sight!  The two
  clearest examples:

"""
Can you create and use new files with dbhash.open()?
Yes. But if I run db_dump on these files, it says "unexpected file type or format", regardless which db_dump version I use (2.0.77, 3.0.55, 3.1.17)
It may be that db_dump isn't compatible with version 1.85 databse files. I can't remember. I seem to recall that there was an option to build 1.85 versions of db_dump and db_load. Check the configure options for BerkeleyDB to find out. (Also, while you are there, make sure that BerkeleyDB was built the same on both of your platforms...)
Try running db_verify (one of the utilities built when you compiled DB) on the file and see what it tells you.
There is no db_verify among my Berkeley DB utilities.
There should have been a bunch of them built when you compiled DB.  I've got
these:

    -r-xr-xr-x 1 rd users 343108 Dec 11 12:11 db_archive
    -r-xr-xr-x 1 rd users 342580 Dec 11 12:11 db_checkpoint
    -r-xr-xr-x 1 rd users 342388 Dec 11 12:11 db_deadlock
    -r-xr-xr-x 1 rd users 342964 Dec 11 12:11 db_dump
    -r-xr-xr-x 1 rd users 349348 Dec 11 12:11 db_load
    -r-xr-xr-x 1 rd users 340372 Dec 11 12:11 db_printlog
    -r-xr-xr-x 1 rd users 341076 Dec 11 12:11 db_recover
    -r-xr-xr-x 1 rd users 353284 Dec 11 12:11 db_stat
    -r-xr-xr-x 1 rd users 340340 Dec 11 12:11 db_upgrade
    -r-xr-xr-x 1 rd users 340532 Dec 11 12:11 db_verify

--
Robin Dunn
Software Craftsman
robin@AllDunn.com
http://wxPython.org    Java give you jitters?
http://wxPROs.com      Relax with wxPython!
"""

Looks utterly on-topic!  So why did Robin's msg get flagged?  It's solely
due to his Unix name in the ls output(!):

    prob = 0.999999999895
    prob('Berkeley') = 0.01
    prob('configure') = 0.01
    prob('remember.') = 0.01
    prob('these:') = 0.01
    prob('recall') = 0.01
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99
    prob('rd') = 0.99

Spammers often generate random "word-like" gibberish at the ends of msgs,
and "rd" is one of the random two-letter combos that appears in the spam
corpus.  Perhaps it would be good to ignore "words" with fewer than W
characters (to be determined by experiment).

The other example is long, an off-topic but delightful exchange between
Peter Hansen and Alex Martelli.  Here's a "typical" paragraph:

    Since it's important to use very abundant amounts of water when cooking
    pasta, the price of what is still a very cheap dish would skyrocket if
    that abundant water had to be costly bottled mineral water.

The scoring:

    prob = 0.99
    prob('"Peter') = 0.01
    prob(':-)') = 0.01
    prob('<peter@engcorp.com>') = 0.01
    prob('tasks') = 0.01
    prob('drinks') = 0.01
    prob('wrote') = 0.01
    prob('Hansen"') = 0.01
    prob('water') = 0.99
    prob('water') = 0.99
    prob('skyrocket') = 0.99
    prob('water') = 0.99
    prob('water') = 0.99
    prob('water') = 0.99
    prob('water') = 0.99
    prob('water') = 0.99

Alex is drowning in his aquatic excess <wink>.  I expect that including the
headers would have given these much better chances of getting through, given
Robin and Alex's posting histories.  Still, the idea of counting words
multiple times is open to question, and experiments both ways are in order.

+ Brief put-ons, like

"""
HEY DUDEZ !

I WANT TO GET INTO THIS AUTOCODING THING.

ANYONE KNOW WHERE I CAN GET SOME IBM 1401 WAREZ ?

-- MULTICS-MAN
"""

It's not actually things like WAREZ that hurt here, it's more the mere fact
of SCREAMING:

    prob = 0.999982095931
    prob('AUTOCODING') = 0.2
    prob('THING.') = 0.2
    prob('DUDEZ') = 0.2
    prob('ANYONE') = 0.884211
    prob('GET') = 0.847334
    prob('GET') = 0.847334
    prob('HEY') = 0.2
    prob('--') = 0.0974729
    prob('KNOW') = 0.969697
    prob('THIS') = 0.953191
    prob('?') = 0.0490886
    prob('WANT') = 0.99
    prob('TO') = 0.988829
    prob('CAN') = 0.884211
    prob('WAREZ') = 0.2

OTOH, a lot of the Python community considered the whole autocoding thread
to be spam, and I personally could have lived without this contribution to
its legacy (alas, the autocoding thread wasn't spam, just badly off-topic).

+ Msgs top-quoting an earlier spam in its entirety.  For example, one msg
  quoted an entire Nigerian scam msg, and added just

    Aw jeez, another one of these Nigerian wire scams.  This one has been
    around for 20 years.

What's an acceptable false positive rate?  What do we get from SpamAssassin?
I expect we can end up below 0.1% here, and with a generous meaning for "not
spam", but I think *some* of these examples show that the only way to get a
0% false-positive rate is to recode spamprob like so:

    def spamprob(self, wordstream, evidence=False):
        return 0.0

That would also allow other simplifications <wink>.
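Two of the cheaper fixes floated above -- ignoring very short "words" like
'rd', and counting each distinct word only once per message -- are easy to
prototype on top of the whitespace tokenizer.  A hedged sketch; the
3-character cutoff is an arbitrary stand-in for the W "to be determined by
experiment":

    def tokenize_filtered(text, min_len=3):
        # Skip very short tokens like 'rd' that carry little meaning but
        # can pick up accidental spam associations from the corpus.
        for w in text.split():
            if len(w) >= min_len:
                yield w

    def tokenize_unique(text, min_len=3):
        # Additionally count each distinct token once per message, so a
        # single repeated word ('water', 'rd', a line of asterisks) can't
        # crowd out the other strong clues by itself.
        seen = set()
        for w in tokenize_filtered(text, min_len):
            if w not in seen:
                seen.add(w)
                yield w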
On 28 August 2002, Tim Peters said:
What's an acceptable false positive rate?
Speaking as one of the people who reviews suspected spam for python.org and rescues false positives, I would say that the more relevant figure is: how much suspected spam do I have to review every morning? < 10 messages would be peachy; right now it's around 5-20 messages per day. Currently there are probably 1-3 FPs per day, although on a bad day there can be 5-10. (Eg. on 2002-08-21, six mailman-users posts from the same guy were all caught, mainly because his ISP added X-AntiAbuse, and his messages were multipart/alternative with unwrapped plain text. This is a perfect example of SpamAssassin screwing up royally.) 1-3 FPs/day I can live with, but the real burden is the manual review: I'd much rather have 5 FPs in a pool of 10 suspects than 1 FP out of 100 suspects.
What do we get from SpamAssassin?
Recall the stats I posted this morning; the bulk of spam is in Chinese or Korean, and I have things setup so SpamAssassin never even sees it. I think the only way to meaningfully answer this question is to stash *everything* mail.python.org receives for a day or 10, spam and otherwise, and run it all through SA. Greg
[Tim]
What's an acceptable false positive rate?
[Greg Ward]
Speaking as one of the people who reviews suspected spam for python.org and rescues false positives, I would say that the more relevant figure is: how much suspected spam do I have to review every morning? < 10 messages would be peachy; right now it's around 5-20 messages per day.
I must be missing something. I would *hope* that you review *all* messages claimed to be spam, in which case the number of msgs to be reviewed would, in a perfectly accurate system, be equal to the number of spams received. OTOH, the false positive rate doesn't have anything to do with the number of spams received, it has to do with the number of non-spams received.
Currently there are probably 1-3 FPs per day, although on a bad day there can be 5-10. (Eg. on 2002-08-21, six mailman-users posts from the same guy were all caught, mainly because his ISP added X-AntiAbuse, and his messages were multipart/alternative with unwrapped plain text. This is a perfect example of SpamAssassin screwing up royally.) 1-3 FPs/day I can live with, but the real burden is the manual review: I'd much rather have 5 FPs in a pool of 10 suspects than 1 FP out of 100 suspects.
Maybe you don't want this kind of approach at all.  The classifier doesn't
have "gray areas" in practice: it tends to give probabilities near 1, or
near 0, and there's very little in between -- a msg either has a
preponderance of spam indicators, or a preponderance of non-spam indicators.
You're simply not going to get a batch of "hmm, I'm not really sure about
these" out of it.

You would in a conventional Bayesian classifier, but Graham's ignores almost
all of the words, judging on only the most extreme words present; when only
extremes are fed in, the final result also tends to be extreme (the only
cases where that doesn't obtain are those where the most extreme words it
finds aren't extreme at all; e.g., a msg consisting entirely of "the", "and"
and "it" would get rated as 0.5).
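For concreteness, here is a minimal sketch of the "only the most extreme
words" scoring step described above, in the spirit of Graham's essay.  The
sandbox GrahamBayes class differs in details (number of clues kept, tie
handling, special probability values), so read this as the shape of the
computation rather than the code itself:

    def score(word_probs, max_discriminators=16):
        """word_probs: P(spam | word) for each token in the message."""
        # Keep only the clues farthest from the neutral 0.5.
        extremes = sorted(word_probs,
                          key=lambda p: abs(p - 0.5),
                          reverse=True)[:max_discriminators]
        if not extremes:
            return 0.5
        prod_spam = prod_ham = 1.0
        for p in extremes:
            prod_spam *= p
            prod_ham *= 1.0 - p
        # Naive-Bayes combination of the selected clues.
        return prod_spam / (prod_spam + prod_ham)

With every input clamped to [0.01, 0.99] the products can't underflow, and
feeding in only extremes is exactly why the output is almost always very
close to 0 or 1 (a message of nothing but neutral 0.5 clues comes out at
0.5, matching the example above).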
What do we get from SpamAssassin?
Recall the stats I posted this morning; the bulk of spam is in Chinese or Korean, and I have things setup so SpamAssassin never even sees it. I think the only way to meaningfully answer this question is to stash *everything* mail.python.org receives for a day or 10, spam and otherwise, and run it all through SA.
It would be good to have such a corpus regardless.
[Tim, last week]
What's an acceptable false positive rate?
[my response]
Speaking as one of the people who reviews suspected spam for python.org and rescues false positives, I would say that the more relevant figure is: how much suspected spam do I have to review every morning? < 10 messages would be peachy; right now it's around 5-20 messages per day.
[Tim again]
I must be missing something. I would *hope* that you review *all* messages claimed to be spam, in which case the number of msgs to be reviewed would, in a perfectly accurate system, be equal to the number of spams received.
Good lord, certainly not!  Remember that Exim rejects a couple hundred
messages a day that never get near SpamAssassin -- that's mostly
Chinese/Korean junk that's rejected on the basis of 8-bit chars or banned
charsets in the headers.  Then, probably 50-75% of what SA gets its hands on
scores >= 10.0, so it too is rejected at SMTP time.  Only messages that
score < 10 are accepted, and those that score >= 5.0 are set aside in
/var/mail/spam for review.  That's 10-30 messages/day.

(I do occasionally scan Exim's reject log on mail.python.org to see what's
getting rejected today -- Exim kindly logs the full headers of every message
that is rejected after the DATA command.  I usually make it to about 11am of
a given day's logfile before my eyes glaze over from the endless stream of
spam and viruses.)

Note that we *used* to accept messages before passing them to SpamAssassin,
so never rejected anything on the basis of its SA score.  Back then, we
saved and reviewed probably 50-70 messages/day.  Very, very, very few (if
any) false positives scored >= 10.0, which is why that's the threshold for
SMTP-time rejection.
OTOH, the false positive rate doesn't have anything to do with the number of spams received, it has to do with the number of non-spams received.
Err, yeah, good point. I make a point of talking about "suspected spam", which is any message that scores between 5.0 and 10.0. IMHO, the true nature of those messages can only be determined by manual inspection.
Maybe you don't want this kind of approach at all. The classifier doesn't have "gray areas" in practice: it tends to give probabilites near 1, or near 0, and there's very little in between -- a msg either has a preponderance of spam indicators, or a preponderance of non-spam indicators.
That's a great improvement over SpamAssassin then: with SA, the grey area
(IMHO) is scores from 3 to 10... which is why several python.org lists now
have a little bit of Mailman configuration magic that makes MM set aside
messages with an SA score >= 3 for list admin review.  (It's probably worth
getting the list admin to do a bit more work in order to avoid sending
low-scoring spam to the list.)

However, as long as "very little" != "nothing", we still need to worry a bit
about that grey area.  What do you think we should do with a message whose
spam probability is between (say) 0.1 and 0.9?  Send it on, reject it, or
set it aside?  Just how many messages fall in that grey area anyways?

Greg
--
Greg Ward <gward@python.net>                         http://www.gerg.ca/
MTV -- get off the air!
    -- Dead Kennedys
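The three-way question above maps onto a simple disposition function.  The
cutoffs below are hypothetical placeholders matching the 0.1/0.9 band in the
question, not anything the classifier itself prescribes:

    def disposition(spamprob, spam_cutoff=0.9, ham_cutoff=0.1):
        # Hypothetical policy: near-certain spam is rejected, near-certain
        # ham is delivered, and the band in between is held for a human.
        if spamprob >= spam_cutoff:
            return 'reject'
        if spamprob <= ham_cutoff:
            return 'deliver'
        return 'hold for review'   # the grey area being asked about

The answer that follows is that, with this classifier, almost nothing lands
in that middle band anyway.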
[Tim again]
I must be missing something. I would *hope* that you review *all* messages claimed to be spam, in which case the number of msgs to be reviewed would, in a perfectly accurate system, be equal to the number of spams received.
[Greg Ward]
Good lord, certainly not! Remember that Exim rejects a couple hundred messages a day that never get near SpamAssassin -- that's mostly Chinese/Korean junk that's rejected on the basis of 8-bit chars or banned charsets in the headers. Then, probably 50-75% of what SA gets its hands on scores >= 10.0, so it too is rejected at SMTP time. Only messages that score < 10 are accepted, and those that score >= 5.0 are set aside in /var/mail/spam for review. That's 10-30 messages/day.
(I do occasionally scan Exim's reject log on mail.python.org to see what's getting rejected today -- Exim kindly logs the full headers of every message that is rejected after the DATA command. I usually make it to about 11am of a given day's logfile before my eyes glaze over from the endless stream of spam and viruses.)
I get about 200 spams per day on my own email accounts, and look at all of them. I don't look at the headers at all, I just look at the msgs in a capable HTML-aware mail reader, as a matter of course while dealing with all the day's email. It's rare that it takes more than a second to recognize a spam by eyeball and hit the delete key. At about 200 per day, it's just now reaching my "hmm, this is becoming a nuisance sometimes" threshold. Our tolerance levels for manual review seem to differ by a factor of 100 or more <wink>.
Note that we *used* to accept messages before passing them to SpamAssassin, so never rejected anything on the basis of its SA score. Back then, we saved and reviewed probably 50-70 messages/day. Very, very, very few (if any) false positives scored >= 10.0, which is why that's the threshold for SMTP-time rejection.
I can tell you the mean false negative and false positive rates on what I've been working on, and even measure their variance across both training and prediction sets. (The fn rate is well under 2% now (adding in more headers should improve that a lot), and the fp rate under 0.05% (but I doubt that adding in more headers will improve this)). So long as we don't know the rates for the scheme you're using now, there's no objective basis for comparison. ...
Maybe you don't want this kind of approach at all. The classifier doesn't have "gray areas" in practice: it tends to give probabilites near 1, or near 0, and there's very little in between -- a msg either has a preponderance of spam indicators, or a preponderance of non-spam indicators.
That's a great improvement over SpamAssassin then: with SA, the grey area (IMHO) is scores from 3 to 10... which is why several python.org lists now have a little bit of Mailman configuration magic that makes MM set aside messages with an SA score >= 3 for list admin review. (It's probably worth getting the list admin to do a bit more work in order to avoid sending low-scoring spam to the list.)
However, as long as "very little" != "nothing", we still need to worry a bit about that grey area. What do you think we should do with a message whose spam probability is between (say) 0.1 and 0.9? Send it on, reject it, or set it aside?
Under Graham's scheme, send it on. It doesn't have grey areas in a useful sense, because the scoring step only looks at a handful of extremes: extremes in, extremes out, and when it's wrong it's *spectacularly* wrong (e.g., the very rare (< 0.05%) false positives generally have "probabilities" exceeding 0.99, and a false negative often has a "probability" less than 0.01).
Just how many messages fall in that grey area anyways?
I can't get at my testing setup now and don't know the answer offhand.  I'll
try to make time tonight to determine the answer.  I guess the interesting
stats are what percent of hams have probs in (0.1, 0.9), and what percent of
spams.  In general, it's only very brief messages that don't score near 0.0
or 1.0, so this *may* turn out to be the same thing as asking what
percentages of hams and spams are very brief.

Note too that adding the headers in *should* catch a lot more spam under
this scheme.  But, even as is, and even if I strip all the HTML tags out of
spam, fewer than 1 spam in 50 scores less than 0.9.  The ones that are
passed on now include all spams with empty bodies (a message with an empty
body scores 0.5).
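The stat wanted here is cheap to compute once each test run yields a list of
per-message scores.  A small sketch, assuming `scores` is just such a list;
the 40-bucket (2.5%) width matches the histograms posted later in the
thread:

    def grey_fraction(scores, lo=0.1, hi=0.9):
        """Fraction of messages scoring strictly between lo and hi."""
        grey = sum(1 for p in scores if lo < p < hi)
        return grey / float(len(scores))

    def histogram(scores, nbuckets=40):
        """Counts per bucket of width 1/nbuckets, like the Tester output."""
        buckets = [0] * nbuckets
        for p in scores:
            buckets[min(nbuckets - 1, int(p * nbuckets))] += 1
        return buckets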
Tim Peters wrote:
Under Graham's scheme, send it on. It doesn't have grey areas in a useful sense, becuase the scoring step only looks at a handful of extremes: extremes in, extremes out, and when it's wrong it's *spectacularly* wrong (e.g., the very rare (< 0.05%) false positives generally have "probabilties" exceeding 0.99, and a false negative often has a "probability" less then 0.01).
I noticed that as well.  When the classifier goes wrong it goes badly wrong
and using different thresholds would not help.  It seems that increasing the
number of discriminators doesn't really help either.  Too bad because
otherwise you could flag those messages for human classification.  On the
bright side, based on the number of mis-classified messages in my corpus, it
looks like a human would have a very hard time doing a better job.  Perhaps
all that is needed is a bypass mechanism for that small fraction of
non-spammers.  That way if their initial message is rejected they would
still have some way of getting through.

Erik Naggum made an interesting comment.  He said that spam should be
handled at the transport level.  Greg's work on doing filtering at SMTP time
accomplishes this and makes a lot of sense.  When a message is rejected, the
sending mail server is the one that has to deal with it.  In the case of
spam, the sending server is often an open relay.  Letting it handle the
bounces is sweet justice. :-)

I bring this up because "SMTP time filtering" makes a bypass mechanism work
much better.  With a system like TMDA, confirmation notices usually generate
double-bounces.  Instead, we could reject the message with a 5xx error that
includes instructions on how to bypass the filter (e.g. include a cookie in
the body of the message).

  Neil
Erik Naggum made an interesting comment. He said that spam should be handled at the transport level. Greg's work on doing filtering at SMTP time accomplishes this and makes a lot of sense. When a message is rejected, the sending mail server is the one that has to deal with it. In the case of spam, the sending server is often an open rely. Letting it handle the bounces is sweet justice. :-)
In the case of a false positive, it has the added advantage that at least the poor sender, falsely accused of sending spam, gets a bounce and may try again.
I bring this up because "STMP time filtering" makes a bypass mechanism work much better. With a system like TMDA, confirmation notices usually generate double-bounces. Instead, we could reject the message with a 5xx error that includes instructions on how to bypass the filter (e.g. include a cookie in the body of the message).
Do you still believe that TMDA is the only answer to spam? --Guido van Rossum (home page: http://www.python.org/~guido/)
[Neil Schemenauer]
I noticed that as well. When the classifier goes wrong it goes badly wrong and using different thresholds would not help. It seems that increasing the number of discriminators doesn't really help either. Too bad because otherwise you could flag those messages for human classification.
I think it's worse than just that: suppose any scheme says "OK, this is spam, with probability 0.9995". If it's reporting accurate probabilities, then another way to read that claim is "On average, one time in 2000 this message actually isn't spam". In real life we have to accept that there's no scheme with a 0% false positive rate-- not even human review --short of the scheme that never calls anything spam. Since deciding on the largest acceptable false positive rate is far more a social than a technical issue, a group of nerds will do anything rather than face it <wink>.
Tim Peters wrote:
Since deciding on the largest acceptable false positive rate is far more a social than a technical issue, a group of nerds will do anything rather than face it <wink>.
I think we pretty much ran out of things to do. :-)

Still, I think the acceptable rate depends heavily on what happens to the
rejects.  If they go to /dev/null then it would have to be very low.  If
there are bounces and a way for the innocent victims to bypass the filter
then I consider 0.5% good enough for most situations.  The major remaining
problem would be handling legitimate automated email.  For mailing lists
that probably isn't an issue.

I'm probably not the guy to listen to about acceptable rates, though.  I
currently use TMDA and therefore am a heartless bastard. :-)

  Neil
Neil Schemenauer <nas@python.ca> writes:
I bring this up because "STMP time filtering" makes a bypass mechanism work much better. With a system like TMDA, confirmation notices usually generate double-bounces. Instead, we could reject the message with a 5xx error that includes instructions on how to bypass the filter (e.g. include a cookie in the body of the message).
TMDA doesn't do this because it would make more work for the sender to get his message delivered. Because TMDA stores the incoming messages in a local queue, the sender just has to reply to a confirmation request, and his original message gets delivered. As opposed to having to cut and paste his message from the body of a bounce and then resend it. So, not operating at the transport level saves your correspondents some work at the expense of some bandwidth. -- (http://tmda.net/)
Tim Peters <tim.one@comcast.net> wrote:
Under Graham's scheme, send it on. It doesn't have grey areas in a useful sense, becuase the scoring step only looks at a handful of extremes: extremes in, extremes out, and when it's wrong it's *spectacularly* wrong (e.g., the very rare (< 0.05%) false positives generally have "probabilties" exceeding 0.99, and a false negative often has a "probability" less then 0.01).
I would love to see how the results would be affected by applying the scoring scheme to the entire content of the message, instead of just the 15 (or 16 in your case) most extreme samples. By the way, you never said why you increased that number by one; did it make that much difference? Charles -- ----------------------------------------------------------------------- Charles Cazabon <python@discworld.dyndns.org> GPL'ed software available at: http://www.qcc.ca/~charlesc/software/ -----------------------------------------------------------------------
[Charles Cazabon]
I would love to see how the results would be affected by applying the scoring scheme to the entire content of the message, instead of just the 15 (or 16 in your case) most extreme samples.
Then it would be close to a classic Bayesian classifier, and like any such would need entirely different scoring code to avoid catastrophic floating-point errors (right now an intermediate result can't become smaller than 0.01**16 = 1e-32, so fp troubles are impossible; raise the exponent to a measly 200 and you're already out of the range of IEEE double precision; classic classifiers work in logarithm space instead for this reason). You can read lots of papers on how those do; all evidence suggests they do worse than this scheme on the spam versus non-spam task.
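The log-space trick mentioned here is worth spelling out: instead of
multiplying hundreds of probabilities (which underflows a double), you sum
their logarithms and exponentiate only a difference at the end.  A hedged
sketch of what a whole-message combiner could do -- the formula matches the
Graham-style combination, but nothing in the sandbox code works this way:

    import math

    def combine_all(word_probs):
        """Combine every clue, not just the extremes, without underflow.

        Assumes each p is already clamped away from exactly 0.0 and 1.0
        (e.g. to [0.01, 0.99]) so the logs are finite."""
        log_spam = sum(math.log(p) for p in word_probs)
        log_ham = sum(math.log(1.0 - p) for p in word_probs)
        # prod_spam / (prod_spam + prod_ham), rewritten so only the
        # *difference* of the log sums is ever exponentiated.
        diff = log_ham - log_spam
        if diff > 700:          # exp() would overflow; clearly ham
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))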
By the way, you never said why you increased that number by one;
It's explained in the comment block preceding the MAX_DISCRIMINATORS definition. BTW, in an unreported experiment I boosted MAX_DISCRIMINATORS to 36. I don't recall what happened now, but it was a disaster for at least one of the error rates.
did it make that much difference?
Not on average.  It helped eliminate a narrow class of false positives,
where previously the first 15 extremes the classifier saw had 8 probs of .99
and 7 of .01.  That works out to "spam".  Making the # of discriminators
even instead allowed for graceful ties, which favor ham in this scheme.

All previous decisions "should be" revisited after each new change, though,
and in this particular case it could well be that stripping HTML tags out of
plain-text messages also addressed the same narrow issue but in a more
effective way (without some special gimmick, virtually every message
including so much as an example of HTML got scored as spam).
[Greg Ward]
... Just how many messages fall in that grey area anyways?
Heh. Here's the probability distribution for the 4000 ham messages in my first test pair: Ham distribution for this pair: * = 67 items 0.00 4000 ************************************************************ 2.50 0 5.00 0 7.50 0 10.00 0 12.50 0 15.00 0 17.50 0 20.00 0 22.50 0 25.00 0 27.50 0 30.00 0 32.50 0 35.00 0 37.50 0 40.00 0 42.50 0 45.00 0 47.50 0 50.00 0 52.50 0 55.00 0 57.50 0 60.00 0 62.50 0 65.00 0 67.50 0 70.00 0 72.50 0 75.00 0 77.50 0 80.00 0 82.50 0 85.00 0 87.50 0 90.00 0 92.50 0 95.00 0 97.50 0 That is, they *all* got a "probability score" less than 2.5% (0.025). Here's the spam probability distribution across the same run: Spam distribution for this pair: * = 46 items 0.00 5 * 2.50 2 * 5.00 1 * 7.50 0 10.00 0 12.50 0 15.00 1 * 17.50 0 20.00 1 * 22.50 0 25.00 2 * 27.50 1 * 30.00 0 32.50 1 * 35.00 0 37.50 0 40.00 0 42.50 0 45.00 1 * 47.50 1 * 50.00 1 * 52.50 0 55.00 0 57.50 1 * 60.00 3 * 62.50 0 65.00 2 * 67.50 0 70.00 0 72.50 0 75.00 1 * 77.50 1 * 80.00 0 82.50 0 85.00 0 87.50 0 90.00 3 * 92.50 1 * 95.00 6 * 97.50 2715 ************************************************************ IOW, a spam usually scored at least 0.975 on this run, but some spams scored under 0.025. There's very little "in the middle". I've got 19 more sets like this if you care a lot <wink>. Here's the aggregate across all 20 runs (each msg is counted 4 times here, once for each of the runs in which it served in the prediction set against training on one of the 4 spam+ham collection pairs it doesn't belong to): Ham distribution for all runs: * = 1333 items 0.00 79938 ************************************************************ 2.50 8 * 5.00 3 * 7.50 0 10.00 3 * 12.50 1 * 15.00 3 * 17.50 1 * 20.00 1 * 22.50 0 25.00 0 27.50 0 30.00 1 * 32.50 4 * 35.00 2 * 37.50 0 40.00 2 * 42.50 0 45.00 1 * 47.50 1 * 50.00 1 * 52.50 0 55.00 0 57.50 0 60.00 0 62.50 1 * 65.00 0 67.50 0 70.00 2 * 72.50 0 75.00 1 * 77.50 1 * 80.00 0 82.50 0 85.00 1 * 87.50 1 * 90.00 0 92.50 1 * 95.00 1 * 97.50 21 * Spam distribution for all runs: * = 905 items 0.00 215 * 2.50 18 * 5.00 8 * 7.50 12 * 10.00 6 * 12.50 6 * 15.00 14 * 17.50 6 * 20.00 10 * 22.50 8 * 25.00 9 * 27.50 9 * 30.00 3 * 32.50 3 * 35.00 5 * 37.50 3 * 40.00 7 * 42.50 24 * 45.00 3 * 47.50 29 * 50.00 34 * 52.50 8 * 55.00 6 * 57.50 18 * 60.00 64 * 62.50 12 * 65.00 7 * 67.50 5 * 70.00 3 * 72.50 7 * 75.00 4 * 77.50 18 * 80.00 10 * 82.50 23 * 85.00 13 * 87.50 20 * 90.00 27 * 92.50 18 * 95.00 57 * 97.50 54256 ************************************************************ In percentage terms, very little lives outside the tips of the tail ends. Note that calling the spam cutoff 0.975 instead of 0.90 would save 2 false positives, at the expense of letting an additional 27+18+57 = 102 spams go thru. Here's the first example of a low-prob spam: """ Low prob spam! 
0.0133104753792 Data/Spam/Set2/8007.txt prob('from:email name:<janet691') = 0.5 prob('the') = 0.5 prob('subject:Fred') = 0.5 prob('you') = 0.5 prob('was') = 0.305052 prob('bool:noorg') = 0.614515 prob('proposal') = 0.100629 prob('will') = 0.557569 prob('talk') = 0.507463 prob('send') = 0.858078 prob('nice') = 0.227838 prob('from:email addr:ac') = 0.0754717 prob('from:email addr:uk>') = 0.0488301 prob('thanks,') = 0.0300188 prob('subject:Hey') = 0.99 prob('today') = 0.852792 Return-Path: <janet691@cranfield.ac.uk> Delivered-To: bruce-spam@localhost Received: (qmail 14409 invoked by alias); 6 Mar 2002 20:07:42 -0000 Delivered-To: spam@bruce-guenter.dyndns.org Received: (qmail 14405 invoked from network); 6 Mar 2002 20:07:42 -0000 Received: from agamemnon.bfsmedia.com (204.83.201.2) by lorien.untroubled.org (192.168.1.3) with SMTP; 06 Mar 2002 20:07:42 -0000 Received: (qmail 13063 invoked by uid 500); 6 Mar 2002 20:02:05 -0000 Delivered-To: em-ca-spam@em.ca Received: (qmail 13057 invoked by uid 502); 6 Mar 2002 20:02:05 -0000 Delivered-To: bfsmedia-goose.kennels@bfsmedia.com Received: (qmail 13051 invoked from network); 6 Mar 2002 20:02:05 -0000 Received: from unknown (HELO smtp2.forserve.com) (63.170.11.221) by agamemnon.bfsmedia.com with SMTP; 6 Mar 2002 20:02:05 -0000 Date: Wed, 6 Mar 2002 15:12:41 -0500 Message-Id: <200203062012.g26KCfn08192@smtp2.forserve.com> X-Mailer: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.1) Gecko/20010607 Reply-To: <janet691@cranfield.ac.uk> From: <janet691@cranfield.ac.uk> To: <goose01977@bellsouth.net> Subject: Hey Fred Content-Length: 95 Lines: 9 Fred, It was nice to talk to you today I will send the proposal tonight. Thanks, Heidi """ You figure it out <wink>. I suspect bfsmedia would have added a high spam score if I looked at Received lines, but even several additional strong spam indicators wouldn't be enough to nail this one. BTW, this msg shows up many times in the spam corpora, varying the "Fred" and "Heidi" with other male and female names; I assume this is a harvester that's trying to provoke the recipient into replying. Several others are damaged in ways such that the email pkg can't create a msg out of them. I could easily enough add code to force such a msg to be considered spam. Some are wildly embarrassing failures: """ Low prob spam! 
0.000102019995919 Data/Spam/Set3/681.txt prob('common,') = 0.01 prob('definately') = 0.01 prob('logic') = 0.01 prob('hell,') = 0.01 prob('it".') = 0.01 prob('obvious.') = 0.01 prob('theory') = 0.01 prob('whilst') = 0.01 prob('earning') = 0.99 prob('same,') = 0.01 prob('$500,000') = 0.99 prob('"bull",') = 0.99 prob('year!!!') = 0.99 prob('internet!') = 0.99 prob('tv:') = 0.99 prob('*this') = 0.99 Return-Path: <ihrockrat3213@hotmail.com> Delivered-To: em-ca-bruceg@em.ca Received: (qmail 25721 invoked from network); 17 Aug 2002 01:05:07 -0000 Received: from unknown (HELO 65.102.48.161) (65.102.48.161) by churchill.factcomp.com with SMTP; 17 Aug 2002 01:05:07 -0000 Received: from unknown (149.89.93.47) by rly-xr02.mx.aol.com with NNFMP; Aug, 17 2002 1:50:22 AM -0800 Received: from anther.webhostingtalk.com ([88.58.121.118]) by da001d2020.lax-ca.osd.concentric.net with QMQP; Aug, 17 2002 12:40:13 AM -0700 Received: from 34.57.158.148 ([34.57.158.148]) by rly-xr02.mx.aol.com with local; Aug, 17 2002 12:02:05 AM +0300 From: rnpyjohn <ihrockrat3213@hotmail.com> To: Undisclosed Recipients Cc: Subject: Please read this letter carefully, it works 100% Sender: rnpyjohn <ihrockrat3213@hotmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Date: Sat, 17 Aug 2002 02:03:28 +0100 X-Mailer: The Bat! (v1.52f) Business X-Priority: 1 Content-Length: 15985 *This is a one time mailing and this list will never be used again.* Hi, SEEN THIS MAIL BEFORE?, SICK OF FINDING IT IN YOUR INBOX? ME TOO, HONEST I was exactly the same, till one day whilst i was complaining about how tired i was of seeing ... """ The first 16 most extreme indicators are split 9 highly in favor of ham (.01) and 7 highly in favor of spam (.99). If I hadn't folded case away to let stinking conference announcements through <wink>, I expect it would have latched on to the SCREAMING at the start instead of looking deeper. Looking at the To: line probably would nail this one too, as "Undisclosed Recipients" has two 0.99 spam indicators right there. Whatever, you *don't* want to look at msgs with a mix of just 0.99 and 0.01 thingies: it's not all that unusual to get such an extreme mix, in spam or ham. this-isn't-your-father's-idea-of-probability<wink>-ly y'rs - tim
[Tim]
... The first 16 most extreme indicators are split 9 highly in favor of ham (.01) and 7 highly in favor of spam (.99). If I hadn't folded case away to let stinking conference announcements through <wink>, I expect it would have latched on to the SCREAMING at the start instead of looking deeper. Looking at the To: line probably would nail this one too, as "Undisclosed Recipients" has two 0.99 spam indicators right there.
Whatever, you *don't* want to look at msgs with a mix of just 0.99 and 0.01 thingies: it's not all that unusual to get such an extreme mix, in spam or ham.
I should have added that it usually gets the right result when this happens.  It's the exceptions to that rule that are mondo embarrassing, because it's making a mistake then while sitting on a mountain of strong evidence (albeit pointing as extremely as possible in both directions at once <wink>).

"A problem" is that when a MIN_SPAMPROB and MAX_SPAMPROB clue both appear, the math is such that they cancel out exactly.  It's *almost* as if neither existed, but not quite:  they also keep two lower-probability words *out* of the computation (only a grand total of the MAX_DISCRIMINATORS most extreme clues are retained).  So I changed spamprob() to keep accepting more clues when MIN/MAX cancellations are inevitable, and to use the best of those in lieu of the cancelling extremes (a sketch of the idea follows the tables below).  This turned out to be a pure win:

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.075  0.025  won
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.050  tied

won   1 times
tied 19 times
lost  0 times

total unique fp went from 9 to 7

false negative percentages
    0.909  0.764  won
    0.800  0.691  won
    1.091  0.981  won
    1.381  1.309  won
    1.491  1.418  won
    1.055  0.873  won
    0.945  0.800  won
    1.236  1.163  won
    1.564  1.491  won
    1.200  1.200  tied
    1.454  1.381  won
    1.599  1.454  won
    1.236  1.164  won
    0.800  0.655  won
    0.836  0.655  won
    1.236  1.163  won
    1.236  1.200  won
    1.055  0.982  won
    1.127  0.982  won
    1.381  1.236  won

won  19 times
tied  1 times
lost  0 times

total unique fn went from 284 to 260
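For concreteness, here's a rough sketch of the clue-selection idea described above -- not the checked-in spamprob().  The ranking-by-distance-from-0.5 and the exact bookkeeping are illustrative assumptions; the point is just that an exactly-cancelling MIN/MAX pair multiplies the spam and ham sides of the score by the same factor, so it shouldn't be allowed to use up two of the MAX_DISCRIMINATORS slots:

    MIN_SPAMPROB = 0.01
    MAX_SPAMPROB = 0.99
    MAX_DISCRIMINATORS = 16   # 15 at the time of this msg; boosted to 16 later in the thread

    def pick_clues(word_probs):
        """word_probs: (word, spamprob) pairs seen in one msg.  Return the
        clues to feed the scoring step, pulling in extra clues whenever a
        MIN/MAX pair is going to cancel out exactly."""
        # Most extreme clues first (largest distance from the neutral 0.5).
        ranked = sorted(word_probs, key=lambda wp: abs(wp[1] - 0.5), reverse=True)
        clues = []
        nmin = nmax = 0
        for word, prob in ranked:
            if prob == MIN_SPAMPROB:
                nmin += 1
            elif prob == MAX_SPAMPROB:
                nmax += 1
            clues.append((word, prob))
            # Each cancelling MIN/MAX pair contributes nothing to the final
            # product, so don't let it count against the discriminator budget.
            if len(clues) - 2 * min(nmin, nmax) >= MAX_DISCRIMINATORS:
                break
        return clues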
[ lots of interesting stuff elided ]

    Tim> What's an acceptable false positive rate?  What do we get from
    Tim> SpamAssassin?  I expect we can end up below 0.1% here, and with a
    Tim> generous meaning for "not spam", but I think *some* of these
    Tim> examples show that the only way to get a 0% false-positive rate is
    Tim> to recode spamprob like so:

I don't know what an acceptable false positive rate is.  I guess it depends on how important those falsies are. ;-)

One thing I think would be worthwhile would be to run GBayes first, then only run stuff it thought was spam through SpamAssassin.  Only messages that both systems categorized as spam would drop into the spam folder.  This has a couple benefits over running one or the other in isolation:

    * The training set for GBayes probably doesn't need to be as big.

    * The two systems use substantially different approaches to identifying
      spam, so I suspect your false positive rate would go way down.  False
      negatives would go up, but only testing can suggest by how much.

    * Since SA is dog slow most of the time, SA users get a big speedup,
      since a substantially smaller fraction of your messages get run
      through it.

This sort of chaining is pretty trivial to set up with procmail.  Dunno what the Windows set will do though.

Skip
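For concreteness, the chaining Skip describes might look roughly like this in Python rather than procmail.  Everything here is an illustrative assumption: the classifier object with a spamprob() method, the tokenize() callable, the 0.90 cutoff, and reliance on spamassassin's -e option (non-zero exit status when the message is judged spam):

    import subprocess

    SPAM_CUTOFF = 0.90   # illustrative; use whatever cutoff testing supports

    def is_spam(classifier, tokenize, raw_msg_bytes):
        """Call a msg spam only if GBayes *and* SpamAssassin both agree."""
        # Cheap test first: the Bayesian classifier.
        if classifier.spamprob(tokenize(raw_msg_bytes)) < SPAM_CUTOFF:
            return False              # GBayes says ham, so SA never runs
        # Expensive test second: SpamAssassin on the raw message.
        result = subprocess.run(["spamassassin", "-e"],
                                input=raw_msg_bytes,
                                stdout=subprocess.DEVNULL)
        return result.returncode != 0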
[Skip Montanaro]
... One thing I think would be worthwhile would be to run GBayes first, then only run stuff it thought was spam through SpamAssassin. Only messages that both systems categorized as spam would drop into the spam folder. This has a couple benefits over running one or the other in isolation:
* The training set for GBayes probably doesn't need to be as big
Training GBayes is cheap, and the more you feed it the less need there is to do information-destroying transformations (like folding case or ignoring punctuation).
* The two systems use substantially different approaches to identifying spam,
Which could indeed be a killer-strong benefit.
so I suspect your false positive rate would go way down.
I'm already having a real problem with this just looking at content:  the false positive rate is already so low that I can't draw statistically significant conclusions about things that may improve it.  For example, if a change removes just *one* false positive in a test run on 4000 hams, the false-positive rate falls by 12.5% -- there simply aren't enough false positives left to support fine-grained judgments.  And, indeed, every time I test a change to the algorithm, the most *significant* thing I find is that it turns up another class of blatant spam hiding in the ham corpus:  my training data is still too dirty, and cleaning it up is labor-intensive.
False negatives would go up, but only testing can suggest by how much.
* Since SA is dog slow most of the time, SA users get a big speedup, since a substantially smaller fraction of your messages get run through it.
This sort of chaining is pretty trivial to set up with procmail. Dunno what the Windows set will do though.
There are different audiences here. Greg is keen to have a better approach for python.org as a whole, while Barry is keen about that and about doing something more generic for Mailman. Windows isn't an issue for either of those. Everyone else can eat cake <wink>.
I've gotten interesting results from this gimmick:

    import re

    url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
    urlfield_re = re.compile(r"[;?:@&=+,$.]")

    def tokenize_url(string):
        for url in url_re.findall(string):
            for i, piece in enumerate(url.lower().split('/')):
                prefix = "url%d:" % i
                for chunk in urlfield_re.split(piece):
                    yield prefix + chunk

    ... (and then do other tokenization) ...

So it splits a case-normalized http thingie via /, tags the first piece "url0:", the second "url1:", and so on.  Within each piece, it splits on separators, like '=' and '.'.  Two particular tokens generated this way then made it into the list of 15 words that most often survived to the end of the scoring step:

    url0:python   as a strong non-spam indicator
    url1:remove   as a strong spam indicator

The rest of the tokenization was unchanged, still doing MIME-ignorant splitting on whitespace.  Just the http gimmick was added, and that alone cut the false negative rate in half.  IOW, there's a *lot* of valuable info in the http thingies!  Not being a Web Guy, I'm not sure how to extract the most info from it.  If you've got suggestions for a better URL tagging strategy, I'd love to hear them.

Cute:  If I tokenize *only* the http thingies, ignoring all other parts of the text, the false positive rate is about 1%.  This is because most legit msgs don't have any http thingies, so they get classified correctly as ham (no tokens at all are generated for them).  This caught at least one spam in the ham corpus (a bogus "false positive"):

    Data/Ham/Set2/8695.txt
    prob = 0.999997392672
    prob('url0:240') = 0.2
    prob('url1:') = 0.612567
    prob('url0:250') = 0.99
    prob('url0:225') = 0.99
    prob('url0:207') = 0.99

    Sweet XXX!
    http://207.240.225.250/ II33bp-]

An example of a real false positive was due to /F including this URL:

    http://w1.132.telia.com/~u13208596/temp/py15-980706.zip

Oddly enough,

    prob('url0:132') = 0.99
    prob('url0:telia') = 0.99

so there was significant spam with "132" and "telia" in the first field of an http thingie.

The false negative rate when tokenizing only http thingies zoomed to over 30%.  Curiously, the best way for a spam to evade this check is *not* to disguise itself with numeric IPs.  Numbers end up looking suspicious.  But, e.g., this looks neutral:

    http://shocking-incest.com
    prob('url0:com') = 0.658328

and it never saw "shocking-incest" before.
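As a quick check of the tokenize_url() gimmick above, here's a standalone snippet.  The regexes are just re-declared from the msg; the sample URL is the /F one mentioned there, and the expected-output comment was worked out by hand from those two regexes:

    import re

    url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
    urlfield_re = re.compile(r"[;?:@&=+,$.]")

    def tokenize_url(string):
        for url in url_re.findall(string):
            for i, piece in enumerate(url.lower().split('/')):
                prefix = "url%d:" % i
                for chunk in urlfield_re.split(piece):
                    yield prefix + chunk

    sample = "see http://w1.132.telia.com/~u13208596/temp/py15-980706.zip for the fix"
    print(list(tokenize_url(sample)))
    # -> ['url0:w1', 'url0:132', 'url0:telia', 'url0:com', 'url1:~u13208596',
    #     'url2:temp', 'url3:py15-980706', 'url3:zip']

Note how "132" and "telia" land in the url0: field -- exactly the tokens that burned /F above.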
Tim Peters <tim.one@comcast.net>:
Spammers often generate random "word-like" gibberish at the ends of msgs, and "rd" is one of the random two-letter combos that appears in the spam corpus. Perhaps it would be good to ignore "words" with fewer than W characters (to be determined by experiment).
Bogofilter throws out words of length one and two.
I expect that including the headers would have given these much better chances of getting through, given Robin and Alex's posting histories. Still, the idea of counting words multiple times is open to question, and experiments both ways are in order.
And bogofilter includes the headers. This is important, since otherwise you don't rate things like spamhaus addresses and sender names. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
[Eric S. Raymond]
Bogofilter throws out words of length one and two.
Right, I saw that.  It's something I'll run experiments against later.  I'm running a 5x5 test grid (skipping the diagonal; a skeleton of such a driver follows this msg), and as was also true in speech recognition, if I had been running against just one spam+ham training corpus and just one spam+ham prediction set, I would have erroneously concluded that various things either are improvements, are regressions, or don't matter.  But some ideas obtained from staring at mistakes from one test run turn out to be irrelevant, or even counter-productive, if applied to other test runs.  The idea that some notion of "word" is important seems highly defensible <wink>, but beyond that I discount claims that aren't derived from a similarly paranoid testing setup.
... And bogofilter includes the headers. This is important, since otherwise you don't rate things like spamhaus addresses and sender names.
Of course -- the reasons I'm not using headers in these particular tests have been spelled out several times. They'll get added later, but for now I don't have a large enough test set where doing so doesn't render the classifier's job trivial.
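To make the "5x5 grid, skipping the diagonal" concrete, here's a minimal skeleton of such a driver.  train() and score() are placeholders for whatever builds and evaluates the classifier, not functions from the checked-in code, and the Data/ paths just mirror the set names used in these runs:

    def run_grid(train, score, nsets=5):
        """Train on each (ham, spam) set pair, then test against every other
        pair -- 5*4 = 20 runs in all."""
        for i in range(1, nsets + 1):
            classifier = train("Data/Ham/Set%d" % i, "Data/Spam/Set%d" % i)
            for j in range(1, nsets + 1):
                if i == j:
                    continue   # skip the diagonal: never test on the training data
                fp, fn = score(classifier,
                               "Data/Ham/Set%d" % j, "Data/Spam/Set%d" % j)
                yield i, j, fp, fn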
[Tim, predicting a false-positive rate]
I expect we can end up below 0.1% here, and with a generous meaning for "not spam",
We're there now, and still ignoring the headers.
but I think *some* of these examples show that the only way to get a 0% false-positive rate is to recode spamprob like so:
def spamprob(self, wordstream, evidence=False):
    return 0.0
Likewise.  I'll check in what I've got after this.  Changes included:

+ Using the email pkg to decode (only) text parts of msgs, and, given multipart/alternative with both text/plain and text/html branches, ignoring the HTML part (else a newbie will never get a msg thru:  all HTML decorations have monster-high spam probabilities).

+ Boosting MAX_DISCRIMINATORS from 15 to 16.

+ Ignoring very short and very long "words" (this is Eurocentric).

+ No longer counting unique words just once, nor an unbounded number of times, in the scoring:  a word is counted at most twice now (a tiny sketch of this and the length filter follows the results below).  This helps otherwise spamish msgs that have *some* highly relevant content, but doesn't, e.g., let spam through just because it says "Python" 80 times at the start.  It helps the false negative rate more, although that may really be due to UNKNOWN_SPAMPROB being too low (UNKNOWN_SPAMPROB is irrelevant to any of the false positives remaining, so I haven't run any tests varying that yet).

I'll attach a complete listing of all false positives across the 20,000 ham msgs I've been using.  People using c.l.py as an HTML clinic are out of luck.  I'd personally call at least 5 of them spam, but I've been very reluctant to throw msgs out of the "good" archive -- nobody would question the ones I did throw out and replace.

The false negative rate is still relatively high.  In part that comes from getting the false positive rate so low (this is very much a tradeoff when both get low!), and in part because the spam corpus has a surprising number of msgs with absolutely nothing in the bodies.  The latter generate no tokens, so end up with "probability" 0.5.  The only thing I tried that cut the false negative rate in a major way was the special parsing+tagging of URLs in the body (see earlier msg), and that was a highly significant aid (it cut the false negative rate in half).  There's good reason to hope that adding headers into the scoring would slash the false negative rate.

Full results across all 20 runs; floats are percentages:

Training on Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
        false positive: 0.025
        false negative: 2.10909090909
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
        false positive: 0.05
        false negative: 2.47272727273
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
        false positive: 0.1
        false negative: 2.50909090909
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
        false positive: 0.0500125031258
        false negative: 2.8

Training on Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        false positive: 0.05
        false negative: 2.8
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
        false positive: 0.075
        false negative: 2.47272727273
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
        false positive: 0.15
        false negative: 2.36363636364
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
        false positive: 0.0500125031258
        false negative: 2.43636363636

Training on Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        false positive: 0.075
        false negative: 3.16363636364
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
        false positive: 0.075
        false negative: 2.43636363636
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
        false positive: 0.15
        false negative: 2.90909090909
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
        false positive: 0.0750187546887
        false negative: 2.61818181818

Training on Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        false positive: 0.1
        false negative: 2.65454545455
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
        false positive: 0.1
        false negative: 1.81818181818
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
        false positive: 0.1
        false negative: 2.25454545455
    testing against Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
        false positive: 0.0750187546887
        false negative: 2.50909090909

Training on Data/Ham/Set5 & Data/Spam/Set5 ... 3999 hams & 2750 spams
    testing against Data/Ham/Set1 & Data/Spam/Set1 ... 4000 hams & 2750 spams
        false positive: 0.075
        false negative: 2.94545454545
    testing against Data/Ham/Set2 & Data/Spam/Set2 ... 4000 hams & 2750 spams
        false positive: 0.05
        false negative: 2.07272727273
    testing against Data/Ham/Set3 & Data/Spam/Set3 ... 4000 hams & 2750 spams
        false positive: 0.1
        false negative: 2.58181818182
    testing against Data/Ham/Set4 & Data/Spam/Set4 ... 4000 hams & 2750 spams
        false positive: 0.15
        false negative: 2.83636363636

The false positive rates vary by a factor of 6.  This isn't significant, because the absolute numbers are so small; 0.025% is a single message, and it never gets higher than 0.150%.  At these rates, I'd need test corpora about 10x larger to draw any fine distinction among false positive rates with high confidence.
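A tiny sketch of the word-length and at-most-twice gimmicks from the change list above.  The particular length bounds are illustrative guesses (the msg doesn't say what was actually used); the cap of two per message is as described:

    MIN_WORD_LEN = 3       # illustrative cut-offs -- the real bounds are a
    MAX_WORD_LEN = 12      # matter for experiment
    MAX_WORD_COUNT = 2     # a word is counted at most twice per message

    def tokenize_capped(text):
        """Split on whitespace, drop very short or very long 'words', and
        emit each surviving word at most MAX_WORD_COUNT times."""
        counts = {}
        for word in text.split():
            if not MIN_WORD_LEN <= len(word) <= MAX_WORD_LEN:
                continue
            n = counts.get(word, 0)
            if n < MAX_WORD_COUNT:
                counts[word] = n + 1
                yield word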