RE: [Python-Dev] The first trustworthy <wink> GBayes results
From: Delaney, Timothy [mailto:tdelaney@avaya.com]
Whether any weighting should be applied to single words or word pairs I don't know - my gut feeling is that they should be weighted the same, but guts are no replacement for empirical evidence.
On second thought - if a word-pair appears, then the separate parts should not be checked as separate words.

So, if I had scores:

    'free' 0.1
    'beer' 0.1
    ('want', 'free',) 0.9
    ('free', 'beer',) 0.01
    ('free', '!!!',) 0.99

then the following phrases would match (case-folding) as:

    'I want free beer!!!':
        ('want', 'free',) 0.9
        ('free', 'beer',) 0.01

    'Get *** for free!!!'
        ('free', '!!!',) 0.99

    'I want free beer. Free the beer!!!'
        ('want', 'free',) 0.9
        ('free', 'beer',) 0.01
        'free' 0.1
        'beer' 0.1

Damn I wish I was at home to try this out ... :(

Tim Delaney
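A minimal sketch (in Python) of the pair-suppression idea above: known word pairs are matched first, and only the words not covered by a matched pair are scored individually. This is not Tim Delaney's code; the scores are just the hypothetical ones from his example, and real tokenization details such as punctuation handling are glossed over.

    def tokens(text, known_pairs):
        # Case-fold and split; a real tokenizer would do better than split().
        words = text.lower().split()
        out = []
        covered = set()
        # First pass: emit every adjacent pair we have a score for.
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            if pair in known_pairs:
                out.append(pair)
                covered.update((i, i + 1))
        # Second pass: words not covered by any matched pair count alone.
        out.extend(w for i, w in enumerate(words) if i not in covered)
        return out

    # The hypothetical scores from the example above:
    scores = {'free': 0.1, 'beer': 0.1,
              ('want', 'free'): 0.9, ('free', 'beer'): 0.01,
              ('free', '!!!'): 0.99}
    pairs = {k for k in scores if isinstance(k, tuple)}
    print(tokens('I want free beer', pairs))
    # -> [('want', 'free'), ('free', 'beer'), 'i']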
[Delaney, Timothy]
On second thought - if a word-pair appears, then the separate parts should not be checked as separate words.
So, if I had scores:
'free' 0.1
'beer' 0.1
('want', 'free',) 0.9
('free', 'beer',) 0.01
('free', '!!!',) 0.99
then the following phrases would match (case-folding) as:
'I want free beer!!!':
('want', 'free',) 0.9
('free', 'beer',) 0.01
'Get *** for free!!!'
('free', '!!!',) 0.99
'I want free beer. Free the beer!!!'
('want', 'free',) 0.9
('free', 'beer',) 0.01
'free' 0.1
'beer' 0.1
Damn I wish I was at home to try this out ... :(
I'm going to say a lot of stuff here, and then shut up <wink>. I want to move on to other things, but there's an opportunity to pass on some darned good advice for those who can hear.

Combining pairs of words is called "word bigrams". My intuition at the start was that it would do better. OTOH, my intuition also was that character n-grams for a relatively large n would do better still. The latter may be so for "foreign" languages, but for this particular task, using Graham's scheme on the c.l.py tests, it turns out they sucked. A comment block in timtest.py explains why. I didn't try word bigrams because the f-p rate is already supernaturally low, so there doesn't seem to be anything left to be gained there. This echoes what Graham sez on his web page:

    One idea that I haven't tried yet is to filter based on word pairs, or
    even triples, rather than individual words. This should yield a much
    sharper estimate of the probability.

My comment, with benefit of hindsight: it doesn't. Because the scoring scheme throws away everything except about a dozen extremes, the "probabilities" that come out are almost always very near 0 or very near 1; only very short or (or especially "and") very bland msgs come out in between. This outcome is largely independent of the tokenization scheme -- the scoring scheme forces it, provided only that the tokenization scheme produces stuff *some* of which *does* vary in frequency between spam and ham.

    For example, in my current database, the word "offers" has a probability
    of .96. If you based the probabilities on word pairs, you'd end up with
    "special offers" and "valuable offers" having probabilities of .99 and,
    say, "approach offers" (as in "this approach offers") having a
    probability of .1 or less.

The theory is indeed appealing <wink>.

    The reason I haven't done this is that filtering based on individual
    words already works so well.

Which is also the reason I didn't pursue it.

    But it does mean that there is room to tighten the filters if spam gets
    harder to detect.

I expect it would also need a different scoring scheme then.

OK, I ran a full test using word bigrams. It gets one strike against it at the start, because the database size grows by a factor between 2 and 3. That's only justified if the results are better. Before-and-after f-p (false positive) percentages:

    before  bigrams
    0.000   0.025
    0.000   0.025
    0.050   0.050
    0.000   0.025
    0.025   0.050
    0.025   0.100
    0.050   0.075
    0.025   0.025
    0.025   0.050
    0.000   0.025
    0.075   0.050
    0.050   0.000
    0.025   0.050
    0.000   0.025
    0.050   0.075
    0.025   0.025
    0.025   0.025
    0.000   0.000
    0.025   0.050
    0.050   0.025

    Lost on 12 runs
    Tied on  5 runs
    Won on   3 runs

    total # of unique fps across all runs rose from 8 to 17

The f-n percentages on the same runs:

    before  bigrams
    1.236   1.091
    1.164   1.091
    1.454   1.708
    1.599   1.563
    1.527   1.491
    1.236   1.127
    1.163   1.345
    1.309   1.309
    1.891   1.927
    1.418   1.382
    1.745   1.927
    1.708   1.963
    1.491   1.782
    0.836   0.800
    1.091   1.127
    1.309   1.309
    1.491   1.709
    1.127   1.018
    1.309   1.018
    1.636   1.672

    Lost on 9 runs
    Tied on 2 runs
    Won on  9 runs

    total # of unique fns across all runs rose from 336 to 350

This doesn't need deep analysis: it costs more, and on the face of it either doesn't help, or helps so little it's not worth the cost.
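For readers who haven't seen Graham's combining rule, here is a minimal sketch of the scoring step being discussed; it is not the timtest.py code, and the cutoff of 15 "discriminators" is Graham's figure, assumed here. Only the tokens whose spam probabilities are farthest from 0.5 are combined, which is why almost every message scores very near 0 or very near 1 regardless of the tokenizer.

    def spamprob(token_probs, max_discriminators=15):
        # Keep only the most extreme evidence; everything else is thrown away.
        extremes = sorted(token_probs, key=lambda p: abs(p - 0.5),
                          reverse=True)[:max_discriminators]
        prod_p = prod_not_p = 1.0
        for p in extremes:
            prod_p *= p
            prod_not_p *= 1.0 - p
        # Graham's combining rule: P = prod(p) / (prod(p) + prod(1 - p)).
        return prod_p / (prod_p + prod_not_p)

    # A few strong clues dominate any number of bland 0.5-ish tokens:
    print(spamprob([0.99, 0.98, 0.96] + [0.5] * 200))   # ~1.0
    print(spamprob([0.01, 0.02] + [0.5] * 200))         # ~0.0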
Now I'll tell you in confidence <wink> that the way to make a scheme like this excellent is to keep your ego out of it and let the data *tell* you what works: getting the best test setup you can is the most important thing you can possibly do, which must include multiple training and test corpora (e.g., if I had used only one pair, I would have had a 3/20 chance of erroneously concluding that bigrams might help the f-p rate, when running across 20 pairs shows that they almost certainly do it harm; while I would have had an even chance of drawing a wrong conclusion -- in either direction -- about the effect on the f-n rate). The second most important thing is to run a fat test all the way to the end before concluding anything.

A subtler point is that you should never keep a change that doesn't *prove* itself a winner: neutral changes bloat your code with proven irrelevancies that will come back to make your life harder later, in part because they'll randomly interfere with future changes in ways that make it harder to recognize a significant change when you stumble into one.

Most things you try won't help -- indeed, many of them will deliver worse results. I dare say my intuition for this kind of classification task is better than most programmers' (in part because I had years of professional experience in a related field), and most of the things I tried I had to throw away. BFD -- then you try something else. When I find something that works I can rationalize it, but when I try something that doesn't, no amount of argument can change that the data said it sucked <wink>.

Two things about *this* task have fooled me repeatedly:

1. The "only look at smoking guns" nature of the scoring step makes many kinds of "on average" intuitions worthless: "on average" almost everything is thrown away! For example, you're not going to find bad results reported for n-grams (neither character- nor word-based) in the literature, because most scoring schemes throw much less away. Graham's scheme strikes me as brilliant in this specific respect: it's worth enduring the ego humiliation <wink> to get such a spectacularly low f-p rate from such simple and fast code.

2. Most mailing-list messages are much shorter than this one. This systematically frustrates "well, averaged over enough words" intuitions too.

Cute: In particular, word bigrams systematically hate conference announcements. The current word one-gram scheme hated them too, until I started folding case. Then their SCREAMING stopped acting against them. But they're still using the language of advertisement, and word bigrams can't help but notice that more strongly than individual words do.
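As an illustration of that testing discipline, here is a sketch only (the `evaluate` callable is an assumed caller-supplied function, not part of timtest.py): the shape of the experiment is many independent train/test corpus pairs, and win/tie/loss counts rather than one lucky run.

    def compare(evaluate, variant_a, variant_b, corpus_pairs):
        # evaluate(variant, pair) -> error rate for that variant, trained and
        # tested on that one corpus pair (an assumption for this sketch).
        wins = ties = losses = 0
        for pair in corpus_pairs:
            err_a = evaluate(variant_a, pair)
            err_b = evaluate(variant_b, pair)
            if err_b < err_a:
                wins += 1        # the candidate change helped on this pair
            elif err_b == err_a:
                ties += 1
            else:
                losses += 1
        return wins, ties, losses

    # The bigram experiment above came out 3 wins / 5 ties / 12 losses on
    # false positives across 20 pairs: a clear "throw it away".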
Here from the TOOLS Europe '99 announcement:

    prob('more information') = 0.916003
    prob('web site') = 0.895518
    prob('please write') = 0.99
    prob('you wish') = 0.984494
    prob('our web') = 0.985578
    prob('visit our') = 0.99

Here from the XP2001 - FINAL CALL FOR PAPERS:

    prob('web site:') = 0.926174
    prob('receive this') = 0.945813
    prob('you receive') = 0.987542
    prob('most exciting') = 0.99
    prob('alberta, canada') = 0.99
    prob('e-mail to:') = 0.99

Here from the XP2002 - CALL FOR PRACTITIONER'S REPORTS ('BOM' is an artificial token I made up for "beginning of message", to give something for the first word in the message to pair up with):

    prob('web site:') = 0.926174
    prob('this announcement') = 0.94359
    prob('receive this') = 0.945813
    prob('forward this') = 0.99
    prob('e-mail to:') = 0.99
    prob('BOM *****') = 0.99
    prob('you receive') = 0.987542

Here from the TOOLS Europe 2000 announcement:

    prob('visit the') = 0.96
    prob('you receive') = 0.967805
    prob('accept our') = 0.99
    prob('our apologies') = 0.99
    prob('quality and') = 0.99
    prob('receive more') = 0.99
    prob('asia and') = 0.99

A vanilla f-p showing where bigrams can hurt was a short msg about setting up a Python user's group. Bigrams gave it large penalties for phrases like "fully functional" (most often seen in spams for bootleg software, but here applied to the proposed user group's web site -- and "web site" is also a strong spam indicator!). OTOH, the poster also said "Aahz rocks". As a bigram, that neither helped nor hurt (that 2-word phrase is unique in the corpus); but as an individual word, "Aahz" is a strong non-spam indicator on c.l.py (and will probably remain so until he starts spamming <wink>).

It did find one spam hiding in a ham corpus:

"""
NNTP-Posting-Host: 212.64.45.236
Newsgroups: comp.lang.python,comp.lang.rexx
Date: Thu, 21 Oct 1999 10:18:52 -0700
Message-ID: <67821AB23987D311ADB100A0241979E5396955@news.ykm.com>
From: znblrn@hetronet.com
Subject: Rudolph The Rednose Hooters Here
Lines: 4
Path: news!uunet!ffx.uu.net!newsfeed.fast.net!howland.erols.net!newsfeed.cwix.com!
news.cfw.com!paxfeed.eni.net!DAIPUB.DataAssociatesInc..com
Xref: news comp.lang.python:74468 comp.lang.rexx:31946
To: python-list@python.org

THis IS it: The site where they talk about when you are 50 years old.

http://huizen.dds.nl/~jansen20
"""

there's-no-substitute-for-experiment-except-drugs-ly y'rs - tim
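For concreteness, a minimal sketch of word-bigram generation with an artificial beginning-of-message token of the kind Tim describes above (this is my wording, not the actual test-driver code; real case-folding and punctuation handling are glossed over):

    def word_bigrams(text):
        # Prepend an artificial token so the first real word has a partner.
        words = ['BOM'] + text.lower().split()
        for first, second in zip(words, words[1:]):
            yield first + ' ' + second

    print(list(word_bigrams('***** CALL FOR PAPERS')))
    # -> ['BOM *****', '***** call', 'call for', 'for papers']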
[Tim Peters]
[... extremely good work and stuff and comments, for a good while now ...]

Hi, Tim. I read your messages, witnessing your work and progress in that area, with great interest, and also saved them for later contemplation! :-)

Spam has always annoyed me, as it has most of us, and despite my many efforts it is increasingly successful at getting through my filters -- so this idea of Graham or Bayesian filters is timely and welcome. Most previous filters I have seen are based on various (random) tests or events (you surely know all this), and `procmail'-based filters, or even the popular SpamAssassin, are slow, sometimes very slow. The tool I have used since 1998 is much faster, especially after I rewrote it in Python! It too is based on various tests and events.

Your work concentrated on tuning the statistical formulas and the lexical analysis, and on building operational data from preset corpora. I'm sure all the knowledge gleaned there will make its way everywhere, and reach me. For my own tiny share, I decided to experiment with the day-to-day user aspects of using such a filter, and built a Gnus interface over Eric Raymond's Bogofilter. There are two functions to this program: one is about learning from messages known to be ham or spam, the other is about classification of incoming messages. By the way, if there are Gnus users among you, just ask me for the recipe... It is going pretty well for me, so far.

The principle, put forward by Paul Graham, is to let the user have two delete commands: delete-as-ham or delete-as-spam. Eric pushed this idea a bit further by postponing learning until the user quits the mail reader, `mutt' in his case. As Gnus allows me to have many mailgroups and folders and to shuffle between them, I postpone learning until the user switches mailgroups or quits, and only for the _final_ disposition of a message: that is, when a message is merely saved into another folder, the decision will be taken when leaving that other folder, and not the current one. Messages marked as "saved" are _not_ sent to the learner, so as to avoid double learning.

The fact is that ham messages are more likely to be postponed than spam, because ham is more often filed here and there. Even if many or most ham messages are deleted, this introduces a short-term bias in the learning statistics by which the percentage of spam seems to be higher (in my case, 1157 messages have been learned in about three days, 20% of which were spam), but this percentage will later be lowered as filed messages get reprocessed. Another effect is that the delay itself in ham learning may have a slight effect on classification, but since both ham and spam are well represented, the effect is likely negligible.

Tim's corpora are surely very clean, at least by now, while day-to-day learning may yield slightly tainted learning. In my case, when a thread does not interest me, I often kill all the articles it contains in one command, without opening each of them to see whether it might be spam: the threading itself makes that unlikely. It is nevertheless possible: you surely noticed that the bad guys now fetch and re-use already-published subjects as a way to get through. That means that while big corpora are thinkable for mailing lists that have existed for a while, they are probably not very usable for individual users. GBayes, Bogofilter and others should ideally resist some amount of ham-tainted-as-spam or spam-tainted-as-ham at learning time.
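A minimal sketch of that deferred-learning bookkeeping (not the actual Gnus/Bogofilter glue; every name here is made up for illustration): the two delete commands only record a verdict, learning happens when the user leaves the mailgroup, and merely-saved messages are skipped so they are learned exactly once, in their final folder.

    pending = {}                    # message-id -> 'ham', 'spam', or 'saved'

    def delete_as_ham(msg_id):
        pending[msg_id] = 'ham'

    def delete_as_spam(msg_id):
        pending[msg_id] = 'spam'

    def save_to_folder(msg_id):
        pending[msg_id] = 'saved'   # decision deferred to the other folder

    def leave_mailgroup(learn):
        # learn(msg_id, label) is assumed to feed the filter, e.g. bogofilter.
        for msg_id, label in list(pending.items()):
            if label in ('ham', 'spam'):
                learn(msg_id, label)
            del pending[msg_id]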
After adding Graham filtering as a supplementary method to my spam detection tool, I gladly observe that it successfully detects many spam messages which would otherwise fall through the cracks, so it really brings something to me. But I also see many spam cases (are they?) that it does not detect and hardly could: one simple example is that _for me_, invalidly structured MIME is indicative of an uninteresting message, as interesting people know better!

One particular problem I observed is Tim's messages themselves, which are undoubtedly very yummy ham messages, but which discuss and quote a lot of spam inside them. Should these be registered as ham or spam? :-) Would not these defeat the learning to some extent? Where should Tim add his own messages in the corpora he uses, and what changes in `GBayes' effectiveness would result?

--
François Pinard http://www.iro.umontreal.ca/~pinard