Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

tim> Straight character n-grams are very appealing because they're the
tim> simplest and most language-neutral; I didn't have any luck with
tim> them over the weekend, but the size of my training data was
tim> trivial.

Anybody up for pooling corpi (corpora?)?

Skip

"SM" == Skip Montanaro <skip@pobox.com> writes:
tim> Straight character n-grams are very appealing because they're
tim> the simplest and most language-neutral; I didn't have any
tim> luck with them over the weekend, but the size of my training
tim> data was trivial.

SM> Anybody up for pooling corpi (corpora?)?

I've got collections from python-dev, python-list, edu-sig,
mailman-developers, and zope3-dev, chopped at Feb 2002, which is
approximately when Greg installed SpamAssassin.  The collections are
not /all/ known good, but pretty close (they should be verified by
hand).  The idea is to take some random subsets of these, cat them
together, and use them as both training and test data, along with some
'net-available known-spam collections.

No more time to play with this today though...

-Barry

"Barry A. Warsaw" wrote:

Some perhaps relevant links (with no off-topic discussion):

* http://www.tuxedo.org/~esr/bogofilter/
* http://www.ai.mit.edu/~jrennie/ifile/
* http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.df...

"""My finding is that it is _nowhere_ near sufficient to have two
populations, "spam" versus "not spam."  If you muddle together the
Nigerian Pyramid schemes with the "Penis enhancement" ads along with
the offers of new credit cards as well as the latest sites where you
can talk to "hot, horny girls LIVE!", the statistics don't work out
nearly so well.  It's hard to tell, on the face of it, why Nigerian
scams _should_ be considered textually similar to phone sex ads, and
in practice, the result of throwing them all together"

There are a few things left to improve about Ifile, and I'd like to
redo it in some language fundamentally less painful to work with than
C"""
-- Paul Prescod

Paul Prescod <paul@prescod.net>:
Some perhaps relevant links (with no off-topic discussion):
I'm in the process of speed-tuning this now. I intend for it to be blazingly fast, usable for sites that process 100K mails a day, and I think I know how to do that. This is not a natural application for Python :-).
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
Well, except it seems to work quite well. The Nigerian trigger-word population is distinct from the penis-enlargement population, but they both show up under Bayesian analysis. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
I'm not sure about that.  The all-Python version I checked in added
20,000 Python-Dev messages to the database in 2 wall-clock minutes.
The time for computing the statistics, and for scoring, is simply
trivial (this wouldn't be true of a "normal" Bayesian classifier (NBC),
but Graham skips most of the work an NBC does, in particular favoring
fast classification time over fast model-update time).

What we anticipate is that the vast bulk of the time will end up
getting spent on better tokenization, such as decoding base64 portions,
and giving special care to header fields and URLs.  I also *suspect*
(based on a previous life in speech recognition) that experiments will
show that a mixture of character n-grams and word bigrams is
significantly more effective than a "naive" tokenizer that just looks
for US ASCII alphanumeric runs.
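The three tokenizer flavors Tim mentions are easy to sketch.  This is a
minimal illustration only -- the function names and the choice of n=3
are assumptions of this sketch, not anything from the thread:

```python
import re

def naive_tokens(text):
    """The 'naive' tokenizer: US-ASCII alphanumeric runs."""
    return re.findall(r"[A-Za-z0-9]+", text)

def char_ngrams(text, n=3):
    """Overlapping character n-grams -- simple and language-neutral."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_bigrams(text):
    """Adjacent word pairs, a cheap step up from single words."""
    words = naive_tokens(text)
    return list(zip(words, words[1:]))
```

Any of the three can feed the same Bayesian scorer; only the feature
extraction changes.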
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
In fact, I'm going to say "Nigerian" and "penis enlargement" one more time each here, just to demonstrate that *this* message won't be a false positive when the smoke settles <wink>. Human Growth Hormone too, while I'm at it.

(Copied to Paul Graham. Paul, this is the mailing list of the Python maintainers. I thought you'd find the bits about lexical analysis in bogofilter interesting. Pythonistas, Paul is one of the smartest and sanest people in the LISP community, as evidenced partly by the fact that he hasn't been too proud to learn some lessons from Python :-). It would be a good thing for some bridges to be built here.) Tim Peters <tim.one@comcast.net>:
This is one of bogofilter's strengths.  It already does this stuff at
the lexical level using a speed-tuned flex scanner (I spent a lot of
the development time watching token strings go by and tweaking the
scanner rules to throw out cruft).

In fact, look at this.  It's a set of lex rules with interpolated
comments:

    BASE64		[A-Za-z0-9/+]
    IPADDR		[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
    MIME_BOUNDARY	^--[^[:blank:]\n]*$

    %%

    # Recognize and discard From_ headers
    ^From\ 			{return(FROM);}

    # Recognize and discard headers that contain dates and tumblers.
    ^Date:.*|Delivery-Date:.*	;
    ^Message-ID:.*		;

    # Throw away BASE64 enclosures.  No point in using this as a
    # discriminator; spam that has these in it always has other triggers
    # in the headers.  Therefore don't pay the overhead to decode it.
    ^{BASE64}+$			;

    # Throw away tumblers generated by MTAs
    ^\tid\ .*			;
    SMTP\ id\ .*		;

    # Toss various meaningless MIME cruft
    boundary=.*			;
    name=\"			;
    filename=\"			;
    {MIME_BOUNDARY}		;

    # Keep IP addresses
    {IPADDR}			{return(TOKEN);}

    # Keep wordlike tokens of length at least three
    [a-z$][a-z0-9$'.-]+[a-z0-9$]	{return(TOKEN);}

    # Everything else, including all one-and-two-character tokens,
    # gets tossed.
    .				;

This small set of rules does a remarkably effective job of tossing out
everything that isn't a potential recognition feature.  Watching the
filtered token stream from a large spam corpus go by is actually quite
an enlightening experience.

It does a better job than Paul's original in one important respect; IP
addresses and hostnames are preserved whole for use as recognition
features.  I think I know why Paul didn't do this -- he's not a C
coder, and if lex/flex isn't part of one's toolkit one's brain doesn't
tend to wander down design paths that involve elaborate lexical
analysis, because it's just too hard.

This is actually the first new program I've coded in C (rather than
Python) in a good four years or so.
There was a reason for this; I have painful experience with doing lexical analysis in Python that tells me flex-generated C will be a major performance win here. The combination of flex and Judy made it a no-brainer.
I share part of your suspicions -- I'm thinking about going to bigram analysis for header lines. But I'm working on getting the framework up first. Feature extraction is orthogonal to both the Bayesian analysis and (mostly) to the data-storage method, and can be a drop-in change if the framework is done right.
I think I know why, too. It's the top-15 selection -- the highest-variance words don't blur into non-spam English the way statistics on *all* tokens would. It's like an edge filter. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Eric> This is one of bogofilter's strengths.  It already does this stuff
Eric> at the lexical level using a speed-tuned flex scanner (I spent a
Eric> lot of the development time watching token strings go by and
Eric> tweaking the scanner rules to throw out cruft).

This reminds me of something which tickled my interest bone the other
day.  The SpamAssassin folks are starting to look at Flex for much
faster regular expression matching in situations where large numbers of
static re's must be matched.

I wonder if using something like SciPy's weave tool would make it
(relatively) painless to incorporate fairly high-speed scanners into
Python programs.  It seems like it would just be an extra layer of
compilation for something like weave.  Instead of inserting C code into
a string, wrapping it with module sticky stuff and compiling it, you'd
insert Flex rules into the string, call a slightly higher-level
function which calls flex to generate the scanner code, and use a
slightly different bit of module sticky stuff to make it callable from
Python.

Skip

Skip Montanaro <skip@pobox.com>:
*snort* Took 'em long enough.  No, I shouldn't be snarky.  Flex is only
obvious to Unix old-timers -- the traditions that gave rise to it have
fallen into desuetude in the last decade.
Lexers are painful in Python. They hit the language in a weak spot created by the immutability of strings. I've found this an obstacle more than once, but then I'm a battle-scarred old compiler jock who attacks *everything* with lexers and parsers. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

>> ...insert Flex rules into the string, call a slightly higher level
>> function which calls flex to generate the scanner code and use a
>> slightly different bit of module sticky stuff to make it callable
>> from Python.

Eric> Lexers are painful in Python.  They hit the language in a weak
Eric> spot created by the immutability of strings.

Yeah, that's why you inline what is essentially a .l file into your
Python code. ;-) I'm actually here in Austin for a couple days visiting
Eric Jones and the SciPy gang.  Perhaps Eric and I can bat something
out over lunch tomorrow...

Skip

I think you're exaggerating the problem, or at least underestimating
the re module.  The re module is pretty fast!

Reading a file line-by-line is very fast in Python 2.3 with the new
"for line in open(filename)" idiom.  I just scanned nearly a megabyte
of ugly data (a Linux kernel) in 0.6 seconds using the regex '\w+',
finding 177,000 words.  The regex '(?:\d+|[a-zA-Z_]+)' took 1 second,
finding 190,000 words.  I expect that the list creation (one hit at a
time) took more time than the matching.

--Guido van Rossum (home page: http://www.python.org/~guido/)
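Guido's measurement is easy to reproduce in spirit.  This sketch times
one `re.findall` pass; the synthetic buffer stands in for the kernel
source (an assumption of this sketch), so absolute timings will differ:

```python
import re
import time

def count_hits(text, pattern):
    """Time one findall pass, the way Guido's measurement works."""
    t0 = time.perf_counter()
    hits = re.findall(pattern, text)
    return len(hits), time.perf_counter() - t0

# Synthetic stand-in for ~800K of C source (not the actual kernel tree).
data = "static int foo_42(void) { return 0x1f; }\n" * 20000

n_words, secs = count_hits(data, r"\w+")              # one \w+ run per token
n_alt, secs2 = count_hits(data, r"(?:\d+|[a-zA-Z_]+)")  # splits digit/letter runs
```

Note the second pattern finds *more* tokens, as in Guido's numbers,
because it splits mixed runs like `foo_42` and `0x1f` at digit/letter
boundaries.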

I'm mildly curious why nobody has suggested mxTextTools or anything like that. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

I'm mildly curious why mxTextTools proponents
eh?  why did my mailer send that mail?

what I was trying to say is that I'm mildly curious why people tend to
treat mxTextTools like some kind of silver bullet, without actually
comparing it to a well-written regular expression.

I've heard from people who've spent days rewriting their application,
only to find that the resulting program was slower.

(as Guido noted, for problems like this, the overhead isn't so much in
the engine itself, as in the effort needed to create Python data
structures...)

</F>

On Wed, Aug 21, 2002, Fredrik Lundh wrote:
Okay, so that's one datapoint. I've never actually used mxTextTools; I'm mostly going by comments Tim Peters has made in the past suggesting that regex tools are poor for parsing. Since he's the one saying that regex is fast enough this time, I figured it'd be an appropriate time to throw up a question. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

On 21 Aug 2002 at 8:28, Aahz wrote:
They suck for parsing. They excel for lexing, however. -- Gordon http://www.mcmillan-inc.com/

aahz> I'm mostly going by comments Tim Peters has made in the past
aahz> suggesting that regex tools are poor for parsing.

parsing != tokenizing. ;-) Regular expressions are great for tokenizing
(most of the time).

Skip
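Skip's point -- regexes shine at tokenizing -- is often realized with a
single alternation of named groups.  A minimal sketch (the token names
and the toy grammar are illustrative, not from the thread):

```python
import re

# One alternation with named groups; the winning group names the token.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<SKIP>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Yield (token-type, token-text) pairs, dropping whitespace."""
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
```

`m.lastgroup` tells you which alternative matched, so one compiled
pattern does the whole job of a classic lexer table.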

On Wed, Aug 21, 2002, Skip Montanaro wrote:
Ah. Here we see one of the little drawbacks of not finishing my CS degree. ;-) Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing, particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology? -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

aahz wrote:
start here: http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=parser
particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
words tend to mean slightly different things in the XML universe, so I'll leave that to the XML experts. </F>

Aahz <aahz@pythoncraft.com>:
It's pretty simple, actually.  Lexing *is* tokenizing; it's breaking
the input stream into appropriate lexical units.  When you say "lexing"
it implies that your tokenizer may be doing other things as well --
handling comment syntax, or gathering low-level semantic information
like "this is a typedef".

Parsing, on the other hand, consists of attempting to match your input
to a grammar.  The result of a parse is typically either "this matches"
or an error.  There are two kinds of parsers -- event generators and
structure builders.  Event generators call designated hook functions
when they recognize a piece of syntax you're interested in.  In
XML-land, SAX is like this.  Structure builders return some data
structure (typically a tree) representing the syntax of your input.  In
XML-land, DOM is like this.

There is a vast literature on parsing.  You don't need to know most of
it.  The key thing to remember is that, except for very simple cases,
writing parsers by hand is usually stupid.  Even when it's simple to
do, machine-generated parsers have better hooks for error recovery.
There are several `parser generator' tools that will compile a grammar
specification to a parser; the best-known one is Bison, an open-source
implementation of the classic Unix tool YACC (Yet Another Compiler
Compiler).

Python has its own parser generator, SPARK.  Unfortunately, while SPARK
is quite powerful (that is, good for handling ambiguities in the spec),
the Earley algorithm it uses gives O(n**3) performance in the generated
parser.  It's not usable for production on larger than toy grammars.

The Python standard library includes a lexer class suitable for a large
class of shell-like syntaxes.  As Guido has pointed out, regexps
provide another attack on the problem.

SPARK can hardly lay claim to the fame of being "Python's own parser
generator".  While it's a parser generator for Python programs and
itself written in Python, it is not distributed with Python.  "Python's
own" would be pgen, which lives in the Parser subdirectory of the
Python source tree.  Pgen is used to parse the Python source code and
construct a parse tree out of it.  As parser generators go, pgen is
appropriately (and pythonically) stupid -- its power is restricted to
that of LL(1) languages, equivalent to recursive-descent parsers.  Its
only interesting feature may be that it uses a regex-like notation to
feed it the grammar for which to generate a parser.  (That's what the
*, ?, [], | and () meta-symbols in the file Grammar/Grammar are for.)

I would note that for small languages (much smaller than Python),
writing a recursive-descent parser by hand is actually one of the most
effective ways of creating a parser.  I recently had the pleasure to
write a recursive-descent parser for a simple Boolean query language;
there was absolutely no need to involve a big gun like a parser
generator.  OTOH I would not consider writing a recursive-descent
parser by hand for Python's Grammar -- that's why I created pgen in the
first place. :-)

Another note for Aahz: when it comes to scanning data that's not really
a programming language, e.g. email messages, the words parsing,
scanning, lexing and tokenizing are often used pretty much
interchangeably.

--Guido van Rossum (home page: http://www.python.org/~guido/)
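Guido doesn't show his Boolean-query parser, but a hand-written
recursive-descent parser for such a language fits on one page.  The
grammar, class name, and tuple-tree output below are all invented for
this sketch:

```python
import re

# Hypothetical grammar (not Guido's actual language):
#   expr   := term ('or' term)*
#   term   := factor ('and' factor)*
#   factor := 'not' factor | '(' expr ')' | WORD
class BoolParser:
    def __init__(self, text):
        self.tokens = re.findall(r"\(|\)|\w+", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, tok=None):
        t = self.peek()
        if tok is not None and t != tok:
            raise SyntaxError("expected %r, got %r" % (tok, t))
        self.pos += 1
        return t

    def expr(self):
        node = self.term()
        while self.peek() == "or":
            self.eat("or")
            node = ("or", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() == "and":
            self.eat("and")
            node = ("and", node, self.factor())
        return node

    def factor(self):
        if self.peek() == "not":
            self.eat("not")
            return ("not", self.factor())
        if self.peek() == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        return ("word", self.eat())
```

One method per grammar rule, each returning a nested tuple -- which is
exactly why R-D parsers are so tweakable by hand.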

On Wed, Aug 21, 2002 at 03:28:51PM -0400, Guido van Rossum wrote:
You might be interested to know that over in GCC land we're changing the C++ front end to use a hand-written recursive descent parser. It's not done yet, but we expect it to be easier to maintain, faster, and better at generating diagnostics than the existing yacc-based parser. zw

"ZW" == Zack Weinberg <zack@codesourcery.com> writes:
ZW> You might be interested to know that over in GCC land we're
ZW> changing the C++ front end to use a hand-written recursive
ZW> descent parser.  It's not done yet, but we expect it to be
ZW> easier to maintain, faster, and better at generating diagnostics
ZW> than the existing yacc-based parser.

LCC also uses a hand-written recursive descent parser, for exactly the
reasons you mention.

Thought I'd also mention a neat new paper about an old algorithm for
recursive descent parsers with backtracking and unlimited lookahead.

    Packrat Parsing: Simple, Powerful, Lazy, Linear Time, Bryan Ford.
    ICFP 2002
    http://www.brynosaurus.com/pub.html

Jeremy

On Wednesday 21 August 2002 09:50 pm, Zack Weinberg wrote: ...
Interesting! This reminds me of a long-ago interview with Borland's techies about how they had managed to create Turbo Pascal, which ran well in a 64K (K, not M-) one-floppy PC, when their competitor, Microsoft Pascal, forced one to do a lot of disc-jockeying even with 256K and 2 floppies. Basically, their take was "we just did everything by the Dragon Book -- except that the parser is a hand-written recursive descent parser [Aho &c being adamant defenders of Yacc & the like], which buys us a lot" ... Alex

Alex Martelli <aleax@aleax.it>:
Even more impressive was the earlier version of Turbo Pascal which ran
on 64K Z80-based CP/M systems!

I have great respect for that one, because in a previous life I used it
to develop a cross-compiler for a Modula-2-like language targeting the
National 32000 architecture.  My compiler consisted of 3 overlays (for
parsing, declaration analysis and code generation), wasn't very fast,
and had so little memory left for a symbol table that it could only
compile very small modules. :-(

In hindsight, my undoing was probably my insistence that the language
not require forward declarations (it was my language, so I could make
it how I wanted).  If I had relaxed that, I could have used a
single-pass design that would have simplified things considerably.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

On Thu, Aug 22, 2002, Greg Ewing wrote:
s/64K/48K/ I believe you could actually theoretically start it up with only 32K of memory, but you couldn't do any real work. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

"GvR" == Guido van Rossum <guido@python.org> writes:
GvR> Another note for Aahz: when it comes to scanning data that's
GvR> not really a programming language, e.g. email messages, the
GvR> words parsing, scanning, lexing and tokenizing are often used
GvR> pretty much interchangeably.

True, although even stuff like email messages is defined by a formal
grammar, i.e. RFC 2822.  email.Generator of course doesn't strictly use
that grammar because it's trying to allow much greater leniency in its
input than a language compiler would.  But note that approaches like
Emacs's mail-extr.el package do in fact try to do more strict parsing
based on the grammar.

-Barry

On Wed, 21 Aug 2002, Guido van Rossum wrote:
As per Zack's comments, I think this is pretty funny.  I have just
spent more time trying to expose pgen to the interpreter than I took to
write an R-D parser for Python 1.3 (granted, once Fred's parser module
came around, I felt a bit silly).

Considering the scope of my parser generator integration wishlist,
having GCC move to a hand-coded recursive descent parser is going to
make my head explode.  Even TenDRA (http://www.tendra.org/) used an
LL(n) parser generator, despite its highly tweaked lookahead code.  So
now I'm going to have to extract grammars from call trees?  As if the
500 languages problem isn't already intractable, there are going to be
popular language implementations that don't even bother with an
abstract syntax specification!?  (Stop me from further hyperbole if I
am incorrect.)

No wonder there are no killer software engineering apps.  Maybe I
should just start writing toy languages for kids... *smirk*

-Jon

[Jonathan Riehl]
It seems a potential lesson went unlearned then <wink>.
Anyone writing an R-D parser by hand without a formal grammar to guide
them is insane.  The formal grammar likely won't capture everything,
though -- but then they never do.
No wonder there are no killer software engineering apps. Maybe I should just start writing toy languages for kids...
Parser generators are great for little languages!  They're painful for
real languages, though, because syntax warts accumulate and then tool
rigidity gets harder to live with.  Hand-crafted R-D parsers are
wonderfully tweakable in intuitive ways (staring at a mountain of
parse-table conflicts and divining how to warp the grammar to shut the
tool up is a black art nobody should regret not learning ...).

15 years of my previous lives were spent as a compiler jockey, working
for HW vendors.  The only time we used a parser generator was the time
we used one written by a major customer, and for political reasons far
more than technical ones.  It worked OK in the end, but it didn't
really save any time.

It did save us from one class of error.  I vividly recall a bug report
against the previous Fortran compiler, where this program line (an
approximation)

    CONTRAST = KNOB ** ROTATION

apparently never got executed.  It appeared to be an optimization bug
at a fundamental level, as there was simply no code generated for this
statement.  After too much digging, we found that the guy who wrote the
Fortran parser had done the equivalent of

    if not statement.has_label() and statement.startswith('CONT'):
        pass  # an unlabelled CONTINUE statement can be ignored

It's just that nobody had started a variable name with those 4 letters
before.  Yikes!  I was afraid to fly for a year after <wink>.

a-class-of-tool-most-appreciated-when-it's-least-needed-ly y'rs - tim

tim wrote:
cf. http://compilers.iecc.com/comparch/article/00-12-106 "For me and C++, [using a parser generator] was a bad mistake." </F>

Aahz <aahz@pythoncraft.com>:
Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing
Lexical analysis, otherwise known as "lexing" or "tokenising", is the process of splitting the input up into a sequence of "tokens", such as (in the case of a programming language) identifiers, operators, string literals, etc. Parsing is the next higher level in the process, which takes the sequence of tokens and recognises language constructs -- statements, expressions, etc.
particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
The lexing and parsing part of compiler technology isn't really any
more arcane than it is for XML or anything else -- exactly the same
principles apply.  It's more a matter of how deeply you want to get
into the theory.

The standard text on this stuff around here seems to be Aho and Ullman,
"The Theory of Parsing, Translation and Compiling", but you might find
that a bit much if all you want to do is parse XML.  It will, however,
give you a good grounding in the theory of REs, various classes of
grammar, different parsing techniques, etc., after which writing an XML
parser will seem like quite a trivial task. :-)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

On 21 Aug 2002 at 10:34, Fredrik Lundh wrote:
mxTextTools lets (encourages?) you to break all the rules about lex -> parse. If you can (& want to) put a good deal of the "parse" stuff into the scanning rules, you can get a speed advantage. You're also not constrained by the rules of BNF, if you choose to see that as an advantage :-). My one successful use of mxTextTools came after using SPARK to figure out what I actually needed in my AST, and realizing that the ambiguities in the grammar didn't matter in practice, so I could produce an almost-AST directly. -- Gordon http://www.mcmillan-inc.com/

[Gordon McMillan]
I don't expect anyone will have much luck writing a fast lexer using mxTextTools *or* Python's regexp package unless they know quite a bit about how each works under the covers, and about how fast lexing is accomplished by DFAs. If you know both, you can build a DFA by hand and painfully instruct mxTextTools in the details of its construction, and get a very fast tokenizer (compared to what's possible with re), regardless of the number of token classes or the complexity of their definitions. Writing to mxTextTools directly is a lot like writing in an assembly language for a character-matching machine, with all the pains and potential joys that implies. If I were Eric, I'd use Flex <wink>.
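Tim's "build a DFA by hand" idea, in miniature: an explicit
state-transition table driving a longest-match scan, with no regex
engine involved.  The two token classes and the table layout are
assumptions of this sketch (and this is plain Python, not the
mxTextTools tag-table notation):

```python
# Character classifier feeding the transition table.
def classify(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"

# state -> {char class -> next state}; a missing entry ends the token.
DFA = {
    "number": {"digit": "number"},
    "word":   {"alpha": "word"},
}
ACCEPT = {"number": "NUMBER", "word": "WORD"}

def dfa_tokens(text):
    """Longest-match scan driven by the table above."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        cls = classify(text[i])
        if cls == "other":            # separators never start a token
            i += 1
            continue
        state = "number" if cls == "digit" else "word"
        j = i + 1
        while j < n:
            nxt = DFA[state].get(classify(text[j]))
            if nxt is None:
                break                 # no transition: token ends here
            state = nxt
            j += 1
        tokens.append((ACCEPT[state], text[i:j]))
        i = j
    return tokens
```

Adding a token class means adding table rows, not rewriting control
flow -- which is why a hand-built DFA stays fast as the number of token
classes grows.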

Tim Peters wrote:
FYI, there are a few meta languages to make life easier for mxTextTools like e.g. Mike Fletcher's SimpleParse. The upcoming version 2.1 will also support Unicode and allows text jump targets which boosts readability of the tag tables a lot and makes hand-writing the tables much easier. The beta of 2.1 is available to the subscribers of the egenix-users mailing list. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

[Eric S. Raymond]
... Lexers are painful in Python.
This is so.
They hit the language in a weak spot created by the immutability of strings.
But you lost me here -- I don't see a connection between immutability and either ease of writing lexers or speed of lexers. Indeed, lexers are (IME) more-or-less exactly as painful and slow written in Perl, where strings are mutable. Seems more to me that lexing is convenient and fast only when expressed in a language specifically designed for writing lexers, and backed by a specialized execution engine that knows a great deal about fast state-machine implementation. Like, say, Flex. Lexing is also clumsy and slow in SNOBOL4 and Icon, despite that they excel at higher-level pattern-matching tasks. IOW, lexing is in a world by itself, almost nothing is good at it, and the few things that shine at it don't do anything else. lexing-is-the-floating-point-execution-unit-of-the-language-world-ly y'rs - tim

tor 2002-08-22 klockan 03.13 skrev Tim Peters:
I've actually found that Haskell's pattern matching and lazy evaluation makes it pretty easy to write lexers. Too bad it's too hard to use Haskell together with other languages :( But then, writing a R-D parser in Haskell is piece-of-cake too :-) Martin

Tim Peters <tim.one@comcast.net>:
It's an implementation problem. You find yourself doing a lot of string accessing and pasting, creating several new objects per input char. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

"Eric S. Raymond" <esr@thyrsus.com>:
Not necessarily!  Plex manages to do it without any of that.  The trick
is to leave all the characters in the input buffer and just *count* how
many characters make up the next token.  Once you've decided where the
token ends, one slice gives it to you.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+
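Greg's trick -- count characters, then take one slice per token --
looks roughly like this (a sketch of the idea, not Plex's actual code;
the alphanumeric-run token rule is an assumption):

```python
# Leave the characters in the buffer; advance an index to find where
# the token ends, then take exactly one slice.  Nothing is pasted
# together character by character.
def next_token(buf, pos):
    """Return (token, new_pos); tokens here are runs of alphanumerics."""
    n = len(buf)
    while pos < n and not buf[pos].isalnum():
        pos += 1                      # skip separators by counting
    start = pos
    while pos < n and buf[pos].isalnum():
        pos += 1                      # count characters in the token
    if start == pos:
        return None, pos              # end of input
    return buf[start:pos], pos        # one slice per token
```

Immutability never gets in the way because the buffer is only ever
indexed and sliced, never modified.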

[Greg Ewing]
Plex is very nice!  It doesn't pass my "convenient and fast" test only
because the DFA at the end still runs at Python speed, and one
character at a time is still mounds slower than it could be in C.

Hmm.  But you can also generate pretty reasonable C code from Python
source now too!  You're going to solve this yet, Greg.

Note that mxTextTools also computes slice indices for "tagging", rather
than build up new string objects.  Heck, that's also why Guido (from
the start) gave the regexp and string match+search gimmicks optional
start-index and end-index arguments too, and why one of the "where did
this group match?" flavors returns slice indices.

I think Eric has spent too much time debugging C lately <wink>.

Hmm. But you can also generate pretty reasonable C code from Python source now too! You're going to solve this yet, Greg.
Yes, probably the first serious use I make of Pyrex will be to
re-implement the inner loop of Plex so it runs at C speed.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

[attribution lost]
[Greg Ewing]
[/F]
you can do that without even looking at the characters?
1-character strings are shared; string[i] doesn't create a new object except for the first time that character is seen. string_item() in particular uses the character found to index into an array of 1-character string objects.
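Tim's claim is easy to observe.  Note this is a CPython implementation
detail (the cache `string_item()` consults), not a language guarantee:

```python
# CPython caches 1-character strings, so indexing returns the same
# object each time rather than allocating a new one.
s = "lexing"
first = s[0]
again = s[0]
assert first is again      # same cached 1-char object both times

# The cache is shared across different source strings, too:
t = "lexer"
assert s[0] is t[0]        # both fetch the cached 'l'
```

So a character-at-a-time scan allocates far less than one might fear --
though, as Tim notes, it is still much slower than a C inner loop.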

you can do that without even looking at the characters?
No, but the original complaint was that immutability of strings made
lexing difficult.  I was pointing out that it's possible to do it
without mutating anything per-character.

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

I haven't given up on the re module for fast scanners (see Tim's note on the speed of tokenizing 20,000 messages in mere minutes). Note that the Bayes approach doesn't *need* a trick to apply many regexes in parallel to the text. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido> I haven't given up on the re module for fast scanners (see Tim's
Guido> note on the speed of tokenizing 20,000 messages in mere minutes).
Guido> Note that the Bayes approach doesn't *need* a trick to apply many
Guido> regexes in parallel to the text.

Right.  I'm thinking of it in situations where you do need such tricks.
SpamAssassin is one such place.  I think Eric has an application
(quickly tokenizing the data produced by an external program, where the
data can run into several hundreds of thousands of lines) where this
might be beneficial as well.

Skip

[Skip Montanaro]
This problem was also vivid in `procmail'-based SPAM filtering, as I observed it many years ago, and I remembered having quite lurked on the side of Flex at the time. I finally solved my own problem by writing a Python program able to sustain thousands of specialised rules rather speedily, proving once again that algorithmic approaches are often much slower than languages :-).
For a pure Python solution, PLEX could also be an avenue. It compiles a fast automaton, similar to what Flex does, from a grammar describing all tokens. I tried PLEX recently and was satisfied, even if not astonished by the speed. Also wanting fast parsing, I first used various Python heuristics, but the result was not solid enough to my taste. So I finally opted for Bison, and once on that road, it was just natural to rewrite the PLEX part in Flex. [Guido van Rossum]
I think you're exaggerating the problem, or at least underestimating the re module. The re module is pretty fast!
There are limits to what a backtracking matcher could do, speed-wise, when there are many hundreds of alternated patterns. [Eric S. Raymond]
Maybe lexing matches a grammar to a point, generates an event according
to the match, advances the cursor and repeats until the end-of-file is
met.  Typically, parsing matches a grammar at point and does not
repeat.  In some less usual applications, there may be successive
lexing stages, each taking its input from the output of the previous
one.  Parsing may be driven in stages too.  I guess that we use the
word `lexer' when the output structure is more flat, and `parser' when
the output structure is more sexy!  There are cases where the
distinction is almost fuzzy.

[Skip Montanaro]
Heap queues could be useful here as well. Suppose you have hundreds of regexps to match in parallel on a big text. When a regexp is not too complex, it is likely that a fast searcher exists for it. Let's build one searcher per regexp. Beginning from the start of the text, find the spot where each regexp first matches. Build a heap queue using the position as key, and both the regexp and match data as value. The top of the heap is the spot of your first match; process it, and while removing it from the heap, search forward from that spot for the same regexp, and add any result back to the heap. Repeat until the heap is empty. I'm not fully sure, but I have the intuition the above could be advantageous. -- François Pinard http://www.iro.umontreal.ca/~pinard
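Pinard's heap-queue idea can be made concrete with the standard `heapq` module (which entered the library in Python 2.3, shortly after this thread). A minimal sketch -- the patterns and text are made up purely for illustration:

```python
import heapq
import re

def parallel_search(patterns, text):
    """Yield (position, pattern_index, match) for every match of every
    pattern, ordered by position, keeping one heap entry per pattern."""
    searchers = [re.compile(p) for p in patterns]
    heap = []
    # Prime the heap with each pattern's first match.
    for i, rx in enumerate(searchers):
        m = rx.search(text, 0)
        if m:
            heapq.heappush(heap, (m.start(), i, m))
    while heap:
        pos, i, m = heapq.heappop(heap)
        yield pos, i, m
        # Re-seed this pattern's searcher from just past the match start.
        nxt = searchers[i].search(text, m.start() + 1)
        if nxt:
            heapq.heappush(heap, (nxt.start(), i, nxt))

matches = [(pos, i) for pos, i, m in
           parallel_search([r"spam", r"s\w+m"], "spam and sm and spim")]
```

Each regexp pays only for its own matches, and the heap merges the per-pattern match streams into position order without ever re-scanning the text.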

[François Pinard]
Sorry, my English is so unclear! I meant that people sometimes say Python is slow, yet because it allows clearer algorithms, one ends up with more speed in Python than other solutions in supposedly faster languages. For these other solutions, the slowness comes from the algorithms. People are often tempted to benchmark languages, while they should rather benchmark ideas. :-) -- François Pinard http://www.iro.umontreal.ca/~pinard

[Eric S. Raymond]
Hi, Paul! I believe Eric copied you on some concerns I had about the underpinnings of the algorithm, specifically about the final "is it spam?" computation. Looking at your links, I bet you got the formula from here: http://www.mathpages.com/home/kmath267.htm

If so, the cause of the difficulty is that you inherited a subtle (because unstated) assumption from that writeup:

    I would suggest that we assume symmetry between "y" and "n". In
    other words, assume that probability of predicting correctly is the
    same regardless of whether the correct answer is "y" or "n".

That part's fine.

    This implies p0=p7, p1=p6, p2=p5, and p3=p4,

But that part doesn't follow from *just* the stated assumptions: note that those four equalities imply that

    p0+p2+p4+p6 = p7+p5+p3+p1

But the left-hand side of that is the probability that event X does not occur (it's all the rows with 'n' in the 'R' column), and the right-hand side is the probability that event X does occur (it's all the rows with 'y' in the 'R' column). In other words, this derivation also makes the stronger-- and unstated --assumption that X occurs with probability 1/2. The ultimate formula given on that page is correct if P(X)=0.5, but it turns out it's wrong if P(X) isn't 0.5. Reality doesn't care how accurate Smith and Jones are; X occurs with its own probability regardless of what they think.

Picture an extreme: suppose reality is such that X *always* occurs. Then p0 must be 0, and so must p2, p4 and p6 (the rows with 'n' in the R column can never happen if R is always 'y'). But then p0+p2+p4+p6 is 0 too, and the equality above implies p7+p5+p3+p1 is also 0. We reach the absurd conclusion that if X always occurs, the probability that X occurs is 0. As approximations to 1 go, 0 could stand some improvement <wink>.

The math is easy enough to repair, but it may percolate into other parts of your algorithm.
Chiefly, I *suspect* you found you needed to boost the "good count" by a factor of 2 because you actually have substantially more non-spam than spam in your inbox, and the scoring step was favoring "spam" more than it should have by virtue of neglecting to correct for the fact that your real-life P(spam) is significantly less than 0.5 (although your training corpora had about the same number of spams as non-spams, so that P(spam)=0.5 was approximately true across your *training* data -- that's another complication). Makes sense? Once our testing setup is trustworthy, I'll try it both ways and report on results. In the meantime, it's something worth pondering.

[Paul Prescod]
Some perhaps relevant links (with no off-topic discussion):
Damn -- wish I'd read that before. Among other things, Eric found a good use for Judy arrays <wink>.
Knew about that. Good stuff.
http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.dfncis.de
Seems confused, assuming Graham's approach is a minor variant of ifile's. But Graham's computation is to classic Bayesian classifiers (like ifile) as Python's lambda is to Lisp's <0.7 wink>. Heart of the confusion:

    Integrating the whole set of statistics together requires adding
    up statistics for _all_ the words found in a message, not just
    the words "sex" and "sexy."

The rub is that Graham doesn't try to add up the statistics for all the words found in a msg. To the contrary, it ends up ignoring almost all of the words. In particular, if the database indicates that "sex" and "sexy" aren't good spam-vs-non-spam discriminators, Graham's approach ignores them completely: their presence or absence doesn't affect the final outcome at all -- it's like the words don't exist. This isn't what ifile does, and ifile probably couldn't get away with it, because it's trying to do N-way classification instead of strictly 2-way. Someone who understands the math and reads Graham's article carefully will likely have a hard time figuring out what Bayes has to do with it at all! I sure did.
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
In ifile I believe that. But the data will speak for itself soon enough, so I'm not going to argue about this.

Tim Peters <tim.one@comcast.net>:
It's a freaking *ideal* use for Judy arrays. Platonically perfect. They couldn't fit better if they'd been designed for this application. Bogofilter was actually born in the moment that I realized this. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
I believe that so long as it stays in memory. But, as you mention in your manpage, startup is too slow for sites handling thousands of mails an hour. That likely makes a Zope OOBTree stored under ZODB a better choice still, as that's designed for efficient update and access in a persistent database (the version of this we've got now does update during scoring, to keep track of when tokens were last used, and how often they've proved useful in discriminating -- there needs to be a way to expire tokens over time, else the database will grow without bound). I've corresponded with Douglas Baskins about "this kind of thing", and he's keen to address it (along with every other problem in the world <0.9 wink>); it would help if HP weren't laying off the people who have worked on this code.

Tim Peters <tim.one@comcast.net>:
VM, dude, VM is your friend. I thought this through carefully. The size of bogofilter's working set isn't limited by core. And because it's a B-tree variant, the access frequency will be proportional to log2 of the wordlist size and the patterns will be spatially bursty. This is a memory access pattern that plays nice with an LRU pager.
I'm working on a simpler solution, one which might have a Pythonic spinoff. Stay tuned. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
VM, dude, VM is your friend. I thought this through carefully. The size of bogofilter's working set isn't limited by core.
Do you expect that to be an issue? When I built a database from 20,000 messages, the whole thing fit in a Python dict consuming about 10MB. That's all. As is always the case in this kind of thing, about half the tokens are utterly useless, since they appear only once in only one message (think of things like misspellings and message ids here -- about half the tokens generated will be "obviously useless", although the useless won't become obvious until, e.g., a month passes and you never see the token again). In addition, this is a statistical-sampling method, and 20,000 messages is a very large sample. I concluded that, in practice, and since we do have ways to identify and purge useless tokens, 5MB is a reasonable upper bound on the size of this thing. It doesn't fit in my L2 cache, but I'd need a magnifying glass to find it in my RAM.

And because it's a B-tree variant, the access frequency will be proportional to log2 of the wordlist size
I don't believe Judy is faster than Python string-keyed dicts at the sizes I'm expecting (I've corresponded with Douglas about that too, and his timing data has a hard time disagreeing <wink>).
and the patterns will be spatially bursty.
Why? Do you sort tokens before looking them up? Else I don't see a reason to expect that, from one lookup to the next, the paths from root to leaf will enjoy spatial overlap beyond the root node.
This is a memory access pattern that plays nice with an LRU pager.
Well, as I said before, all the evidence I've seen says the scoring time for a message is near-trivial (including the lookup times), even in pure Python. It's only the parsing that I'm still worried about, and I may yet confess a bad case of Flex envy.
I figure this means something simpler than a BTree under ZODB. If so, you should set yourself a tougher challenge <wink>.

Tim Peters <tim_one@email.msn.com>:
Hm, that's a bit smaller than I would have thought, but it's the order of magnitude I was expecting.
Recognition features should age! Wow! That's a good point! With the age counter being reset when they're recognized.
and the patterns will be spatially bursty.
Why? Do you sort tokens before looking them up?
I thought part of the point of the method was that you get sorting for free because of the way elements are inserted.
No, but think about how the pointer in a binary search moves. It's spatially bursty. Memory access frequencies for repeated binary searches will be a sum of bursty signals, analogous to the way network traffic volumes look in the time domain. In fact the graph of memory address vs. number of accesses is gonna wind up looking an awful lot like 1/f noise, I think. *Not* evenly distributed; something there for LRU to work with.
What I'm starting to test now is a refactoring of the program where it spawns a daemon version of itself the first time it's called. The daemon eats the wordlists and stays in core fielding requests from subsequent program runs. Basically an answer to the "how do you call bogofilter 1K times a day from procmail without bringing your disks to their knees" problem -- persistence on the cheap. Thing is that the solution to this problem is very generic. Might turn into a Python framework. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>
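The daemon idea -- load the wordlist into memory once, then let cheap short-lived clients query it -- can be sketched with a Unix-domain socket. Everything here (the socket name, the toy one-token protocol, the sample wordlist) is hypothetical, just to show the shape of the approach:

```python
import os
import socket
import tempfile
import threading
import time

def serve_once(sock_path, wordlist):
    """Toy daemon: hold the wordlist in core, answer one token lookup."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    srv.listen(1)
    conn, _ = srv.accept()
    token = conn.recv(1024).decode()
    conn.sendall(str(wordlist.get(token, 0)).encode())
    conn.close()
    srv.close()

path = os.path.join(tempfile.mkdtemp(), "bogod.sock")
threading.Thread(target=serve_once, args=(path, {"viagra": 42}),
                 daemon=True).start()

# A short-lived "procmail" client: connect, ask, exit -- no wordlist load.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
for _ in range(100):                    # wait for the daemon to bind
    try:
        cli.connect(path)
        break
    except (FileNotFoundError, ConnectionRefusedError):
        time.sleep(0.01)
cli.sendall(b"viagra")
reply = cli.recv(1024).decode()
cli.close()
```

The client pays only a socket round trip per message, while the expensive wordlist stays resident in the daemon -- the generic pattern ESR is describing.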

On Wed, Aug 21, 2002 at 01:35:56AM -0400, Eric S. Raymond wrote:
For use at ISPs, the daemon should be able to field requests from lots of different users, maintaining one unified word list. Without needing any access whatsoever to user home directories. zw

Zack Weinberg <zack@codesourcery.com>:
I'm on it. The following is not yet working, but it's a straight road to get there.... There is a public spam-checker port. Your client program sends it packets consisting of a list of header token counts. You can send lots of these blocks; each one has to be under the maximum atomic-message size for sockets (I think that's 32K). The server accumulates the frequency counts you ship it until you say "OK, what is it?" Does the Bayes test. Ships you back a result. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

On Wed, Aug 21, 2002 at 02:22:26AM -0400, Eric S. Raymond wrote:
My ISP-postmaster friend's reaction to that:

| As far as it goes, yes. How would it learn?
|
| On a more mundane note, I'd like to see decoding of base64 in it.
|
| (Oh, and on a blue-sky note, has anyone taken up Graham's suggestion
| of having one of these things that looks at word pairs instead of
| words?)
|
| It's neat that ESR saw immediately that the daemon should be
| self-contained, no access to home directories. SpamAssassin doesn't
| have a simple way of doing that, and [ISP] is modifying it to have
| one -- and you wouldn't believe the resistance to the proposed
| changes from some of the SA developers. Some of them really seem
| to think that it's better and simpler to store user configuration
| in a database than to have the client send its config file to the
| server along with each message.

I remember you said you didn't want to do base64 decode because it was too slow?

zw

Zack Weinberg <zack@codesourcery.com>:
My ISP-postmaster friend's reaction to that:
| As far as it goes, yes. How would it learn?
Your users' mailers would have two delete buttons -- spam and nonspam. On each delete the message would be shipped to bogofilter, which would merge the content into its token lists.
I remember you said you didn't want to do base64 decode because it was too slow?
And not necessary. Base64 spam invariably has telltales that Bayesian analysis will pick up in the headers and MIME cruft. A rather large percentage of it is either big5 or images. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

I'd be curious to know if that will continue to be true in the future. At least one of my non-tech friends sends email that's exclusively HTML (even though the content is very lightly marked-up plain text), from a hotmail account. Spam could easily have the same origin, but the HTML contents would be very different. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org>:
Well, consider. If your friend were to send you base64 mail, it probably would *not* come from one of the spamhaus addresses in bogofilter's wordlists. The presence of base64 content is neutral. That means that about the only way not decoding it could lead to a false positive is if the headers contained spam-correlated tokens which decoding the body would have countered with words having a higher non-spam loading. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Yeah, but not every spammer sends from a well-known spammer's address.
Graham mentions the possibility that spammers can develop ways to make their headers look neutral. When I receive a base64-encoded HTML message from Korea whose subject is "Hi", it could be from a Korean Python hacker (there were 700 of those at a conference Christian Tismer attended in Korea last year, so this is a realistic example), or it could be Korean spam. Decoding the base64 would make it obvious. The headers usually give some clues, but based on what makes it through SpamAssassin (which we've been running for all python.org mail since February or so), base64 encoding scores high on the list of false negatives. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org>:
Noted. I'll take account of that in my planning. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

"ESR" == Eric S Raymond <esr@thyrsus.com> writes:
>> My ISP-postmaster friend's reaction to that:
>> | As far as it goes, yes. How would it learn?

ESR> Your users' mailers would have two delete buttons -- spam and
ESR> nonspam. On each delete the message would be shipped to
ESR> bogofilter, which would merge the content into its token lists.

You need some kind of list admin oversight or your system is open to attack vectors on individual posters.

-Barry

Barry A. Warsaw <barry@python.org>:
You need some kind of list admin oversight or your system is open to attack vectors on individual posters.
Interesting point! -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
I want to raise a caution here. Graham pulled his formulas out of thin air, and one part of the scoring step is quite dubious. This requires detail to understand.

Where X means "word x is present" and similarly for Y, and S means "it's spam" and "not-S" means "it's not spam", and sticking to just the two-word case for simplicity:

    P(S | X and Y) = [by Bayes]

    P(X and Y | S) * P(S) / P(X and Y) = [by the usual expanded form of Bayes]

    P(X and Y | S) * P(S) / (P(S)*P(X and Y | S) + P(not-S)*P(X and Y | not-S))

All that is rigorously correct so far. Now we make the simplifying assumption that puts the "naive" in "naive Bayes": that the probability of X is independent of the probability of Y, so that the conjoined probabilities can be replaced by multiplication of non-conjoined probabilities. This yields

                     P(X|S)*P(Y|S)*P(S)
    ---------------------------------------------------
    P(S)*P(X|S)*P(Y|S) + P(not-S)*P(X|not-S)*P(Y|not-S)

Then, unlike a "normal" formulation of Bayesian classification, Graham's scheme simply doesn't know anything about P(X|S) and P(Y|S) etc. It only knows about probabilities in the other direction (P(S|X) etc). It takes 3 more applications of Bayes to get what we want from what we know. That is,

    P(X|S) = [again by Bayes] P(S|X) * P(X) / P(S)

Plug that in, mutatis mutandis, in six places, to get

        P(S|X)*P(X)/P(S) * P(S|Y)*P(Y)/P(S) * P(S)
    --------------------------------------------------------------
    P(S)*P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S) +
        P(not-S)*P(not-S|X)*P(X)/P(not-S)*P(not-S|Y)*P(Y)/P(not-S)

The factor P(X)*P(Y) cancels out of numerator and denominator, leaving

        P(S|X)/P(S) * P(S|Y)/P(S) * P(S)
    --------------------------------------------------------------
    P(S)*P(S|X)/P(S)*P(S|Y)/P(S) +
        P(not-S)*P(not-S|X)/P(not-S)*P(not-S|Y)/P(not-S)

and simplifying some P(whatever)/P(whatever) instances away gives

                  P(S|X)*P(S|Y)/P(S)
    ---------------------------------------------------
    P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

This isn't what Graham computes, though: the P(S) and P(not-S) terms are missing in his formulation. Given that P(not-S) = 1-P(S), and P(not-S|whatever) = 1-P(S|whatever), what he actually computes is

              P(S|X)*P(S|Y)
    -------------------------------------
    P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)

This is the same as the Bayesian result only if P(S) = 0.5 (in which case all the instances of P(S) and P(not-S) cancel out). Else it's a distortion of the naive Bayesian result.

For this reason, it's best that the number of spam msgs fed into your database be approximately equal to the number of non-spam msgs fed into it: that's the only way to make P(S) ~= P(not-S), so that the distortion doesn't matter. Indeed, it may be that Graham found he had to multiply his "good counts" by 2 in order to make up for the fact that in real life he has twice as many non-spam messages as spam messages in his inbox, but that the final scoring implicitly assumes they're of equal number (and so overly favors the "it's spam" outcome unless the math is fudged elsewhere to make up for that).

It would likely be better still to train the database with a proportion of spam to not-spam messages reflecting what you actually get in your inbox, and change the scoring to use the real-life P(S) and P(not-S) estimates. In that case the "mystery bias" of 2 may actively hurt, overly favoring the "not spam" outcome. Note that Graham said:

    Here's a sketch of how I do statistical filtering. I start
    with one corpus of spam and one of nonspam mail. At the
    moment each one has about 4000 messages in it.
That's consistent with all the above, although it's unclear whether Graham intended "about the same" to be a precondition for using this formulation, or whether fudging elsewhere was introduced empirically to make up for the scoring formula neglecting P(S) and P(not-S) by oversight.

This brings up a couple of questions, one related to the theory behind this Bayesian spam filtering, and one about Python optimization ... apologies in advance for the long post. Tim Peters <tim.one@comcast.net> wrote:
[detail deleted]
Is there an easy fix to this problem? I implemented this in Python after reading about it on the weekend, and it might explain why my results are not quite as fabulous as the author noted (I'm getting more false positives than he claimed he was). Note that I'm not so good with the above notation; I'm more at home with plain algebraic stuff :).

But the more interesting Python question: I'm running into some performance problems with my implementation. Details:

The analysis stage of my implementation (I'll refer to it as "spamalyzer" for now) stores the "mail corpus" and term list on disk. The mail corpus is two dictionaries (one for spam, one for good mail), each of which contains two further dictionaries -- one is the filenames of analyzed messages (one key per filename, values ignored and stored as 0), and the other is a dictionary mapping terms to the number of occurrences. The terms list is a single dictionary mapping terms to a pair of floats (probability of being spam and distance from 0.5).

My first try at this used cPickle to store these items, but loading them back in was excruciatingly slow. From a lightly loaded P3-500/128MB running Linux 2.2.x, each of these is a separate run of a benchmarking Python script:

Loading corpus
--------------
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 65.190000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 64.790000000000 seconds.
pickle method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 65.010000000000 seconds.

Loading terms
-------------
pickle method: got 12986 terms in 3.460000000000 seconds.
pickle method: got 12986 terms in 3.470000000000 seconds.
pickle method: got 12986 terms in 3.450000000000 seconds.

For a lark, I decided to try an alternative way of storing the data (and no, I haven't tried the marshal module directly).
I wrote a function to write the contents of the dictionary to a text file in the form of Python source, so that you can re-load the data with a simple "import" command. To my surprise, this was significantly faster! The first import, of course, takes a while, as the interpreter compiles the .py file to .pyc format, but subsequent runs are an order of magnitude faster than cPickle.load():

Loading corpus
--------------
[charlesc@charon spamalyzer]$ rm mail_corpus.pyc
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 194.210000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.500000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.260000000000 seconds.
custom method: good (1014 files, 289182 terms), spam (156 files, 14089 terms) in 3.260000000000 seconds.

Loading terms
-------------
[charlesc@charon spamalyzer]$ rm terms.pyc
custom method: got 12986 terms in 3.110000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.
custom method: got 12986 terms in 0.210000000000 seconds.

So the big question is, why is my naive "o = __import__ (f, {}, {}, [])" so much faster than the more obvious "o = cPickle.load (f)"? And what can I do to make it faster :).

Charles
--
Charles Cazabon <python@discworld.dyndns.org>
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/

Charles> So the big question is, why is my naive "o = __import__ (f, {},
Charles> {}, [])" so much faster than the more obvious "o = cPickle.load
Charles> (f)"? And what can I do to make it faster :).

Try dumping in the binary format, e.g.:

    s = cPickle.dumps(obj, 1)

Skip
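Skip's tip in runnable form. In modern Python the module is plain `pickle` (the thread's `cPickle` is Python 2); the term dictionary here is made up for illustration. Protocol 0 is the slow ASCII format `cPickle` used by default at the time; any binary protocol (1 or higher) loads far faster:

```python
import pickle

# Hypothetical stand-in for the terms dict: token -> (p_spam, distance).
terms = {"sex": (0.99, 0.49), "python": (0.01, 0.49), "the": (0.50, 0.00)}

ascii_blob = pickle.dumps(terms, protocol=0)    # the old slow default
binary_blob = pickle.dumps(terms, protocol=pickle.HIGHEST_PROTOCOL)

# Both round-trip to the same dict; the binary form is the one to benchmark.
restored = pickle.loads(binary_blob)
```

The second argument to `cPickle.dumps(obj, 1)` in Skip's snippet is exactly this protocol selector.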

[Charles Cazabon]
Is there an easy fix to this problem?
I don't know that there is "a problem". The step is dubious, but other steps are also dubious, and naive Bayes itself is dubious (the assumption that word probabilities are independent is simply false in this application). But outside of, perhaps, quantum chromodynamics, all models of reality are more or less dubious, and it's darned hard to say whether the fudging needed to make them appear to work is lucky or principled, robust or brittle. The more gross deviations there are from a model, though, the less one can appeal to the model for guidance. In the limit, you can end up with a pile of arbitrary tricks, with no idea which gimmicks matter anymore (given enough free parameters to fiddle, you can train even a horribly inappropriate model to fit a specific blob of data exactly).
How many lines of code do you have? That's a gross lower bound on the number of places it might have screwed up <wink>.
Note that I'm not so good with the above notation; I'm more at home with plain algebraic stuff :).
It's all plain-- and simple --algebra, it's just long-winded. You may be confusing yourself, e.g., by reading P(S|X) as if it's a complex expression in its own right. But it's not -- given an incarnation of the universe, it denotes a fixed number. Think of it as "w" instead <wink>.

Let's get concrete. You have a spam corpus with 1000 messages. 100 of them contain the word x, and 500 contain the word y. Then

    P(X|S) = 100/1000 = 1/10
    P(Y|S) = 500/1000 = 1/2

You also have a non-spam corpus with 2000 messages. 100 of them contain x too, and 500 contain y. Then

    P(X|not-S) = 100/2000 = 1/20
    P(Y|not-S) = 500/2000 = 1/4

This is the entire universe, and it's all you know. If you pick a message at random, what's P(S) = the probability that it's from the spam corpus? It's trivial:

    P(S) = 1000/(1000+2000) = 1/3

and P(not-S) = 2/3.

Now *given that* you've picked a message at random, and *know* it contains x, but don't know anything else, what's the probability it's spam (== what's P(S|X)?). Well, it has to be one of the 100 spam messages that contains x, or one of the 100 non-spam messages that contains x. They're all equally likely, so

    P(S|X) = (100+100)/200 = 1/2

and P(S|Y) = (500+500)/500 = 1/2 too by the same reasoning. P(not-S|X) and P(not-S|Y) are also 1/2 each. So far, there's nothing a reasonable person can argue with. Given that this is our universe, these numbers fall directly out of what reasonable people agree "probability" means. When it comes to P(S|X and Y), life is more difficult.
If we *agree* to assume that word probabilities are independent (which is itself dubious, but has the virtue of appearing to work pretty well anyway), then the number of messages in the spam corpus we can expect to contain both X and Y is

    P(X|S)*P(Y|S)*number_spams = (1/10)*(1/2)*1000 = 50

Similarly the # of non-spam messages we can expect to contain both X and Y is

    (1/20)*(1/4)*2000 = 25

Since that's all the messages that contain both X and Y, the probability that a message containing both X and Y is spam is

    P(S | X and Y) = 50/(50 + 25) = 2/3

Note that this agrees with the formula whose derivation I spelled out from first principles:

                  P(S|X)*P(S|Y)/P(S)
    --------------------------------------------------- =
    P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)

       (1/2)*(1/2)/(1/3)                      2
    --------------------------------------  = -
    (1/2)*(1/2)/(1/3) + (1/2)*(1/2)/(2/3)     3

It's educational to work through Graham's formulation on the same example. To start with, P(S|X) is approximated by a different means, and fudging the "good count" by a factor of 2, giving

    P'(S|X) = (100/1000) / (100/1000 + 2*100/2000) = 1/2

and similarly for P'(S|Y). These are the same probabilities I gave above, but the only reason they're the same is because I deliberately picked a spam corpus size exactly half the size of the non-spam corpus, knowing in advance that this factor-of-2 fudge would make the results the same in the end. The only difference in what's computed then is in the scoring step, where Graham's formulation computes

    P'(S | X and Y) = (1/2)*(1/2)/((1/2)*(1/2)+(1/2)*(1/2)) = 1/2

instead of the 2/3 that's actually true in this universe. If the corpus sizes diverge more, the discrepancies at the end grow too, and the way of computing P(S|X) at the start also diverges. Is that good or bad? I say no more now than that it's dubious <wink>.
But the more interesting Python question: I'm running into some performance problems with my implementation. Details:
English never helps with these. Whittle it down and post actual code to comp.lang.python for help. Or study the sandbox code in the CVS repository (see the Subject line of this msg) -- it's not having any speed problems.

[Tim]
And note that only an unreasonable person would argue that (100+100)/200 = 1, so don't even think about it <wink>. Of course those should have been P(S|X) = 100/(100+100) = 1/2 and P(S|Y) = 500/(500+500) = 1/2 There are probably other glitches like that. Work out the example yourself -- it's much easier to figure it out than to transcribe all the tedious numbers into a msg.
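The corrected example is easy to check mechanically. A sketch using exact fractions, with nothing beyond the corpus counts stated above (1000 spams, 2000 non-spams; x in 100 of each, y in 500 of each):

```python
from fractions import Fraction as F

# Corpus counts from the example.
p_s = F(1000, 3000)                       # P(S) = 1/3
p_sx = F(100, 100 + 100)                  # P(S|X) = 1/2 (corrected form)
p_sy = F(500, 500 + 500)                  # P(S|Y) = 1/2

# Naive-Bayes scoring, keeping the P(S) terms:
num = p_sx * p_sy / p_s
bayes = num / (num + (1 - p_sx) * (1 - p_sy) / (1 - p_s))

# Graham's scoring drops the P(S) terms:
graham = p_sx * p_sy / (p_sx * p_sy + (1 - p_sx) * (1 - p_sy))

print(bayes, graham)   # 2/3 versus 1/2
```

Exact rational arithmetic reproduces the discrepancy in the thread: the Bayesian result is 2/3, while the P(S)=0.5 shortcut yields 1/2.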

Tim Peters <tim.one@comcast.net>:
OK. So, maybe I'm just being stupid, but this seems easy to solve. We already *have* estimates of P(S) and P(not-S) -- we have a message count associated with both wordlists. So why not use the running ratios between 'em? As long as we initialize with "good" and "bad" corpora that are approximately the same size, that should work no worse than the equiprobability assumption. The ratios will correct in time based on incoming traffic. Oh, and do you mind if I use your algebra as part of bogofilter's documentation? -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
a. There are other fudges in the code that may rely on this fudge to cancel out, intentionally or unintentionally. I'm loathe to type more about this instead of working on the code, because I've already typed about it. See a later msg for a concrete example of how the factor-of-2 "good count" bias acts in part to counter the distortion here. Take one away, and the other(s) may well become "a problem".

b. Unless the proportion of spam to not-spam in the training sets is a good approximation to the real-life ratio of spam to not-spam, it's also dubious to train the system with bogus P(S) and P(not-S) values.

c. I'll get back to this when our testing infrastructure is trustworthy. At the moment I'm hosed because the spam corpus I pulled off the web turns out to be trivial to recognize in contrast to Barry's corpus of good msgs from python.org mailing lists: every msg in the spam corpus has stuff about the fellow who collected the spam in the headers, while nothing in the python.org corpus does; contrarily, every msg in the python.org corpus has python.org header info not in the spam corpus headers. This is an easy way to get 100% precision and 100% recall, but not particularly realistic -- the rules it's learning are of the form "it's spam if and only if it's addressed to bruceg"; "it's not spam if and only if the headers contain 'List-Unsubscribe'"; etc. The learning can't be faulted, but the teacher can <wink>.

d. I only exposed the math for the two-word case above, and the generalization to n words may not be clear from the final result (although it's clear enough if you back off a few steps). If there are n words, w[0] thru w[n-1]:

    prod1 <- product for i in range(n) of P(S|w[i])/P(S)
    prod2 <- product for i in range(n) of (1-P(S|w[i]))/(1-P(S))
    result <- prod1*P(S) / (prod1*P(S) + prod2*(1-P(S)))

Give that a try if you're better set up to experiment now.
If you do this, the most interesting thing to see is whether results get better or worse if you *also* get rid of the artificial "good count" boost by the factor of 2.
"not spam" is already being given an artificial boost in a couple of ways. Given that in real life most people still get more not-spam than spam, removing the counter-bias in the scoring math may boost the false negative rate.
The ratios will correct in time based on incoming traffic.
Depends on how training is done.
Oh, and do you mind if I use your algebra as part of bogofilter's documentation?
Not at all, although if you wait until we get our version of this ironed out, you'll almost certainly be able to pull an anally-proofread version out of a plain-text doc file I'll feel compelled to write <wink>.
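Tim's n-word generalization in item (d) above translates almost line for line into Python. A sketch with exact fractions; the function name is made up:

```python
from fractions import Fraction as F

def spamprob(p_s, word_probs):
    """Combine per-word P(S|w) values with a real-life prior P(S),
    per item (d): prod1*P(S) / (prod1*P(S) + prod2*(1-P(S)))."""
    prod1 = prod2 = F(1)
    for p in word_probs:
        prod1 *= p / p_s
        prod2 *= (1 - p) / (1 - p_s)
    return prod1 * p_s / (prod1 * p_s + prod2 * (1 - p_s))

# With P(S) = 1/3 and two words each scoring P(S|w) = 1/2, this gives
# the 2/3 from the concrete two-word example earlier in the thread;
# with the implicit P(S) = 1/2 it collapses to Graham's 1/2.
corrected = spamprob(F(1, 3), [F(1, 2), F(1, 2)])
graham_like = spamprob(F(1, 2), [F(1, 2), F(1, 2)])
```

Setting `p_s` to 1/2 makes both products' denominators cancel, which is exactly why Graham's formulation is a special case of this one.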

Tim Peters <tim.one@comcast.net>:
I was thinking of shooting that "goodness bias" through the head and seeing what happens, actually. I've been unhappy with that fudge in Paul's original formula from the beginning.
Right -- which is why I want to experiment with actually *using* the real life running ratio.
Ouch. That's a trap I'll have to watch out for in handling other people's corpora. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

On Sat, Aug 24, 2002, Tim Peters wrote:
as of six months ago, i no longer believe this to be necessarily true -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

| As far as it goes, yes.  How would it learn?

I have some ideas about how you could hook this into Mailman to do
community/membership assisted learning.  Understanding that people will
be highly motivated to inform you about spam but not about good
messages, you essentially queue a copy of a random sampling of messages
for a few days.  Members can let the list admin know about leaked spam
(via a url or -spam address, or whatever) and after the list admin
verifies it, this trains the system on that spam.  If no feedback on a
message happens after a few days, you train the system on that known
good message.  You need list admin verification to avoid attack vectors
(I get mad at Guido so I -- a normal user -- label all his messages as
spam).

| On a more mundane note, I'd like to see decoding of base64 in it.
|
| (Oh, and on a blue-sky note, has anyone taken up Graham's suggestion
| of having one of these things that looks at word pairs instead of
| words?)
|
| It's neat that ESR saw immediately that the daemon should be
| self-contained, no access to home directories.  SpamAssassin doesn't
| have a simple way of doing that, and [ISP] is modifying it to have
| one -- and you wouldn't believe the resistance to the proposed
| changes from some of the SA developers.  Some of them really seem to
| think that it's better and simpler to store user configuration in a
| database than to have the client send its config file to the server
| along with each message.
"ZW" == Zack Weinberg <zack@codesourcery.com> writes:
ZW> I remember you said you didn't want to do base64 decode ZW> because it was too slow? But there might be some interesting, integrated ways around that. Say for example, you take a Python-enabled mail server, parse the message into its decoded form early (but not before low level SMTP-based rejections) and then pass that parsed and decoded message object tree around to all the other subsystems that are interested, e.g. the Bayes filter, and Mailman. You can at least amortize the cost of parsing and decoding once for the rest of the lifetime of that message on your system. I think we have all the pieces in place to play with this approach on python.org. -Barry

"ZW" == Zack Weinberg <zack@codesourcery.com> writes:
ZW> For use at ISPs, the daemon should be able to field requests ZW> from lots of different users, maintaining one unified word ZW> list. Without needing any access whatsoever to user home ZW> directories. An approach like this certainly makes sense for a mailing list server, especially when all the lists are roughly about the same topic. Even without that, I suspect that spam across lists all looks the same, while non-spam will differ so there may be list/site organizations you can exploit here. -Barry

[Tim]
Do you expect that to be an issue? When I built a database from 20,000 messages, the whole thing fit in a Python dict consuming about 10MB.
[Eric S. Raymond]
Hm, that's a bit smaller than I would have thought, but the order of magnitude I was expecting.
It's even smaller than that <wink>. The dict here maps strings to instances of a Python class (WordInfo). The latter uses new-in-2.2 __slots__, and those give major memory efficiencies over old-style classes, but there's still substantial memory overhead compared to what's possible in C. In addition, there are memory overheads for the Python objects stored in the WordInfo instances, including a Python float object in each record recording the time.time() of last access by the scoring method.

IOW, there are tons of memory overheads here, yet the size was still minor. So I have no hesitation leaving this part in Python, and coding this part up was a trivial finger exercise. You know all that, though! It makes your decision to use C from the start hard to fathom.
For concreteness, here's the comment from the Python code, which I
believe is accurate:

    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.

Besides the space-saving gimmick, there may be practical value in
expiring older words that are getting used, but less frequently over
time.  That would be evidence that the nature of the world is changing,
and more aggressively expiring the model for how the world *used* to be
may speed adaptation to the new realities.  I'm not saving enough info
to do that, though, and it's unclear whether it would really help.
Against it, while I see new spam gimmicks pop up regularly, the old
ones never seem to go away (e.g., I don't do anything to try to block
spam on my email accounts, and the bulk of the spam I get is still
easily recognized from the subject line alone).  However, because it's
all written in Python <wink>, it will be very easy to set up
experiments to answer such questions.

BTW, the ifile FAQ gives a little info about the expiration scheme
ifile uses.  Rennie's paper gives more:

    Age is defined as the number of e-mail messages which have been
    added to the model since frequency statistics have been kept for
    the word.  Old, infrequent words are to be dropped while young
    words and old, frequent words should be kept.  One way to quantify
    this is to say that words which occur fewer than log2(age)-1 times
    should be discarded from the model.
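Rennie's cutoff rule translates into a one-line predicate. A hedged sketch follows (the helper name is made up, and the quote leaves the exact boundary behavior ambiguous, so the strict `<` here is an assumption):

```python
import math

def should_expire(count, age):
    """True if a word record should be dropped from the model.

    count: times the word has occurred since stats were kept for it
    age:   messages added to the model since stats were kept for it
    """
    if age < 2:
        # log2(age) - 1 <= 0 here, so young words always survive.
        return False
    return count < math.log2(age) - 1.0
```

For instance, a word seen once in 64 messages is dropped (1 < log2(64)-1 = 5), while a word seen ten times in the same span is kept.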
    For example, if "baseball" occurred in the 1st document and
    occurred 5 or fewer times in the next 63 documents, the word and
    its corresponding statistics would be eliminated from the model's
    database.  This feature selection cutoff is used in ifile and is
    found to significantly improve efficiency without noticeably
    affecting classification performance.

I'm not sure how that policy would work with Graham's scheme (which
has many differences from the more conventional scheme ifile uses).
Our Python code also saves a count of the number of times each word
makes it into Graham's "best 15" list, and I expect that to be a
direct measure of the value we're getting out of keeping a word
("word" means whatever the tokenizer passes to the classifier -- it's
really any hashable and (for now) pickleable Python object).

[on Judy]
I thought part of the point of the method was that you get sorting for free because of the way elements are inserted.
Sure, if range-search or final sorted order is important, it's a great benefit. I was only wondering about why you'd expect spatial locality in the input as naturally ordered.
Judy may or may not be able to exploit something here; I don't know, and I'd need to know a lot more about Judy's implementation to even start to guess. Plain binary search has horrid cache behavior, though. Indeed, most older research papers on B-Trees suggest that binary search doesn't buy anything in even rather large B-Tree nodes, because the cache misses swamp the reduced instruction count over what a simple linear search does. Worse, that linear search can be significantly faster if the HW is smart enough to detect the regular address access pattern of linear search and do some helpful prefetching for you. More recent research on B-Trees is on ways to get away from bad binary search behavior; two current lines are using explicit prefetch instructions to minimize the stalls, and using a more cache-friendly data structure inside a B-Tree node. Out-guessing modern VM implementations is damned hard. With a Python dict you're likely to get a cache miss per lookup. If that's a disaster, time.clock() isn't revealing it <wink>.
Sounds good! Just don't call it "compilerlike" <wink>.

"NS" == Neil Schemenauer <nas@python.ca> writes:
NS> Are you planning to check this into the sandbox? http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox... -Barry

[Tim]
the version of this we've got now does update during scoring
[Neil Schemenauer]
Are you planning to check this into the sandbox?
Update-during-scoring was already in the initial version. This works with a Python dict, though (which Barry pickles and unpickles across runs), not with a persistent database (like ZODB). Changes to use a ZODB BTree would be easy, but not yet most interesting to me. There are many more basic open questions, like which kinds of tokenization ("feature extraction") do and don't work. BTW, that's why the WordInfo records have a .killcount attribute -- the data will tell us which ways work best.

[Skip Montanaro]
Anybody up for pooling corpi (corpora?)?
Barry is collecting clean data from mailing-list archives for lists hosted at python.org. It's unclear that this will be useful for anything other than mailing lists hosted at python.org (which I expect have a lot of topic commonality). There's a lovely spam archive here: http://www.em.ca/~bruceg/spam/

"SM" == Skip Montanaro <skip@pobox.com> writes:
    tim> Straight character n-grams are very appealing because they're
    tim> the simplest and most language-neutral; I didn't have any
    tim> luck with them over the weekend, but the size of my training
    tim> data was trivial.

    SM> Anybody up for pooling corpi (corpora?)?

I've got collections from python-dev, python-list, edu-sig,
mailman-developers, and zope3-dev, chopped at Feb 2002, which is
approximately when Greg installed SpamAssassin.  The collections
aren't /all/ known good, but pretty close (they should be verified by
hand).

The idea is to take some random subsets of these, cat them together
and use them as both training and test data, along with some
'net-available known spam collections.

No more time to play with this today, though...

-Barry

Some perhaps relevant links (with no off-topic discussion):

* http://www.tuxedo.org/~esr/bogofilter/
* http://www.ai.mit.edu/~jrennie/ifile/
* http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.df...

"""My finding is that it is _nowhere_ near sufficient to have two
populations, "spam" versus "not spam."

If you muddle together the Nigerian Pyramid schemes with the "Penis
enhancement" ads along with the offers of new credit cards as well as
the latest sites where you can talk to "hot, horny girls LIVE!", the
statistics don't work out nearly so well.  It's hard to tell, on the
face of it, why Nigerian scams _should_ be considered textually
similar to phone sex ads, and in practice, the result of throwing them
all together"

There are a few things left to improve about Ifile, and I'd like to
redo it in some language fundamentally less painful to work with than
C """

"Barry A. Warsaw" wrote:
-- Paul Prescod

Paul Prescod <paul@prescod.net>:
Some perhaps relevant links (with no off-topic discussion):
I'm in the process of speed-tuning this now. I intend for it to be blazingly fast, usable for sites that process 100K mails a day, and I think I know how to do that. This is not a natural application for Python :-).
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
Well, except it seems to work quite well. The Nigerian trigger-word population is distinct from the penis-enlargement population, but they both show up under Bayesian analysis. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

[Eric S. Raymond]
I'm not sure about that. The all-Python version I checked in added 20,000 Python-Dev messages to the database in 2 wall-clock minutes. The time for computing the statistics, and for scoring, is simply trivial (this wouldn't be true of a "normal" Bayesian classifier (NBC), but Graham skips most of the work an NBC does, in particular favoring fast classification time over fast model-update time).

What we anticipate is that the vast bulk of the time will end up getting spent on better tokenization, such as decoding base64 portions, and giving special care to header fields and URLs. I also *suspect* (based on a previous life in speech recognition) that experiments will show that a mixture of character n-grams and word bigrams is significantly more effective than a "naive" tokenizer that just looks for US ASCII alphanumeric runs.
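A mixed tokenizer of the kind suspected here could be sketched like so. This is illustrative only -- `char_ngrams` and `word_bigrams` are made-up names, not the project's tokenizer:

```python
import re

def char_ngrams(text, n=3):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_bigrams(text):
    """Adjacent pairs of alphanumeric runs, case-folded."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return list(zip(words, words[1:]))

# A feature stream mixing both, versus the "naive" alphanumeric-run
# tokenizer that would emit only single words:
features = char_ngrams("viagra", 3) + word_bigrams("hot horny girls live")
```

Whether this beats the naive tokenizer is exactly the kind of question the testing infrastructure is meant to answer.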
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
In fact, I'm going to say "Nigerian" and "penis enlargement" one more time each here, just to demonstrate that *this* message won't be a false positive when the smoke settles <wink>. Human Growth Hormone too, while I'm at it.

(Copied to Paul Graham. Paul, this is the mailing list of the Python maintainers. I thought you'd find the bits about lexical analysis in bogofilter interesting. Pythonistas, Paul is one of the smartest and sanest people in the LISP community, as evidenced partly by the fact that he hasn't been too proud to learn some lessons from Python :-). It would be a good thing for some bridges to be built here.) Tim Peters <tim.one@comcast.net>:
This is one of bogofilter's strengths. It already does this stuff at the lexical level using a speed-tuned flex scanner (I spent a lot of the development time watching token strings go by and tweaking the scanner rules to throw out cruft). In fact, look at this. It's a set of lex rules with interpolated comments: BASE64 [A-Za-z0-9/+] IPADDR [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+ MIME_BOUNDARY ^--[^[:blank:]\n]*$ %% # Recognize and discard From_ headers ^From\ {return(FROM);} # Recognize and discard headers that contain dates and tumblers. ^Date:.*|Delivery-Date:.* ; ^Message-ID:.* ; # Throw away BASE64 enclosures. No point in using this as a discriminator; # spam that has these in it always has other triggers in the headers. # Therefore don't pay the overhead to decode it. ^{BASE64}+$ ; # Throw away tumblers generated by MTAs ^\tid\ .* ; SMTP\ id\ .* ; # Toss various meaningless MIME cruft boundary=.* ; name=\" ; filename=\" ; {MIME_BOUNDARY} ; # Keep IP addresses {IPADDR} {return(TOKEN);} # Keep wordlike tokens of length at least three [a-z$][a-z0-9$'.-]+[a-z0-9$] {return(TOKEN);} # Everything else, including all one-and-two-character tokens, # gets tossed. . ; This small set of rules does a remarkably effective job of tossing out everything that isn't a potential recognition feature. Watching the filtered token stream from a large spam corpus go by is actually quite an enlightening experience. It does a better job that Paul's original in one important respect; IP addresses and hostnames are preserved whole for use as recognition features. I think I know why Paul didn't do this -- he's not a C coder, and if lex/flex isn't part of one's toolkit one's brain doesn't tend to wander down design paths that involve elaborate lexical analysis, because it's just too hard. This is actually the first new program I've coded in C (rather than Python) in a good four years or so. 
There was a reason for this; I have painful experience with doing lexical analysis in Python that tells me flex-generated C will be a major performance win here. The combination of flex and Judy made it a no-brainer.
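For anyone without flex handy, the keep/toss heart of those rules can be approximated with Python's re module. This is a rough, hypothetical translation for illustration -- bogofilter itself uses the flex scanner above, and this sketch handles only a couple of the discard rules:

```python
import re

# The IPADDR and wordlike-token patterns, lifted from the flex rules.
IPADDR = r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"
WORD = r"[a-z$][a-z0-9$'.-]+[a-z0-9$]"
TOKEN_RE = re.compile("(%s|%s)" % (IPADDR, WORD), re.IGNORECASE)

def tokens(line):
    """Yield the recognition features kept from one message line."""
    # Discard date/tumbler headers outright, as the scanner does.
    if re.match(r"(Date|Delivery-Date|Message-ID):", line):
        return []
    # Everything else: keep IP addresses and wordlike tokens of
    # length at least three; one- and two-character runs fall through.
    return TOKEN_RE.findall(line)
```

This would be far slower than the flex scanner, of course; it's only meant to show what the rules keep and toss.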
I share part of your suspicions -- I'm thinking about going to bigram analysis for header lines. But I'm working on getting the framework up first. Feature extraction is orthogonal to both the Bayesian analysis and (mostly) to the data-storage method, and can be a drop-in change if the framework is done right.
I think I know why, too. It's the top-15 selection -- the highest-variance words don't blur into non-spam English the way statistics on *all* tokens would. It's like an edge filter. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

Eric> This is one of bogofilter's strengths. It already does this stuff Eric> at the lexical level using a speed-tuned flex scanner (I spent a Eric> lot of the development time watching token strings go by and Eric> tweaking the scanner rules to throw out cruft). This reminds me of something which tickled my interesting bone the other day. The SpamAssassin folks are starting to look at Flex for much faster regular expression matching in situations where large numbers of static re's must be matched. I wonder if using something like SciPy's weave tool would make it (relatively) painless to incorporate fairly high-speed scanners into Python programs. It seems like it would just be an extra layer of compilation for something like weave. Instead of inserting C code into a string, wrapping it with module sticky stuff and compiling it, you'd insert Flex rules into the string, call a slightly higher level function which calls flex to generate the scanner code and use a slightly different bit of module sticky stuff to make it callable from Python. Skip

Skip Montanaro <skip@pobox.com>:
*snort* Took 'em long enough. No, I shouldn't be snarky. Flex is only obvious to Unix old-timers -- the traditions that gave rise to it have fallen into desuetude in the last decade.
Lexers are painful in Python. They hit the language in a weak spot created by the immutability of strings. I've found this an obstacle more than once, but then I'm a battle-scarred old compiler jock who attacks *everything* with lexers and parsers. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

>> ...insert Flex rules into the string, call a slightly higher level >> function which calls flex to generate the scanner code and use a >> slightly different bit of module sticky stuff to make it callable >> from Python. Eric> Lexers are painful in Python. They hit the language in a weak Eric> spot created by the immutability of strings. Yeah, that's why you inline what is essentially a .l file into your Python code. ;-) I'm actually here in Austin for a couple days visiting Eric Jones and the SciPy gang. Perhaps Eric and I can bat something out over lunch tomorrow... Skip

I think you're exaggerating the problem, or at least underestimating the re module. The re module is pretty fast! Reading a file line-by-line is very fast in Python 2.3 with the new "for line in open(filename)" idiom. I just scanned nearly a megabyte of ugly data (a Linux kernel) in 0.6 seconds using the regex '\w+', finding 177,000 words. The regex (?:\d+|[a-zA-Z_]+) took 1 second, finding 190,000 words. I expect that the list creation (one hit at a time) took more time than the matching.

--Guido van Rossum (home page: http://www.python.org/~guido/)
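Guido's experiment is easy to reproduce. A hedged sketch (the function name is made up, and any large text file will do in place of the kernel source he used):

```python
import re
import time

def count_words(path, pattern=r"\w+"):
    """Return (word count, elapsed seconds) for one regex scan of a file."""
    t0 = time.time()
    with open(path) as f:
        words = re.findall(pattern, f.read())
    return len(words), time.time() - t0
```

Comparing `r"\w+"` against `r"(?:\d+|[a-zA-Z_]+)"` on the same file reproduces the kind of numbers quoted above, modulo hardware.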

I'm mildly curious why nobody has suggested mxTextTools or anything like that. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

I'm mildly curious why mxTextTools proponents
eh? why did my mailer send that mail? what I was trying to say is that I'm mildly curious why people tend to treat mxTextTools like some kind of silver bullet, without actually comparing it to a well- written regular expression. I've heard from people who've spent days rewriting their application, only to find that the resulting program was slower. (as Guido noted, for problems like this, the overhead isn't so much in the engine itself, as in the effort needed to create Python data structures...) </F>

On Wed, Aug 21, 2002, Fredrik Lundh wrote:
Okay, so that's one datapoint. I've never actually used mxTextTools; I'm mostly going by comments Tim Peters has made in the past suggesting that regex tools are poor for parsing. Since he's the one saying that regex is fast enough this time, I figured it'd be an appropriate time to throw up a question. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

On 21 Aug 2002 at 8:28, Aahz wrote:
They suck for parsing. They excel for lexing, however. -- Gordon http://www.mcmillan-inc.com/

aahz> I'm mostly going by comments Tim Peters has made in the past aahz> suggesting that regex tools are poor for parsing. parsing != tokenizing. ;-) Regular expressions are great for tokenizing (most of the time). Skip
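Skip's point in miniature: regexps make a perfectly serviceable tokenizer even where they'd be a poor parser. The token classes below are invented for a toy expression language, purely for illustration:

```python
import re

TOKEN = re.compile(r"""
      (?P<NUMBER> \d+)            # integer literals
    | (?P<NAME>   [A-Za-z_]\w*)   # identifiers
    | (?P<OP>     [+\-*/()])      # single-character operators
    | (?P<SKIP>   \s+)            # whitespace, discarded
""", re.VERBOSE)

def tokenize(text):
    """Yield (token_class, lexeme) pairs for the input text."""
    for m in TOKEN.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
```

Matching a grammar against the resulting token stream -- parsing -- is where regexps run out of steam, since nesting can't be expressed by a regular language.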

On Wed, Aug 21, 2002, Skip Montanaro wrote:
Ah. Here we see one of the little drawbacks of not finishing my CS degree. ;-) Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing, particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology? -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

aahz wrote:
start here: http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=parser
particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
words tend to mean slightly different things in the XML universe, so I'll leave that to the XML experts. </F>

Aahz <aahz@pythoncraft.com>:
It's pretty simple, actually. Lexing *is* tokenizing; it's breaking the input stream into appropriate lexical units. When you say "lexing" it implies that your tokenizer may be doing other things as well -- handling comment syntax, or gathering low-level semantic information like "this is a typedef".

Parsing, on the other hand, consists of attempting to match your input to a grammar. The result of a parse is typically either "this matches" or an error. There are two kinds of parsers -- event generators and structure builders. Event generators call designated hook functions when they recognize a piece of syntax you're interested in. In XML-land, SAX is like this. Structure builders return some data structure (typically a tree) representing the syntax of your input. In XML-land, DOM is like this.

There is a vast literature on parsing. You don't need to know most of it. The key thing to remember is that, except for very simple cases, writing parsers by hand is usually stupid. Even when it's simple to do, machine-generated parsers have better hooks for error recovery. There are several `parser generator' tools that will compile a grammar specification to a parser; the best-known one is Bison, an open-source implementation of the classic Unix tool YACC (Yet Another Compiler Compiler).

Python has its own parser generator, SPARK. Unfortunately, while SPARK is quite powerful (that is, good for handling ambiguities in the spec), the Earley algorithm it uses gives O(n**3) performance in the generated parser. It's not usable for production on larger than toy grammars.

The Python standard library includes a lexer class suitable for a large class of shell-like syntaxes. As Guido has pointed out, regexps provide another attack on the problem. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>

SPARK can hardly lay claim to being "Python's own parser generator". While it's a parser generator for Python programs and itself written in Python, it is not distributed with Python. "Python's own" would be pgen, which lives in the Parser subdirectory of the Python source tree.

Pgen is used to parse the Python source code and construct a parse tree out of it. As parser generators go, pgen is appropriately (and pythonically) stupid -- its power is restricted to that of LL(1) languages, equivalent to recursive-descent parsers. Its only interesting feature may be that it uses a regex-like notation to feed it the grammar for which to generate a parser. (That's what the *, ?, [], | and () meta-symbols in the file Grammar/Grammar are for.)

I would note that for small languages (much smaller than Python), writing a recursive-descent parser by hand is actually one of the most effective ways of creating a parser. I recently had the pleasure to write a recursive-descent parser for a simple Boolean query language; there was absolutely no need to involve a big gun like a parser generator. OTOH I would not consider writing a recursive-descent parser by hand for Python's Grammar -- that's why I created pgen in the first place. :-)

Another note for Aahz: when it comes to scanning data that's not really a programming language, e.g. email messages, the words parsing, scanning, lexing and tokenizing are often used pretty much interchangeably.

--Guido van Rossum (home page: http://www.python.org/~guido/)
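A hand-written recursive-descent parser for a small Boolean query language, of the sort Guido describes, fits in a few dozen lines. The grammar below is invented for illustration (not Guido's actual language):

```python
# Grammar (one parsing function per rule):
#   expr   : term ('or' term)*
#   term   : factor ('and' factor)*
#   factor : 'not' factor | '(' expr ')' | WORD

def parse(tokens):
    """Parse a token list into a nested-tuple syntax tree."""
    toks = list(tokens) + [None]   # None marks end of input
    pos = [0]

    def peek():
        return toks[pos[0]]

    def eat(expected=None):
        tok = toks[pos[0]]
        if expected is not None and tok != expected:
            raise SyntaxError("expected %r, got %r" % (expected, tok))
        pos[0] += 1
        return tok

    def expr():
        node = term()
        while peek() == "or":
            eat("or")
            node = ("or", node, term())
        return node

    def term():
        node = factor()
        while peek() == "and":
            eat("and")
            node = ("and", node, factor())
        return node

    def factor():
        if peek() == "not":
            eat("not")
            return ("not", factor())
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        return ("word", eat())

    tree = expr()
    if peek() is not None:
        raise SyntaxError("trailing junk: %r" % peek())
    return tree
```

Each grammar rule becomes one function, and precedence falls out of which function calls which -- exactly the tweakability that makes hand-written R-D attractive for little languages.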

On Wed, Aug 21, 2002 at 03:28:51PM -0400, Guido van Rossum wrote:
You might be interested to know that over in GCC land we're changing the C++ front end to use a hand-written recursive descent parser. It's not done yet, but we expect it to be easier to maintain, faster, and better at generating diagnostics than the existing yacc-based parser. zw

"ZW" == Zack Weinberg <zack@codesourcery.com> writes:
ZW> You might be interested to know that over in GCC land we're ZW> changing the C++ front end to use a hand-written recursive ZW> descent parser. It's not done yet, but we expect it to be ZW> easier to maintain, faster, and better at generating diagnostics ZW> than the existing yacc-based parser. LCC also uses a hand-written recursive descent parser, for exactly the reasons you mention. Thought I'd also mention a neat new paper about an old algorithm for recursive descent parsers with backtracking and unlimited lookahead. Packrat Parsing: Simple, Powerful, Lazy, Linear Time, Bryan Ford. ICFP 2002 http://www.brynosaurus.com/pub.html Jeremy

On Wednesday 21 August 2002 09:50 pm, Zack Weinberg wrote: ...
Interesting! This reminds me of a long-ago interview with Borland's techies about how they had managed to create Turbo Pascal, which ran well in a 64K (K, not M-) one-floppy PC, when their competitor, Microsoft Pascal, forced one to do a lot of disc-jockeying even with 256K and 2 floppies. Basically, their take was "we just did everything by the Dragon Book -- except that the parser is a hand-written recursive descent parser [Aho &c being adamant defenders of Yacc & the like], which buys us a lot" ... Alex

Alex Martelli <aleax@aleax.it>:
Even more impressive was the earlier version of Turbo Pascal which ran on 64K Z80-based CP/M systems! I have great respect for that one, because in a previous life I used it to develop a cross-compiler for a Modula-2-like language targeting the National 32000 architecture. My compiler consisted of 3 overlays (for parsing, declaration analysis and code generation), wasn't very fast, and had so little memory left for a symbol table that it could only compile very small modules. :-( In hindsight, my undoing was probably my insistence that the language not require forward declarations (it was my language, so I could make it how I wanted). If I had relaxed that, I could have used a single- pass design that would have simplified things considerably. Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

On Thu, Aug 22, 2002, Greg Ewing wrote:
s/64K/48K/ I believe you could actually theoretically start it up with only 32K of memory, but you couldn't do any real work. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ Project Vote Smart: http://www.vote-smart.org/

"GvR" == Guido van Rossum <guido@python.org> writes:
GvR> Another note for Aahz: when it comes to scanning data that's GvR> not really a programming language, e.g. email messages, the GvR> words parsing, scanning, lexing and tokenizing are often used GvR> pretty much interchangeably. True, although even stuff like email messages are defined by a formal grammar, i.e. RFC 2822. email.Generator of course doesn't strictly use that grammar because it's trying to allow a much greater leniency in its input than a language compiler would. But note that approaches like Emacs's mail-extr.el package do in fact try to do more strict parsing based on the grammar. -Barry

On Wed, 21 Aug 2002, Guido van Rossum wrote:
On Wed, 21 Aug 2002, Guido van Rossum wrote:

[Jonathan Riehl]
It seems a potential lesson went unlearned then <wink>.
Anyone writing an R-D parser by hand without a formal grammar to guide them is insane. The formal grammar likely won't capture everything, though -- but then they never do.
No wonder there are no killer software engineering apps. Maybe I should just start writing toy languages for kids...
Parser generators are great for little languages! They're painful for real languages, though, because syntax warts accumulate and then tool rigidity gets harder to live with. Hand-crafted R-D parsers are wonderfully tweakable in intuitive ways (staring at a mountain of parse-table conflicts and divining how to warp the grammar to shut the tool up is a black art nobody should regret not learning ...). 15 years of my previous lives were spent as a compiler jockey, working for HW vendors. The only time we used a parser generator was the time we used one written by a major customer, and for political reasons far more than technical ones. It worked OK in the end, but it didn't really save any time. It did save us from one class of error. I vividly recall a bug report against the previous Fortran compiler, where this program line (an approximation) CONTRAST = KNOB ** ROTATION apparently never got executed. It appeared to be an optimization bug at a fundamental level, as there was simply no code generated for this statement. After too much digging, we found that the guy who wrote the Fortran parser had done the equivalent of if not statement.has_label() and statement.startswith('CONT'): pass # an unlabelled CONTINUE statement can be ignored It's just that nobody had started a variable name with those 4 letters before. Yikes! I was afraid to fly for a year after <wink>. a-class-of-tool-most-appreciated-when-it's-least-needed-ly y'rs - tim

tim wrote:
cf. http://compilers.iecc.com/comparch/article/00-12-106 "For me and C++, [using a parser generator] was a bad mistake." </F>

Aahz <aahz@pythoncraft.com>:
Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing
Lexical analysis, otherwise known as "lexing" or "tokenising", is the process of splitting the input up into a sequence of "tokens", such as (in the case of a programming language) identifiers, operators, string literals, etc. Parsing is the next higher level in the process, which takes the sequence of tokens and recognises language constructs -- statements, expressions, etc.
particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
The lexing and parsing part of compiler technology isn't really any more arcane than it is for XML or anything else -- exactly the same principles apply. It's more a matter of how deeply you want to get into the theory. The standard text on this stuff around here seems to be Aho and Ullman, "The Theory of Parsing, Translation and Compiling", but you might find that a bit much if all you want to do is parse XML. It will, however, give you a good grounding in the theory of REs, various classes of grammar, different parsing techniques, etc., after which writing an XML parser will seem like quite a trivial task. :-) Greg Ewing, Computer Science Dept, +--------------------------------------+ University of Canterbury, | A citizen of NewZealandCorp, a | Christchurch, New Zealand | wholly-owned subsidiary of USA Inc. | greg@cosc.canterbury.ac.nz +--------------------------------------+

On 21 Aug 2002 at 10:34, Fredrik Lundh wrote:
mxTextTools lets (encourages?) you to break all the rules about lex -> parse. If you can (& want to) put a good deal of the "parse" stuff into the scanning rules, you can get a speed advantage. You're also not constrained by the rules of BNF, if you choose to see that as an advantage :-).

My one successful use of mxTextTools came after using SPARK to figure out what I actually needed in my AST, and realizing that the ambiguities in the grammar didn't matter in practice, so I could produce an almost-AST directly.

-- Gordon
http://www.mcmillan-inc.com/

[Gordon McMillan]
I don't expect anyone will have much luck writing a fast lexer using mxTextTools *or* Python's regexp package unless they know quite a bit about how each works under the covers, and about how fast lexing is accomplished by DFAs. If you know both, you can build a DFA by hand and painfully instruct mxTextTools in the details of its construction, and get a very fast tokenizer (compared to what's possible with re), regardless of the number of token classes or the complexity of their definitions. Writing to mxTextTools directly is a lot like writing in an assembly language for a character-matching machine, with all the pains and potential joys that implies. If I were Eric, I'd use Flex <wink>.
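To make the "assembly language for a character-matching machine" idea concrete, here is a minimal hand-built DFA tokenizer in Python. The states, character classes and token set are invented for illustration; a real Flex-style machine would compile the table to dense arrays and run it in C, which is where the speed comes from:

```python
# Character classes collapse the alphabet; the DFA proper is then just
# a transition table indexed by (state, character class).
def classify(ch):
    if ch.isdigit():
        return 'digit'
    if ch.isalpha() or ch == '_':
        return 'alpha'
    if ch.isspace():
        return 'space'
    return 'other'

DFA = {
    ('START', 'digit'): 'NUM',
    ('NUM',   'digit'): 'NUM',
    ('START', 'alpha'): 'NAME',
    ('NAME',  'alpha'): 'NAME',
    ('NAME',  'digit'): 'NAME',
}

def dfa_tokenize(text):
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if classify(text[i]) == 'space':
            i += 1
            continue
        # Run the machine: one dict lookup per character until it jams.
        state, j = 'START', i
        while j < n:
            nxt = DFA.get((state, classify(text[j])))
            if nxt is None:
                break
            state, j = nxt, j + 1
        if state == 'START':
            tokens.append(('OP', text[i]))   # no rule matched: 1-char token
            i += 1
        else:
            tokens.append((state, text[i:j]))  # final state names the token
            i = j
    return tokens
```

Note that the accepting state doubles as the token type, which is exactly the sort of bookkeeping mxTextTools makes you spell out by hand.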

Tim Peters wrote:
FYI, there are a few meta-languages to make life easier for mxTextTools, e.g. Mike Fletcher's SimpleParse. The upcoming version 2.1 will also support Unicode and allows text jump targets, which boosts readability of the tag tables a lot and makes hand-writing the tables much easier. The beta of 2.1 is available to subscribers of the egenix-users mailing list.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:  http://www.egenix.com/
Python Software:    http://www.egenix.com/files/python/

[Eric S. Raymond]
... Lexers are painful in Python.
This is so.
They hit the language in a weak spot created by the immutability of strings.
But you lost me here -- I don't see a connection between immutability and either ease of writing lexers or speed of lexers. Indeed, lexers are (IME) more-or-less exactly as painful and slow written in Perl, where strings are mutable. Seems more to me that lexing is convenient and fast only when expressed in a language specifically designed for writing lexers, and backed by a specialized execution engine that knows a great deal about fast state-machine implementation. Like, say, Flex. Lexing is also clumsy and slow in SNOBOL4 and Icon, despite that they excel at higher-level pattern-matching tasks. IOW, lexing is in a world by itself, almost nothing is good at it, and the few things that shine at it don't do anything else. lexing-is-the-floating-point-execution-unit-of-the-language-world-ly y'rs - tim

On Thu, 2002-08-22 at 03:13, Tim Peters wrote:
I've actually found that Haskell's pattern matching and lazy evaluation make it pretty easy to write lexers. Too bad it's too hard to use Haskell together with other languages :( But then, writing an R-D parser in Haskell is a piece of cake too :-)

Martin

Tim Peters <tim.one@comcast.net>:
It's an implementation problem. You find yourself doing a lot of string accessing and pasting, creating several new objects per input char.

-- Eric S. Raymond  http://www.tuxedo.org/~esr/

"Eric S. Raymond" <esr@thyrsus.com>:
Not necessarily! Plex manages to do it without any of that. The trick is to leave all the characters in the input buffer and just *count* how many characters make up the next token. Once you've decided where the token ends, one slice gives it to you.

Greg Ewing, greg@cosc.canterbury.ac.nz
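A minimal sketch of that counting trick (the token rules here are invented for illustration): advance an index through the buffer, and only materialise each token with a single slice once its end is known:

```python
def lex_ints_and_ops(buf):
    """Split buf into ('INT', ...) and ('OP', ...) tokens, building
    exactly one new string object per token via one slice."""
    tokens = []
    i, n = 0, len(buf)
    while i < n:
        ch = buf[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            j = i
            while j < n and buf[j].isdigit():
                j += 1                        # just count characters...
            tokens.append(('INT', buf[i:j]))  # ...one slice at the end
            i = j
        else:
            tokens.append(('OP', buf[i:i + 1]))
            i += 1
    return tokens
```

Nothing is mutated and nothing is pasted together; the only per-token allocation is the final slice.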

[Greg Ewing]
Plex is very nice! It doesn't pass my "convenient and fast" test only because the DFA at the end still runs at Python speed, and one character at a time is still mounds slower than it could be in C. Hmm. But you can also generate pretty reasonable C code from Python source now too! You're going to solve this yet, Greg.

Note that mxTextTools also computes slice indices for "tagging", rather than building up new string objects. Heck, that's also why Guido (from the start) gave the regexp and string match+search gimmicks optional start-index and end-index arguments too, and why one of the "where did this group match?" flavors returns slice indices. I think Eric has spent too much time debugging C lately <wink>.
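Those optional start-index arguments remain the way to write an allocation-light scanner with re; a small sketch:

```python
import re

pat = re.compile(r"\d+")
text = "abc 123 def 456"

# Compiled patterns accept an optional start position, so a scanner can
# walk the buffer by index instead of repeatedly slicing the string.
pos, spans = 0, []
while True:
    m = pat.search(text, pos)
    if m is None:
        break
    spans.append(m.span())   # slice indices, not a copy of the text
    pos = m.end()
# spans is now [(4, 7), (12, 15)]; text[4:7] == '123'
```

The only strings ever created are the ones you explicitly slice out at the end, if you want them at all.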

Hmm. But you can also generate pretty reasonable C code from Python source now too! You're going to solve this yet, Greg.
Yes, probably the first serious use I make of Pyrex will be to re-implement the inner loop of Plex so it runs at C speed.

Greg Ewing, greg@cosc.canterbury.ac.nz

[attribution lost]
[Greg Ewing]
[/F]
you can do that without even looking at the characters?
1-character strings are shared; string[i] doesn't create a new object except for the first time that character is seen. string_item() in particular uses the character found to index into an array of 1-character string objects.
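This sharing is easy to observe (it's a CPython implementation detail, not a language guarantee):

```python
s = "lexing"

# Indexing a str yields a length-1 string; CPython keeps a cache of
# single-character strings, so s[i] hands back the shared object
# rather than allocating a new one each time that character appears.
a = s[2]        # 'x'
b = "box"[2]    # also 'x', taken from a different string
same = a is b   # True under CPython: both are the cached 'x'
```

So per-character indexing during lexing costs lookups, not allocations -- which is Tim's point about string_item().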

you can do that without even looking at the characters?
No, but the original complaint was that immutability of strings made lexing difficult. I was pointing out that it's possible to do it without mutating anything per-character.

Greg Ewing, greg@cosc.canterbury.ac.nz

I haven't given up on the re module for fast scanners (see Tim's note on the speed of tokenizing 20,000 messages in mere minutes). Note that the Bayes approach doesn't *need* a trick to apply many regexes in parallel to the text. --Guido van Rossum (home page: http://www.python.org/~guido/)

Guido> I haven't given up on the re module for fast scanners (see Tim's
Guido> note on the speed of tokenizing 20,000 messages in mere minutes).
Guido> Note that the Bayes approach doesn't *need* a trick to apply many
Guido> regexes in parallel to the text.

Right. I'm thinking of it in situations where you do need such tricks. SpamAssassin is one such place. I think Eric has an application (quickly tokenizing the data produced by an external program, where the data can run into several hundreds of thousands of lines) where this might be beneficial as well.

Skip

[Skip Montanaro]
This problem was also vivid in `procmail'-based SPAM filtering, as I observed it many years ago, and I remembered having quite lurked on the side of Flex at the time. I finally solved my own problem by writing a Python program able to sustain thousands of specialised rules rather speedily, proving once again that algorithmic approaches are often much slower than languages :-).
For a pure Python solution, PLEX could also be an avenue. It compiles a fast automaton, similar to what Flex does, from a grammar describing all tokens. I tried PLEX recently and was satisfied, even if not astonished by the speed. Also wanting fast parsing, I first used various Python heuristics, but the result was not solid enough to my taste. So I finally opted for Bison, and once on that road, it was just natural to rewrite the PLEX part in Flex.

[Guido van Rossum]
I think you're exaggerating the problem, or at least underestimating the re module. The re module is pretty fast!
There are limits to what a backtracking matcher could do, speed-wise, when there are many hundreds of alternated patterns.

[Eric S. Raymond]
Maybe lexing matches a grammar up to a point, generates an event according to the match, advances the cursor and repeats until end-of-file is met, while parsing typically matches a grammar at a point and does not repeat. In some less usual applications, there may be successive lexing stages, each taking its input from the output of the previous one. Parsing may be driven in stages too. I guess that we use the word `lexer' when the output structure is more flat, and `parser' when the output structure is more sexy! There are cases where the distinction is almost fuzzy.

[Skip Montanaro]
Heap queues could be useful here as well. Suppose you have hundreds of regexps to match in parallel on a big text. When a regexp is not too complex, it is easier and more likely that a fast searcher exists for it, so let's build one searcher per regexp. Beginning at the start of the text, find the spot where each regexp first matches, and build a heap queue using the position as key and both the regexp and match data as value. The top of the heap is the spot of your first match; process it, and while removing it from the heap, search forward from that spot for the same regexp, adding any result back to the heap. Merely repeat until the heap is empty. I'm not fully sure, but I have the intuition the above could be advantageous.

-- François Pinard   http://www.iro.umontreal.ca/~pinard
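François's scheme maps directly onto the heapq module; a sketch (the function name is invented, and it assumes matches are non-empty so the scan always advances):

```python
import heapq
import re

def parallel_search(patterns, text):
    """Yield (position, pattern_index, match) for every match of every
    pattern, in text order, by merging per-pattern searchers on a heap."""
    searchers = [re.compile(p) for p in patterns]
    heap = []
    for i, rx in enumerate(searchers):
        m = rx.search(text)
        if m:
            # The pattern index breaks ties, so match objects (which
            # don't support comparison) are never compared.
            heapq.heappush(heap, (m.start(), i, m))
    while heap:
        start, i, m = heapq.heappop(heap)
        yield start, i, m
        # Re-search the same pattern from just past this match and push
        # any further hit back onto the heap.
        m2 = searchers[i].search(text, m.end())
        if m2:
            heapq.heappush(heap, (m2.start(), i, m2))
```

Each searcher only ever scans forward, and the heap keeps the merged output in position order without a giant alternation.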

[François Pinard]
Sorry, my English is so unclear! I meant that people sometimes say Python is slow, yet because it allows clearer algorithms, one ends up with more speed in Python than other solutions in supposedly faster languages. For these other solutions, the slowness comes from the algorithms. People are often tempted to benchmark languages, while they should rather benchmark ideas. :-) -- François Pinard http://www.iro.umontreal.ca/~pinard

[Eric S. Raymond]
Hi, Paul! I believe Eric copied you on some concerns I had about the underpinnings of the algorithm, specifically about the final "is it spam?" computation. Looking at your links, I bet you got the formula from here:

    http://www.mathpages.com/home/kmath267.htm

If so, the cause of the difficulty is that you inherited a subtle (because unstated) assumption from that writeup:

    I would suggest that we assume symmetry between "y" and "n". In
    other words, assume that the probability of predicting correctly is
    the same regardless of whether the correct answer is "y" or "n".

That part's fine.

    This implies p0=p7, p1=p6, p2=p5, and p3=p4,

But that part doesn't follow from *just* the stated assumptions: note that those four equalities imply that

    p0+p2+p4+p6 = p7+p5+p3+p1

But the left-hand side of that is the probability that event X does not occur (it's all the rows with 'n' in the 'R' column), and the right-hand side is the probability that event X does occur (it's all the rows with 'y' in the 'R' column). In other words, this derivation also makes the stronger -- and unstated -- assumption that X occurs with probability 1/2. The ultimate formula given on that page is correct if P(X)=0.5, but turns out to be wrong if P(X) isn't 0.5. Reality doesn't care how accurate Smith and Jones are; X occurs with its own probability regardless of what they think.

Picture an extreme: suppose reality is such that X *always* occurs. Then p0 must be 0, and so must p2, p4 and p6 (the rows with 'n' in the R column can never happen if R is always 'y'). But then p0+p2+p4+p6 is 0 too, and the equality above implies p7+p5+p3+p1 is also 0. We reach the absurd conclusion that if X always occurs, the probability that X occurs is 0. As approximations to 1 go, 0 could stand some improvement <wink>.

The math is easy enough to repair, but it may percolate into other parts of your algorithm.
Chiefly, I *suspect* you found you needed to boost the "good count" by a factor of 2 because you actually have substantially more non-spam than spam in your inbox, and the scoring step was favoring "spam" more than it should have by virtue of neglecting to correct for the fact that your real-life P(spam) is significantly less than 0.5 (although your training corpora had about the same number of spams as non-spams, so that P(spam)=0.5 was approximately true across your *training* data -- that's another complication).

Makes sense? Once our testing setup is trustworthy, I'll try it both ways and report on results. In the meantime, it's something worth pondering.
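The repair Tim alludes to is ordinary Bayes with an explicit prior. A tiny numeric sketch (function names invented for illustration) of two independent predictors that are each correct with probability a, both saying "yes": the mathpages-style formula agrees with the full computation only when P(X) = 0.5:

```python
def naive_combined(a):
    # The symmetric formula: implicitly assumes the prior P(X) is 1/2.
    return a * a / (a * a + (1 - a) * (1 - a))

def bayes_combined(a, prior):
    # Full Bayes: weight each hypothesis by its prior probability.
    yes = prior * a * a                    # P(X) * P(both say yes | X)
    no = (1 - prior) * (1 - a) * (1 - a)   # P(~X) * P(both say yes | ~X)
    return yes / (yes + no)
```

With a = 0.9 the two agree at prior 0.5, but at a realistic prior of 0.2 the full computation gives a noticeably smaller probability -- exactly the "favoring spam more than it should" effect described above.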

[Paul Prescod]
Some perhaps relevant links (with no off-topic discusssion):
Damn -- wish I'd read that before. Among other things, Eric found a good use for Judy arrays <wink>.
Knew about that. Good stuff.
http://groups.google.com/groups?selm=ajk8mj%241c3qah%243%40ID-125932.news.dfncis.de
Seems confused, assuming Graham's approach is a minor variant of ifile's. But Graham's computation is to classic Bayesian classifiers (like ifile) as Python's lambda is to Lisp's <0.7 wink>. Heart of the confusion:

    Integrating the whole set of statistics together requires adding up
    statistics for _all_ the words found in a message, not just the
    words "sex" and "sexy."

The rub is that Graham doesn't try to add up the statistics for all the words found in a msg. To the contrary, it ends up ignoring almost all of the words. In particular, if the database indicates that "sex" and "sexy" aren't good spam-vs-non-spam discriminators, Graham's approach ignores them completely: their presence or absence doesn't affect the final outcome at all -- it's like the words don't exist. This isn't what ifile does, and ifile probably couldn't get away with it because it's trying to do N-way classification instead of strictly 2-way. Someone who understands the math and reads Graham's article carefully will likely have a hard time figuring out what Bayes has to do with it at all! I sure did.
"""My finding is that it is _nowhere_ near sufficient to have two populations, "spam" versus "not spam."
In ifile I believe that. But the data will speak for itself soon enough, so I'm not going to argue about this.

Tim Peters <tim.one@comcast.net>:
It's a freaking *ideal* use for Judy arrays. Platonically perfect. They couldn't fit better if they'd been designed for this application. Bogofilter was actually born in the moment that I realized this.

-- Eric S. Raymond  http://www.tuxedo.org/~esr/
participants (21)

- Aahz
- Alex Martelli
- barry@python.org
- Charles Cazabon
- Eric S. Raymond
- Fredrik Lundh
- Gordon McMillan
- Greg Ewing
- Guido van Rossum
- Jeremy Hylton
- Jonathan Riehl
- M.-A. Lemburg
- Martin Sjögren
- Neil Schemenauer
- Paul Prescod
- pinard@iro.umontreal.ca
- Skip Montanaro
- Tim Peters
- Tim Peters
- Tim Peters
- Zack Weinberg