
Aahz <aahz@pythoncraft.com>:
Ah. Here we see one of the little drawbacks of not finishing my CS degree. ;-) Can someone suggest a good simple reference on the distinctions between parsing / lexing / tokenizing, particularly in the context of general string processing (e.g. XML) rather than the arcane art of compiler technology?
It's pretty simple, actually. Lexing *is* tokenizing; it's breaking the input stream into appropopriate lexical units. When you say "lexing" it implies that your tokenizer may be doing other things as well -- handling comment syntax, or gathering low-level semantic information like "this is a typedef". Parsing, on the other hand, consists of attempting to match your input to a grammar. The result of a parse is typically either "this matches" or to throw an error. There are two kinds of parsers -- event generators and structure builders. Event generators call designated hook functions when they recognize a piece of syntax you're interested in. In XML-land, SAX is like this. Structure builders return some data structure (typically a tree) representing the syntax of your input. In XML-land, DOM is like this. There is a vast literature on parsing. You don't need to know most of it. The key thing to remember is that, except for very simple cases, writing parsers by hand is usually stupid. Even when it's simple to do, machine-generated parsers have better hooks for error recovery. There are several `parser generator' tools that will compile a grammar specification to a parser; the best-known one is Bison, an open-source implementation of the classic Unix tool YACC (Yet Another Compiler Compiler). Python has its own parser generator, SPARK. Unfortunately, while SPARK is quite powerful (that is, good for handling ambiguities in the spec), the Earley algorithm it uses gives O(n**3) performance in the generated parser. It's not usable for production on larger than toy grammars. The Python standard library includes a lexer class suitable for a large class of shell-like syntaxes. As Guido has pointed out, regexps provide another attack on the problem. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>