[Python-Dev] Parsing vs. lexing.

Eric S. Raymond esr@thyrsus.com
Wed, 21 Aug 2002 12:54:10 -0400

Aahz <aahz@pythoncraft.com>:
> Ah.  Here we see one of the little drawbacks of not finishing my CS
> degree.  ;-)  Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing, particularly in the
> context of general string processing (e.g. XML) rather than the arcane
> art of compiler technology?

It's pretty simple, actually.  Lexing *is* tokenizing; it's breaking the 
input stream into appropopriate lexical units.  When you say "lexing" it
implies that your tokenizer may be doing other things as well -- handling 
comment syntax, or gathering low-level semantic information like "this 
is a typedef".

Parsing, on the other hand, consists of attempting to match your input
to a grammar.  The result of a parse is typically either "this
matches" or to throw an error.  

There are two kinds of parsers -- event generators and structure builders.
Event generators call designated hook functions when they recognize a
piece of syntax you're interested in.  In XML-land, SAX is like this.
Structure builders return some data structure (typically a tree)
representing the syntax of your input.  In XML-land, DOM is like this.

There is a vast literature on parsing.  You don't need to know most of
it.  The key thing to remember is that, except for very simple cases,
writing parsers by hand is usually stupid.  Even when it's simple to
do, machine-generated parsers have better hooks for error recovery.
There are several `parser generator' tools that will compile a grammar
specification to a parser; the best-known one is Bison, an open-source
implementation of the classic Unix tool YACC (Yet Another Compiler

Python has its own parser generator, SPARK.  Unfortunately, while
SPARK is quite powerful (that is, good for handling ambiguities in the
spec), the Earley algorithm it uses gives O(n**3) performance in the
generated parser. It's not usable for production on larger than toy

The Python standard library includes a lexer class suitable for a large
class of shell-like syntaxes. As Guido has pointed out, regexps provide
another attack on the problem.
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>