[Python-Dev] Parsing vs. lexing.

Guido van Rossum guido@python.org
Wed, 21 Aug 2002 15:28:51 -0400


> There are several `parser generator' tools that will compile a grammar
> specification to a parser; the best-known one is Bison, an open-source
> implementation of the classic Unix tool YACC (Yet Another Compiler
> Compiler).
> 
> Python has its own parser generator, SPARK.  Unfortunately, while
> SPARK is quite powerful (that is, good for handling ambiguities in the
> spec), the Earley algorithm it uses gives O(n**3) worst-case
> performance in the generated parser.  That makes it unusable in
> production for anything much larger than a toy grammar.
> 
> The Python standard library includes a lexer class suitable for a large
> class of shell-like syntaxes. As Guido has pointed out, regexps provide
> another attack on the problem.
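
For concreteness, the stdlib lexer class referred to above is shlex;
a minimal sketch of both approaches (the input line here is just a
made-up example):

    import re, shlex

    line = 'copy "my file.txt" /tmp  # trailing comment'

    # shlex knows about quoting and comments in shell-like input
    print(shlex.split(line, comments=True))
    # -> ['copy', 'my file.txt', '/tmp']

    # for simpler token streams a regexp is often all you need
    print(re.findall(r'"[^"]*"|\S+', line))
    # -> ['copy', '"my file.txt"', '/tmp', '#', 'trailing', 'comment']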

SPARK can hardly lay claim to being "Python's own parser
generator".  While it is a parser generator for Python programs, and
is itself written in Python, it is not distributed with Python.
"Python's own" would be pgen, which lives in the Parser subdirectory
of the Python source tree.  Pgen is used to parse the Python source
code and construct a parse tree out of it.  As parser generators go,
pgen is appropriately (and pythonically) stupid -- its power is
restricted to that of LL(1) languages, equivalent to recursive-descent
parsers.  Its only interesting feature may be that it accepts the
grammar in a regex-like notation.  (That's what the *, ?, [], | and ()
meta-symbols in the file
Grammar/Grammar are for.)
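
For instance, the rule for the "if" statement reads roughly like this
(quoting from memory; () groups, * repeats, [] marks an optional part):

    if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]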

I would note that for small languages (much smaller than Python),
writing a recursive-descent parser by hand is actually one of the most
effective ways of creating a parser.  I recently had the pleasure of
writing a recursive-descent parser for a simple Boolean query language;
there was absolutely no need to involve a big gun like a parser
generator.  OTOH I would not consider writing a recursive-descent
parser by hand for Python's Grammar -- that's why I created pgen in
the first place. :-)
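
A sketch of what such a hand-written parser can look like (the grammar
here -- OR/AND/NOT/parentheses over bare words, with OR binding loosest
-- is an invented stand-in, not the actual query language I wrote it
for).  The structure mirrors the grammar, one method per rule, with a
single token of lookahead:

    import re

    def tokenize(text):
        # Parentheses and words are the only tokens; anything else is
        # an error.  AND, OR and NOT are just words to the lexer.
        tokens = []
        for m in re.finditer(r"\s*(?:(\(|\)|\w+)|(\S))", text):
            if m.group(2):
                raise ValueError("unexpected character %r" % m.group(2))
            tokens.append(m.group(1))
        return tokens

    class Parser:
        # Grammar:
        #   expr   : term ('OR' term)*
        #   term   : factor ('AND' factor)*
        #   factor : 'NOT' factor | '(' expr ')' | WORD
        def __init__(self, tokens):
            self.tokens = tokens
            self.pos = 0

        def peek(self):
            if self.pos < len(self.tokens):
                return self.tokens[self.pos]
            return None

        def advance(self):
            tok = self.peek()
            self.pos = self.pos + 1
            return tok

        def expr(self):
            node = self.term()
            while self.peek() == 'OR':
                self.advance()
                node = ('or', node, self.term())
            return node

        def term(self):
            node = self.factor()
            while self.peek() == 'AND':
                self.advance()
                node = ('and', node, self.factor())
            return node

        def factor(self):
            tok = self.peek()
            if tok == 'NOT':
                self.advance()
                return ('not', self.factor())
            if tok == '(':
                self.advance()
                node = self.expr()
                if self.advance() != ')':
                    raise ValueError("missing closing parenthesis")
                return node
            if tok is None or tok in ('AND', 'OR', ')'):
                raise ValueError("unexpected token %r" % (tok,))
            return ('word', self.advance())

    def parse(text):
        p = Parser(tokenize(text))
        tree = p.expr()
        if p.peek() is not None:
            raise ValueError("trailing junk: %r" % p.peek())
        return tree

    # parse("spam AND (eggs OR NOT ham)")
    # -> ('and', ('word', 'spam'),
    #            ('or', ('word', 'eggs'), ('not', ('word', 'ham'))))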

Another note for Aahz: when it comes to scanning data that's not
really a programming language, e.g. email messages, the words parsing,
scanning, lexing and tokenizing are often used pretty much
interchangeably.

--Guido van Rossum (home page: http://www.python.org/~guido/)