[Python-Dev] RE: Python-Dev digest, Vol 1 #2574 - 14 msgs

Noah noah@noah.org
Wed, 21 Aug 2002 13:42:10 -0700


On Wed, 21 Aug 2002 Aahz wrote:
> On Wed, Aug 21, 2002, Skip Montanaro wrote: 
> > parsing != tokenizing. ;-)
> > Regular expressions are great for tokenizing (most of the time).
> Ah.  Here we see one of the little drawbacks of not finishing my CS
> degree.  ;-)  Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing, particularly in the
> context of general string processing (e.g. XML) rather than the arcane
> art of compiler technology?
> Aahz (aahz@pythoncraft.com)           <*>         http://www.pythoncraft.com/

It's been 8 or 9 years since I took a compiler design class, so this info
is probably WRONG, but as luck would have it I've been reviewing
some of this stuff lately, so I can feel some of the old neuron paths
warming up.

Basically the distinction between a lexer and a parser comes down to
the complexity of the input that each one can recognize.
A lexer (AKA tokenizer) is modeled by a finite state machine (FSM).
An FSM has no stack or memory, just a current state, so it is not
good for things that require nested structure.
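For example, a tiny regex-based tokenizer might look like this (just a
sketch; the token names and patterns are made up for illustration):

    import re

    # One alternation of named groups; whichever group matched tells us
    # the token type.  SKIP tokens (whitespace) are thrown away.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("NAME",   r"[A-Za-z_]\w*"),
        ("OP",     r"[+\-*/=()]"),
        ("SKIP",   r"\s+"),
    ]
    MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

    def tokenize(text):
        tokens = []
        for match in MASTER.finditer(text):
            if match.lastgroup != "SKIP":
                tokens.append((match.lastgroup, match.group()))
        return tokens

    print(tokenize("x = 2 * (3 + 4)"))
    # [('NAME', 'x'), ('OP', '='), ('NUMBER', '2'), ('OP', '*'), ...]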

A parser recognizes more complex structures. Parsers are good for
things like XML and source code, where you have NESTED tree structures
(familiarly known as SYNTAX). If you have to remember how many levels deep
you are in something, then you need a parser. Technically a parser recognizes
a language that can be defined by a context-free grammar, and such a language
can be recognized by a Push Down Automaton (PDA). A PDA is an FSM with
memory: it has at least one stack. The "context free" part means that each
rule of the grammar applies to its piece of the input stream
no matter what came earlier in the stream. ... Sometimes real grammars
are a little dirty and context does matter, but only within a small window.
That means you might have to "look ahead" (or behind) a few symbols to
eliminate some ambiguity. The amount that you have to look ahead sets
the complexity of the grammar. One common family of such parsers is called
LALR ("look-ahead LR"), and the number in parentheses says how many symbols
of look-ahead the grammar needs: a simple grammar with no look-ahead is LR(0),
and a slightly more complex grammar that requires 1 symbol of look-ahead is
LALR(1). We like most parsing to be simple. I think languages like C and
Python get by with a single symbol of look-ahead, so they are roughly LALR(1).
FORTRAN is messier (you can't even tokenize it cleanly without looking ahead),
and I have no idea what it must require to parse Perl.
[Again note: I'm sure I got some details wrong here.]
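To make the stack idea concrete, here is a toy recognizer for nested
brackets (just a sketch). A plain FSM can't count unbounded nesting depth,
but one little stack handles it:

    # PDA-flavored check: push opening brackets, pop and compare on closers.
    PAIRS = {")": "(", "]": "[", "}": "{"}

    def balanced(text):
        stack = []
        for ch in text:
            if ch in "([{":
                stack.append(ch)
            elif ch in PAIRS:
                if not stack or stack.pop() != PAIRS[ch]:
                    return False
        return not stack    # True only if every opener was closed

    print(balanced("f(a[0], {1: (2, 3)})"))   # True
    print(balanced("f(a[0)]"))                # False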

If you go further in complexity and you want to actually evaluate expressions,
then you need a Turing Machine (TM). These are problems where
a context-free grammar cannot be rescued with a bit of look-ahead.
Another way to think about it: if your input is so complex
that it must be described algorithmically, then
you need a TM. For example, neither an FSM nor a PDA can
recognize an irrational number like SQRT(2) or pi.
Nor can they recognize the truth of an expression like "2*5=10"
(although a PDA can recognize that it's a valid expression).
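A rough way to see that recognition/evaluation split using Python itself
(Python spells equality "==", not "=", so the example is adjusted):

    expr = "2 * 5 == 10"

    # Step 1: is it a well-formed expression?  A pure parsing question.
    try:
        compile(expr, "<string>", "eval")
        print("well-formed expression")
    except SyntaxError:
        print("not a valid expression")

    # Step 2: is it TRUE?  That takes actual computation, beyond parsing.
    print(eval(expr))    # True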

RegExs are hybrids of FSM and PDA and are fine for ad hoc lexing.
They are not very good for parsing. I think older-style RegExs started
life as pure FSMs, but newer flavors made popular by Perl added memory
(backreferences and the like) and crept beyond that... or something like
that. But they are still limited and not as powerful as a real PDA, so
they can't parse arbitrarily nested structure.
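A quick illustration of that extra memory: a backreference lets a regex
insist that two parts of the string match each other exactly, which no
pure FSM can do:

    import re

    # (a+)b\1 matches "n a's, then b, then the same n a's" -- not a
    # regular language in the textbook sense.
    pat = re.compile(r"^(a+)b\1$")
    print(bool(pat.match("aaabaaa")))   # True  (3 a's on each side)
    print(bool(pat.match("aaabaa")))    # False (counts don't match)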

Traditionally C programmers used a combination of LEX and YACC
(or their free counterparts, flex and Bison) to build parsers.
You really only need YACC, but the problem is so much simpler if
the input stream is tokenized before you try to parse it,
which is why you also use LEX.
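Here's a toy version of that two-stage idea in Python (a hand-rolled
recursive-descent parser, not LEX/YACC, and the little grammar is made up
for the example): tokenize first, then parse the token stream.

    import re

    def tokenize(text):
        # Crude lexer: numbers and the operators we care about.
        return re.findall(r"\d+|[+*()]", text)

    def parse_expr(tokens):      # expr := term ('+' term)*
        value = parse_term(tokens)
        while tokens and tokens[0] == "+":
            tokens.pop(0)
            value += parse_term(tokens)
        return value

    def parse_term(tokens):      # term := factor ('*' factor)*
        value = parse_factor(tokens)
        while tokens and tokens[0] == "*":
            tokens.pop(0)
            value *= parse_factor(tokens)
        return value

    def parse_factor(tokens):    # factor := NUMBER | '(' expr ')'
        tok = tokens.pop(0)
        if tok == "(":
            value = parse_expr(tokens)
            tokens.pop(0)        # drop the ')'
            return value
        return int(tok)

    print(parse_expr(tokenize("1 + 2 * (3 + 4)")))   # 15

Notice how much simpler each parse_* function is because it only ever sees
whole tokens, never raw characters.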

Hopefully that captures the essence if not the actual facts.
But then I'm trying to compress one year of computer science study
into this email :-)

If you are interested I just wrote a little PDA class for Python.

Yours,
Noah