lexical analysis of python

Paul McGuire ptmcg at austin.rr.com
Wed Mar 11 02:38:53 CET 2009

On Mar 10, 8:31 pm, robert.mull... at gmail.com wrote:
> I am trying to implement a lexer and parser for a subset of Python
> using lexer and parser generators. (It doesn't matter, but I happen to
> be using ocamllex and ocamlyacc.) I've run into the following annoying
> problem and am hoping someone can tell me what I'm missing. Lexers
> generated by such tools return tokens in a stream as they consume the
> input text. But Python's indentation appears to require interruption of
> that stream. For example, in:
> def f(x):
>         statement1;
>         statement2;
>               statement3;
>               statement4;
> A
> Between the '\n' at the end of statement4 and the A, a lexer for
> Python should return 2 DEDENT tokens. But there is no way to interject
> two DEDENT tokens within the token stream between the tokens for
> NEWLINE and A.  The generated lexer doesn't have any way to freeze the
> input text pointer.
> Does this mean that Python lexers are all written by hand? If not, how
> do you do it using your favorite lexer generator?
> Thanks!
> Bob Muller

In pyparsing's indentedBlock expression/helper, I keep a stack of
column numbers representing indent levels.  When the indent column of
a line is less than the column number at the top of the stack, I pop
levels off the stack until the top matches the new indent column, and
count one DEDENT for each level popped.  If the top of the stack ends
up less than the line's column without ever matching it, then I know
that this is an illegal indent (doesn't line up with any previous
indent).  Also, when computing the column number, be wary of tab
handling: a tab advances to the next tab stop, so it cannot simply be
counted as one column.
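
As a rough illustration of the stack approach described above, here is
a minimal sketch in Python.  It is not pyparsing's actual indentedBlock
code; the function name indent_tokens, the tabsize parameter, and the
LINE token kind are made up for the example, and it assumes tab stops
every 8 columns and treats each non-blank line as a single LINE token
instead of lexing its contents.

```python
def indent_tokens(lines, tabsize=8):
    """Yield (kind, text) pairs, adding INDENT/DEDENT from a column stack.

    Hypothetical helper for illustration only.
    """
    stack = [0]  # indent columns of the blocks that are currently open
    for lineno, raw in enumerate(lines, 1):
        # Assumption: tabs advance to the next multiple of `tabsize`.
        line = raw.expandtabs(tabsize)
        body = line.strip()
        if not body:
            continue  # blank lines do not change the indent level
        col = len(line) - len(line.lstrip(" "))
        if col > stack[-1]:
            # Deeper than the current block: open one new level.
            stack.append(col)
            yield ("INDENT", "")
        else:
            # Shallower: emit one DEDENT per level popped off the stack.
            while col < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if col != stack[-1]:
                # Popped past the column without matching it.
                raise IndentationError(
                    f"line {lineno}: indent does not match any outer level"
                )
        yield ("LINE", body)
        yield ("NEWLINE", "")
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Because this is a generator, the two DEDENTs before "A" in the quoted
example simply come out as two consecutive tokens ahead of the token
for A.  With a generated lexer, one way to get the same effect is
probably a small wrapper between lexer and parser that keeps a queue
of pending tokens and drains it before asking the generated lexer for
more input.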

-- Paul
