lexical analysis of python
ptmcg at austin.rr.com
Wed Mar 11 02:38:53 CET 2009
On Mar 10, 8:31 pm, robert.mull... at gmail.com wrote:
> I am trying to implement a lexer and parser for a subset of python
> using lexer and parser generators. (It doesn't matter, but I happen to
> be using ocamllex and ocamlyacc.) I've run into the following annoying
> problem and am hoping someone can tell me what I'm missing. Lexers
> generated by such tools return tokens in a stream as they consume the
> input text.
> But python's indentation appears to require interruption of that
> stream. For example, in:
> def f(x):
> Between the '\n' at the end of statement4 and the A, a lexer for
> Python should return 2 DEDENT tokens. But there is no way to interject
> two DEDENT tokens within the token stream between the tokens for
> NEWLINE and A. The generated lexer doesn't have any way to freeze the
> input text pointer.
> Does this mean that python lexers are all written by hand? If not, how
> do you do it using your favorite lexer generator?
> Bob Muller
In pyparsing's indentedBlock expression/helper, I keep a stack of
column numbers representing indent levels. When the indent column of a
line is less than the column number at the top of the stack, I count
one DEDENT for each level that I need to pop off the stack before I
reach the new indent column. If popping leaves a column number on top
of the stack that is less than the new indent column, then I know that
this is an illegal indent (it doesn't line up with any previous
indent). Also, when computing the column number, be wary of tab
handling, since a tab can span several columns.
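
Here is a minimal sketch of that indent-stack idea in plain Python. It is
not pyparsing's actual code; the names indent_column and indent_tokens are
made up for illustration, and the tab stop of 8 is an assumption that
follows the rule CPython's own tokenizer uses. Blank lines are skipped, and
comments and line continuations are ignored to keep it short.

```python
def indent_column(line, tabsize=8):
    """Return the column of the first non-blank character (hypothetical helper).

    Assumes a tab advances to the next multiple of tabsize.
    """
    col = 0
    for ch in line:
        if ch == " ":
            col += 1
        elif ch == "\t":
            col = (col // tabsize + 1) * tabsize
        else:
            break
    return col


def indent_tokens(lines):
    """Yield (kind, text) pairs, where kind is INDENT, DEDENT, or LINE."""
    stack = [0]  # indent columns of the currently open blocks
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # blank lines do not change the indent level
        col = indent_column(line)
        if col > stack[-1]:
            stack.append(col)
            yield ("INDENT", "")
        else:
            # One DEDENT per level popped off the stack.
            while col < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            # After popping, the top must match exactly, else the indent is illegal.
            if col != stack[-1]:
                raise IndentationError(
                    "line %d: unindent does not match any outer level" % lineno
                )
        yield ("LINE", line.strip())
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")
```

Because the generator can emit several DEDENTs before it moves on to the
next line, the same effect could probably be had with a generated lexer by
wrapping it in a small layer that holds a queue of pending tokens and
drains that queue before asking the lexer for more input.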