[Python-Dev] Python syntax checker ?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Sep 2000 18:39:22 +0200

Greg Ward wrote:
> On 20 September 2000, M.-A. Lemburg said:
> > Would it be possible to write a Python syntax checker that doesn't
> > stop processing at the first error it finds but instead tries
> > to continue as far as possible (much like make -k) ?
> >
> > If yes, could the existing Python parser/compiler be reused for
> > such a tool ?
>
> From what I understand of Python's parser and parser generator, no.
> Recovering from errors is indeed highly non-trivial.  If you're really
> interested, I'd look into Terence Parr's ANTLR -- it's a very fancy
> parser generator that's waaay ahead of pgen (or lex/yacc, for that
> matter).  ANTLR 2.x is highly Java-centric, and AFAIK doesn't yet have a
> C backend (grumble) -- just C++ and Java.  (Oh wait, the antlr.org web
> site says it can generate Sather too -- now there's an important
> mainstream language!  ;-)

Thanks, I'll have a look.

> Tech notes: like pgen, ANTLR is LL; it generates a recursive-descent
> parser.  Unlike pgen, ANTLR is LL(k) -- it can support arbitrary
> lookahead, although k>2 can make parser generation expensive (not
> parsing itself, just turning your grammar into code), as well as make
> your language harder to understand.  (I have a theory that pgen's k=1
> limitation has been a brick wall in the way of making Python's syntax
> more complex, i.e. it's a *feature*!)
>
> More importantly, ANTLR has good support for error recovery.  My BibTeX
> parser has a lot of fun recovering from syntax errors, and (with a
> little smoke 'n mirrors magic in the lexing stage) does a pretty good
> job of it.  But you're right, it's *not* trivial to get this stuff
> right.  And without support from the parser generator, I suspect you
> would be in a world of hurtin'.

I was actually thinking of extracting the Python tokenizer and
parser from the Python source and tweaking them until they did
what I wanted them to do, i.e. not generate valid code but produce
valid error messages ;-)

Now, from the feedback I got, it seems that this is not the
right approach. I'm not even sure whether using a parser
at all is the right way... I may have to stick to a fairly
general tokenizer and then try to solve the problem in chunks
of code (much like what Guido hinted at in his reply), possibly
even by doing trial and error using the Python builtin compiler
on these chunks.
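
To make the chunk idea concrete, here is a rough sketch of how it might
look, assuming a current Python 3 interpreter rather than anything from
the source tree discussed above. The names split_chunks and check_syntax
are made up for illustration, and the splitting rule is only a heuristic:
it cuts the source at lines starting in column 0 and will be fooled by
multi-line strings or continuation lines that begin in column 0.

```python
import re

# Keywords that continue the previous compound statement instead of
# starting a new top-level chunk.
_CONTINUATIONS = ("else", "elif", "except", "finally")


def split_chunks(source):
    """Split source into (first_line_number, text) top-level chunks.

    Heuristic (an assumption of this sketch): a chunk starts at any line
    with code in column 0, unless that line is a continuation keyword,
    a closing bracket, or directly follows a decorator line.
    """
    chunks = []
    current, start = [], 1
    for lineno, line in enumerate(source.splitlines(), 1):
        first_word = re.match(r"[A-Za-z_]*", line).group()
        starts_statement = (
            line[:1] not in ("", " ", "\t", "#", ")", "]", "}")
            and first_word not in _CONTINUATIONS
        )
        if starts_statement and current and not current[-1].startswith("@"):
            chunks.append((start, "\n".join(current)))
            current, start = [], lineno
        current.append(line)
    if current:
        chunks.append((start, "\n".join(current)))
    return chunks


def check_syntax(source, filename="<input>"):
    """Compile each chunk separately and collect every syntax error.

    Returns a list of (line_number, message) tuples, with line numbers
    relative to the whole source, not to the chunk.
    """
    errors = []
    for start, text in split_chunks(source):
        try:
            compile(text, filename, "exec")
        except SyntaxError as exc:
            # exc.lineno counts from the start of the chunk; shift it.
            errors.append((start + (exc.lineno or 1) - 1, exc.msg))
    return errors


if __name__ == "__main__":
    sample = "x = = 1\nif x:\n    y = 2\nelse:\n    y = 3\ndef f(:\n    pass\nz = 4\n"
    for line, message in check_syntax(sample):
        print("line %d: %s" % (line, message))
```

Because each chunk is compiled on its own, one broken statement does not
hide errors in later ones, which is the make -k behaviour asked about at
the start of the thread. The price is that only the first error inside
each chunk is reported.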

Oh well,
Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/