
Would it be possible to write a Python syntax checker that doesn't stop processing at the first error it finds but instead tries to continue as far as possible (much like make -k)?
In "Compilerbau", this is referred to as "Fehlerstabilisierung" (error stabilization). I suggest having a look at the dragon book (Aho, Sethi, Ullman).

The common approach is to insert or remove tokens, using some heuristics. In YACC, it is possible to add error productions to the grammar. Whenever an error occurs, the parser assigns all tokens to the "error" non-terminal until it concludes that it can perform a reduce action.

A similar approach might work for the Python Grammar. For each production, you'd define a set of stabilization tokens. If these are encountered, then the rule would be considered complete. Everything is consumed until a stabilization token is found. For example, all expressions could be stabilized with a keyword. I.e. if you encounter a syntax error inside an expression, you ignore all tokens until you see 'print', 'def', 'while', etc.

In some cases, it may be better to add input rather than removing it. For example, if you get an "inconsistent dedent" error, you could assume that this really was a consistent dedent, or you could assume it was not meant as a dedent at all. Likewise, if you get a single-quote start-of-string, with no single-quote until end-of-line, you should just assume there was one.

Adding error productions to ignore input until stabilization may be feasible on top of the existing parser. Adding tokens in the right place is probably harder - I'd personally go for a pure Python solution that operates on Grammar/Grammar.

Regards,
Martin
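For illustration, here is a minimal sketch of the stabilization idea in pure Python, working line by line with compile() rather than on Grammar/Grammar. The function name, the keyword set and the restart-at-column-zero rule are illustrative assumptions, not anything taken from the real parser.

    # Panic-mode recovery sketch: on a SyntaxError, record it, skip ahead to
    # the next column-zero line that starts with a "stabilization" keyword,
    # and try to compile the rest of the file from there.  The keyword set
    # below is an arbitrary illustrative choice (echoing the 'print', 'def',
    # 'while' example above), not anything taken from the real grammar.
    STABILIZERS = {"print", "def", "class", "if", "while", "for", "try",
                   "import", "from"}

    def check_with_recovery(source, filename="<string>"):
        """Collect as many SyntaxErrors as possible instead of stopping at one."""
        errors = []
        lines = source.splitlines(keepends=True)
        start = 0
        while start < len(lines):
            try:
                compile("".join(lines[start:]), filename, "exec")
                break                                # the rest is clean
            except SyntaxError as e:
                bad = start + (e.lineno or 1) - 1    # 0-based offending line
                errors.append((bad + 1, e.msg))
                start = bad + 1                      # resynchronize below
                while start < len(lines):
                    words = lines[start].split()
                    if (words and not lines[start][0].isspace()
                            and words[0].rstrip(":") in STABILIZERS):
                        break
                    start += 1
        return errors

    # Hypothetical usage -- "broken.py" is just a placeholder name:
    # for lineno, msg in check_with_recovery(open("broken.py").read()):
    #     print("broken.py:%d: %s" % (lineno, msg))

Token-level resynchronization, as described above, could recover inside a statement as well; this line-level sketch only restarts at the next top-level statement, so it will miss some errors.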

On Wed, Sep 20, 2000 at 07:07:06PM +0200, Martin von Loewis wrote:
Adding error productions to ignore input until stabilization may be feasible on top of the existing parser. Adding tokens in the right place is probably harder - I'd personally go for a pure Python solution that operates on Grammar/Grammar.
Don't forget that there are two kinds of SyntaxErrors in Python: those that are generated by the tokenizer/parser, and those that are actually generated by the (bytecode) compiler. (Inconsistent indent/dedent errors, incorrect uses of (augmented) assignment, incorrect placing of particular keywords, etc., are all generated while actually compiling the code.)

Also, in order to be really useful, the error indicator would have to be pretty intelligent. Imagine something like this:

    if 1:
            doodle()
        forever()
        and_ever()
        <tons more code using 4-space indent>

With the current interpreter, that would generate a single warning, on the line below the one that is the actual problem. If you continue searching for errors, you'll get tons and tons of errors, all because the first line was indented too far.

An easy way to work around it is probably to consider all tokenizer errors and some of the compiler-generated errors (like the indent/dedent ones) as really fatal errors, and only handle the errors that are likely to be manageable, skipping over the affected lines or considering them no-ops.

--
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
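To make the two error sources concrete, here is a small sketch written against a present-day tokenize module and compile(); the "really fatal" vs. "skippable" split is only the heuristic suggested above, and classify_error and its test inputs are made up for illustration.

    # Classify a syntax problem: errors raised while merely tokenizing
    # (unterminated strings, inconsistent dedents) are treated as "really
    # fatal"; errors that only the compiler raises are candidates for
    # skip-and-continue handling.
    import io
    import tokenize

    def classify_error(source, filename="<string>"):
        try:
            list(tokenize.generate_tokens(io.StringIO(source).readline))
        except (tokenize.TokenError, SyntaxError) as e:
            return ("tokenizer", e)      # really fatal: give up here
        try:
            compile(source, filename, "exec")
        except SyntaxError as e:
            return ("compiler", e)       # maybe skip the line and go on
        return ("ok", None)

    print(classify_error('s = """never closed\n'))              # unterminated string: tokenizer
    print(classify_error("if 1:\n        x = 1\n    y = 2\n"))  # inconsistent dedent: tokenizer
    print(classify_error("return 1\n"))                         # misplaced keyword: compiler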

Martin von Loewis wrote:
Would it be possible to write a Python syntax checker that doesn't stop processing at the first error it finds but instead tries to continue as far as possible (much like make -k)?
In "Compilerbau", this is referred to as "Fehlerstabilisierung" (error stabilization). I suggest having a look at the dragon book (Aho, Sethi, Ullman).
The common approach is to insert or remove tokens, using some heuristics. In YACC, it is possible to add error productions to the grammar. Whenever an error occurs, the parser assigns all tokens to the "error" non-terminal until it concludes that it can perform a reduce action.
A similar approach might work for the Python Grammar. For each production, you'd define a set of stabilization tokens. If these are encountered, then the rule would be considered complete. Everything is consumed until a stabilization token is found.
For example, all expressions could be stabilized with a keyword. I.e. if you encounter a syntax error inside an expression, you ignore all tokens until you see 'print', 'def', 'while', etc.
In some cases, it may be better to add input rather than removing it. For example, if you get an "inconsistent dedent" error, you could assume that this really was a consistent dedent, or you could assume it was not meant as a dedent at all. Likewise, if you get a single-quote start-of-string, with no single-quote until end-of-line, you should just assume there was one.
Adding error productions to ignore input until stabilization may be feasible on top of the existing parser. Adding tokens in the right place is probably harder - I'd personally go for a pure Python solution that operates on Grammar/Grammar.
I think I'd prefer a Python solution too -- perhaps I could start out with tokenize.py and muddle along that way. pylint from Aaron Watters should also provide some inspiration.

Thanks,
--
Marc-Andre Lemburg

Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Martin von Loewis wrote:
Would it be possible to write a Python syntax checker that doesn't stop processing at the first error it finds but instead tries to continue as far as possible (much like make -k)?
The common approach is to insert or remove tokens, using some heuristics. In YACC, it is possible to add error productions to the grammar. Whenever an error occurs, the parser assigns all tokens to the "error" non-terminal until it concludes that it can perform a reduce action.
The following is based on trying (a great learning experience) to write a better Python lint.

There are IMHO two problems with the current Python grammar file. It is not possible to express operator precedence, so deliberate shift/reduce conflicts are used instead. That makes the parse tree complicated and non intuitive. And there is no provision for error productions. YACC has both of these as built-in features.

I also found speed problems with tokenize.py. AFAIK, it only exists because tokenizer.c does not provide comments as tokens, but eats them instead. We could modify tokenizer.c, then make tokenize.py be the interface to the fast C tokenizer. This eliminates the problem of updating both too.

So how about re-writing the Python grammar in YACC in order to use its more advanced features?? The simple YACC grammar I wrote for 1.5.2 plus an altered tokenizer.c parsed the whole Lib/*.py in a couple seconds vs. 30 seconds for the first file using Aaron Watters' Python lint grammar written in Python.

JimA
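As a quick illustration of the precedence-by-nesting point, on a Python that still ships the old parser module (it was removed in 3.10; this snippet is only to show the shape of the tree, not part of the proposal above):

    # Grammar/Grammar encodes operator precedence as one grammar rule per
    # level, so even a tiny expression comes back buried under layers of
    # test/and_test/comparison/arith_expr/term/factor/power nodes.
    import parser
    import pprint

    pprint.pprint(parser.expr("1 + 2 * 3").tolist())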

"JCA" == James C Ahlstrom <jim@interet.com> writes:
JCA> So how about re-writing the Python grammar in YACC in
JCA> order to use its more advanced features?? The simple
JCA> YACC grammar I wrote for 1.5.2 plus an altered tokenizer.c
JCA> parsed the whole Lib/*.py in a couple seconds vs. 30
JCA> seconds for the first file using Aaron Watters' Python
JCA> lint grammar written in Python.

I've been wanting to check out Antlr (www.antlr.org) because it gives us the /possibility/ to use the same grammar files for both CPython and JPython. One problem though is that it generates Java and C++ so we'd be accepting our first C++ into the core if we went this route.

-Barry

On 25 September 2000, Barry A. Warsaw said:
I've been wanting to check out Antlr (www.antlr.org) because it gives us the /possibility/ to use the same grammar files for both CPython and JPython. One problem though is that it generates Java and C++ so we'd be accepting our first C++ into the core if we went this route.
Or contribute a C back-end to ANTLR -- I've been toying with this idea for, ummm, too damn long now. Years. Greg

"GW" == Greg Ward <gward@mems-exchange.org> writes:
GW> Or contribute a C back-end to ANTLR -- I've been toying with
GW> this idea for, ummm, too damn long now. Years.

Yes (to both :).
"NS" == Neil Schemenauer <nascheme@enme.ucalgary.ca> writes:
NS> How different are PCCTS and ANTLR? Perhaps we could use PCCTS
NS> for CPython and ANTLR for JPython.

Unknown. It would only make sense if the same grammar files could be fed to each. I have no idea whether that's true or not. If not, Greg's idea is worth researching.

-Barry

On 25 September 2000, Barry A. Warsaw said:
NS> How different are PCCTS and ANTLR? Perhaps we could use PCCTS
NS> for CPython and ANTLR for JPython.
Unknown. It would only make sense if the same grammar files could be fed to each. I have no idea whether that's true or not. If not, Greg's idea is worth researching.
PCCTS 1.x grammar files tend to have lots of C code interwoven in them -- at least for tricky, ill-defined grammars like BibTeX. ;-) ANTLR 2.x grammars certainly allow Java code to be woven into them; I assume you can instead weave C++ or Sather if that's your preference. Obviously, this would be one problem with having a common grammar for JPython and CPython. Greg

"Barry A. Warsaw" wrote:
I've been wanting to check out Antlr (www.antlr.org) because it gives us the /possibility/ to use the same grammar files for both CPython and JPython. One problem though is that it generates Java and C++ so we'd be accepting our first C++ into the core if we went this route.
Yes, but why not YACC? Is Antlr so much better, or is YACC too primitive, or what? IMHO, adding C++ just for parsing is not going to happen, so Antlr is not going to happen either.

JimA

Yes, but why not YACC? Is Antlr so much better, or is YACC too primitive, or what? IMHO, adding C++ just for parsing is not going to happen, so Antlr is not going to happen either.
I think the advantage that Barry saw is that ANTLR generates Java in addition to C, so it could be used in JPython as well. In addition, ANTLR is more advanced than YACC; it specifically supports full EBNF as input, and has better mechanisms for conflict resolution.

On the YACC-for-Java side, Axel Schreiner has developed jay, see http://www2.informatik.uni-osnabrueck.de/bernd/jay/staff/design/de/Artikel.h... (if you read German, otherwise don't bother :-). The main problem with multilanguage output is the semantic actions - it would be quite a stunt to put semantic actions into the parser which are valid both in C and Java :-) On that front, there is also CUP (http://www.cs.princeton.edu/~appel/modern/java/CUP/), which has different markup for Java actions ({: ... :}). There is also BYACC/J, a patch to Berkeley Yacc to produce Java (http://www.lincom-asg.com/~rjamison/byacc/).

Personally, I'm quite in favour of having the full parser source (including parser generator if necessary) in the Python source distribution. As a GCC contributor, I know what a pain it is for users that GCC requires bison to build - even though it is only required for CVS builds, as distributions come with the generated files.

Regards,
Martin

On 25 September 2000, Martin von Loewis said:
Personally, I'm quite in favour of having the full parser source (including parser generator if necessary) in the Python source distribution. As a GCC contributor, I know what a pain it is for users that GCC requires bison to build - even though it is only required for CVS builds, as distributions come with the generated files.
This would be a strike against ANTLR, since it's written in Java -- and therefore is about as portable as a church. ;-( It should be possible to generate good, solid, portable C code... but AFAIK no one has done so to date with ANTLR 2.x. Greg

Martin von Loewis wrote:
Yes, but why not YACC? Is Antlr so much better, or is YACC too primitive, or what?
I think the advantage that Barry saw is that ANTLR generates Java in addition to C, so it could be used in JPython as well. In addition, ANTLR is more advanced than YACC; it specifically supports full EBNF as input, and has better mechanisms for conflict resolution.
Oh, OK. Thanks.
Personally, I'm quite in favour of having the full parser source (including parser generator if necessary) in the Python source distribution. As a GCC contributor, I know what a pain it is for users that GCC requires bison to build - even though it is only required for CVS builds, as distributions come with the generated files.
I see your point, but the practical solution that we can do today is to use YACC, bison, and distribute the generated parser files.

Jim

"JCA" == James C Ahlstrom <jim@interet.com> writes:
Personally, I'm quite in favour of having the full parser source (including parser generator if necessary) in the Python source distribution. As a GCC contributor, I know what a pain it is for users that GCC requires bison to build - even though it is only required for CVS builds, as distributions come with the generated files.
JCA> I see your point, but the practical solution that we can do
JCA> today is to use YACC, bison, and distribute the generated
JCA> parser files.

I don't understand what problem this is a practical solution to. This thread started with MAL's questions about finding errors in Python code. You mentioned an effort to write a lint-like tool. It may be that YACC has great support for error recovery, in which case MAL might want to look at it for his tool.

But in general, the most practical solution for parsing Python is probably to use the Python parser and the builtin parser module. It already exists and seems to work just fine.

Jeremy
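For reference, the no-recovery check being described here is only a few lines with the builtin parser module as it shipped at the time (it survived up to Python 3.9; compile() does the same job on current Pythons). The check_file name and the report format are just illustrative.

    # Check one file with the builtin parser: a single call, and a single
    # SyntaxError on the first problem -- which is exactly the limitation
    # MAL started the thread about.
    import parser
    import sys

    def check_file(filename):
        source = open(filename).read()
        try:
            parser.suite(source)        # builds the concrete parse tree
        except SyntaxError as e:
            print("%s: %s" % (filename, e))
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(check_file(sys.argv[1]))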

I don't understand what problem this is a practical solution to. This thread started with MAL's questions about finding errors in Python code. You mentioned an effort to write a lint-like tool. It may be that YACC has great support for error recovery, in which case MAL might want to look at for his tool.
But in general, the most practical solution for parsing Python is probably to use the Python parser and the builtin parser module. It already exists and seems to work just fine.
Probably not that relevant any more, but MAL originally asked for a parser that doesn't stop at the first error. That's a real weakness of the existing parser!!!

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

Guido:
MAL originally asked for a parser that doesn't stop at the first error. That's a real weakness of the existing parser!!!
Is it really worth putting a lot of effort into this? In my experience, the vast majority of errors I get from Python are run-time errors, not parse errors. (If you could find multiple run-time errors in one go, *that* would be an impressive trick!)

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

On Tue, 26 Sep 2000, Greg Ewing wrote:
Guido:
MAL originally asked for a parser that doesn't stop at the first error. That's a real weakness of the existing parser!!!
Is it really worth putting a lot of effort into this?
It might be if you were trying to develop an IDE that could syntactically analyse what the user was typing, even if he/she had left a half-finished expression further up in the buffer (I'd kind of assumed this was the goal). So you're not continuing after errors, exactly, more like unfinishednesses (or some better word...).

I guess one approach to this would be to divide up the buffer according to indentation and then parse each block, as delimited by the indentation, individually.

Two random points:

1) Triple-quoted strings are going to be a problem.

2) Has anyone gotten flex to tokenize Python? I was looking at the manual yesterday and it didn't look impossible, although a bit tricky.

Cheers,
M.
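A rough sketch of the divide-by-indentation idea: split the buffer at column-zero lines and check each top-level block on its own, so one broken block doesn't mask the others. It punts entirely on point 1 above (triple-quoted strings spanning blocks), treats every column-zero line as a block start (an oversimplification for multi-line expressions), and the function names are made up for illustration.

    # Split a buffer into top-level blocks (a column-zero line plus the
    # indented lines that follow it) and syntax-check each block separately,
    # collecting one error per broken block instead of stopping at the first.
    def split_top_level_blocks(source):
        blocks, current = [], []
        for line in source.splitlines(keepends=True):
            starts_new = line.strip() and not line[0].isspace()
            if starts_new and current:
                blocks.append(current)
                current = []
            current.append(line)
        if current:
            blocks.append(current)
        return blocks

    def check_blocks(source, filename="<buffer>"):
        errors = []
        lineno = 1
        for block in split_top_level_blocks(source):
            try:
                compile("".join(block), filename, "exec")
            except SyntaxError as e:
                errors.append((lineno + (e.lineno or 1) - 1, e.msg))
            lineno += len(block)
        return errors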

By the way, one of the examples that comes with my Plex module is an almost-complete Python scanner. Just thought I'd mention it in case it would help anyone.

http://www.cosc.canterbury.ac.nz/~greg/python/Plex

Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

Jeremy Hylton wrote:
I don't understand what problem this is a practical solution to.
To recover from errors better by using YACC's built-in error recovery features. Maybe unifying the C and Java parsers; I admit I don't know how J-Python parses Python. I kind of threw in my objection to tokenize.py, which should be combined with tokenizer.c. Of course it is work that only results in the same operation as before, but it reduces the code base. Not a popular project.
But in general, the most practical solution for parsing Python is probably to use the Python parser and the builtin parser module. It already exists and seems to work just fine.
A very good point. I am not 100% sure it is worth it. But I found the current parser unworkable for my project.

JimA

"JCA" == James C Ahlstrom <jim@interet.com> writes:
JCA> To recover from errors better by using YACC's built-in error
JCA> recovery features. Maybe unifying the C and Java parsers. I
JCA> admit I don't know how J-Python parses Python.

It uses JavaCC. http://www.metamata.com/javacc/

-Barry

"JCA" == James C Ahlstrom <jim@interet.com> writes:
JCA> The following is based on trying (a great learning experience)
JCA> to write a better Python lint.

JCA> There are IMHO two problems with the current Python grammar
JCA> file. It is not possible to express operator precedence, so
JCA> deliberate shift/reduce conflicts are used instead. That makes
JCA> the parse tree complicated and non intuitive. And there is no
JCA> provision for error productions. YACC has both of these as
JCA> built-in features.

JCA> I also found speed problems with tokenize.py. AFAIK, it only
JCA> exists because tokenizer.c does not provide comments as tokens,
JCA> but eats them instead. We could modify tokenizer.c, then make
JCA> tokenize.py be the interface to the fast C tokenizer. This
JCA> eliminates the problem of updating both too.

JCA> So how about re-writing the Python grammar in YACC in order to
JCA> use its more advanced features?? The simple YACC grammar I
JCA> wrote for 1.5.2 plus an altered tokenizer.c parsed the whole
JCA> Lib/*.py in a couple seconds vs. 30 seconds for the first file
JCA> using Aaron Watters' Python lint grammar written in Python.

Why not use the actual Python parser instead of tokenize.py? I assume it is also faster than Aaron's Python lint grammar written in Python.

The compiler in Tools/compiler uses the parser module internally and produces an AST that is straightforward to use. (The parse tree produced by the parser module is fairly low-level.) There was a thread (on the compiler-sig, I believe) where Moshe and I noodled with a simple lint-like warnings framework based on the compiler package. I don't have the code we ended up with, but I found an example checker in the mail archives and have included it below. It checks for NameErrors.

I believe one useful change that Moshe and I arrived at was to avoid the explicit stack that the code uses (via enterNamespace and exitNamespace) and instead pass the namespace as an optional extra argument to the visitXXX methods.
Jeremy

"""Check for NameErrors"""

from compiler import parseFile, walk
from compiler.misc import Stack, Set
import __builtin__
from UserDict import UserDict

class Warning:
    def __init__(self, filename, funcname, lineno):
        self.filename = filename
        self.funcname = funcname
        self.lineno = lineno
    def __str__(self):
        return self._template % self.__dict__

class UndefinedLocal(Warning):
    super_init = Warning.__init__
    def __init__(self, filename, funcname, lineno, name):
        self.super_init(filename, funcname, lineno)
        self.name = name
    _template = "%(filename)s:%(lineno)s " \
                "%(funcname)s undefined local %(name)s"

class NameError(UndefinedLocal):
    _template = "%(filename)s:%(lineno)s " \
                "%(funcname)s undefined name %(name)s"

class NameSet(UserDict):
    """Track names and the line numbers where they are referenced"""
    def __init__(self):
        self.data = self.names = {}
    def add(self, name, lineno):
        l = self.names.get(name, [])
        l.append(lineno)
        self.names[name] = l

class CheckNames:
    def __init__(self, filename):
        self.filename = filename
        self.warnings = []
        self.scope = Stack()
        self.gUse = NameSet()
        self.gDef = NameSet()
        # _locals is the stack of local namespaces
        # locals is the top of the stack
        self._locals = Stack()
        self.lUse = None
        self.lDef = None
        self.lGlobals = None # var declared global
        # holds scope,def,use,global triples for later analysis
        self.todo = []

    def enterNamespace(self, node):
        self.scope.push(node)
        self.lUse = use = NameSet()
        self.lDef = _def = NameSet()
        self.lGlobals = gbl = NameSet()
        self._locals.push((use, _def, gbl))

    def exitNamespace(self):
        self.todo.append((self.scope.top(), self.lDef, self.lUse,
                          self.lGlobals))
        self.scope.pop()
        self._locals.pop()
        if self._locals:
            self.lUse, self.lDef, self.lGlobals = self._locals.top()
        else:
            self.lUse = self.lDef = self.lGlobals = None

    def warn(self, warning, funcname, lineno, *args):
        args = (self.filename, funcname, lineno) + args
        self.warnings.append(apply(warning, args))

    def defName(self, name, lineno, local=1):
        if self.lUse is None:
            self.gDef.add(name, lineno)
        elif local == 0:
            self.gDef.add(name, lineno)
            self.lGlobals.add(name, lineno)
        else:
            self.lDef.add(name, lineno)

    def useName(self, name, lineno, local=1):
        if self.lUse is None:
            self.gUse.add(name, lineno)
        elif local == 0:
            self.gUse.add(name, lineno)
            self.lUse.add(name, lineno)
        else:
            self.lUse.add(name, lineno)

    def check(self):
        for s, d, u, g in self.todo:
            self._check(s, d, u, g, self.gDef)
        # XXX then check the globals

    def _check(self, scope, _def, use, gbl, globals):
        # check for NameError
        # a name is defined iff it is in def.keys()
        # a name is global iff it is in gdefs.keys()
        gdefs = UserDict()
        gdefs.update(globals)
        gdefs.update(__builtin__.__dict__)
        defs = UserDict()
        defs.update(gdefs)
        defs.update(_def)
        errors = Set()
        for name in use.keys():
            if not defs.has_key(name):
                firstuse = use[name][0]
                self.warn(NameError, scope.name, firstuse, name)
                errors.add(name)

        # check for UndefinedLocalNameError
        # order == use & def sorted by lineno
        # elements are lineno, flag, name
        # flag = 0 if use, flag = 1 if def
        order = []
        for name, lines in use.items():
            if gdefs.has_key(name) and not _def.has_key(name):
                # this is a global ref, we can skip it
                continue
            for lineno in lines:
                order.append((lineno, 0, name))
        for name, lines in _def.items():
            for lineno in lines:
                order.append((lineno, 1, name))
        order.sort()
        # ready contains names that have been defined or warned about
        ready = Set()
        for lineno, flag, name in order:
            if flag == 0: # use
                if not ready.has_elt(name) and not errors.has_elt(name):
                    self.warn(UndefinedLocal,
                              scope.name, lineno, name)
                    ready.add(name) # don't warn again
            else:
                ready.add(name)

    # below are visitor methods
    def visitFunction(self, node, noname=0):
        for expr in node.defaults:
            self.visit(expr)
        if not noname:
            self.defName(node.name, node.lineno)
        self.enterNamespace(node)
        for name in node.argnames:
            self.defName(name, node.lineno)
        self.visit(node.code)
        self.exitNamespace()
        return 1

    def visitLambda(self, node):
        return self.visitFunction(node, noname=1)

    def visitClass(self, node):
        for expr in node.bases:
            self.visit(expr)
        self.defName(node.name, node.lineno)
        self.enterNamespace(node)
        self.visit(node.code)
        self.exitNamespace()
        return 1

    def visitName(self, node):
        self.useName(node.name, node.lineno)

    def visitGlobal(self, node):
        for name in node.names:
            self.defName(name, node.lineno, local=0)

    def visitImport(self, node):
        for name, alias in node.names:
            self.defName(alias or name, node.lineno)

    visitFrom = visitImport

    def visitAssName(self, node):
        self.defName(node.name, node.lineno)

def check(filename):
    global p, checker
    p = parseFile(filename)
    checker = CheckNames(filename)
    walk(p, checker)
    checker.check()
    for w in checker.warnings:
        print w

if __name__ == "__main__":
    import sys
    # XXX need to do real arg processing
    check(sys.argv[1])
participants (10)

- bwarsaw@beopen.com
- Greg Ewing
- Greg Ward
- Guido van Rossum
- James C. Ahlstrom
- Jeremy Hylton
- M.-A. Lemburg
- Martin von Loewis
- Michael Hudson
- Thomas Wouters