[Python-ideas] Hooking between lexer and parser

Sat Jun 6 07:29:21 CEST 2015

On Sat, Jun 6, 2015 at 1:00 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 6 June 2015 at 12:21, Neil Girdhar <mistersheik at gmail.com> wrote:
> > I'm curious what other people will contribute to this discussion as I
> > think having no great parsing library is a huge hole in Python.  Having
> > one would definitely allow me to write better utilities using Python.
>
> The design of *Python's* grammar is deliberately restricted to being
> parsable with an LL(1) parser. There are a great many static analysis
> and syntax highlighting tools that are able to take advantage of that
> simplicity because they only care about the syntax, not the full
> semantics.
>

Given the validation that happens, it's not actually LL(1) though.  It's
mostly LL(1) with some syntax errors that are raised for various illegal
constructs.
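
As a rough illustration of the point above (not code from the original thread): in the CPython of this era, a statement such as `f() = 1` fits the grammar rule for assignment and is only rejected by a later check. The helper name `is_rejected` is hypothetical, and which internal stage raises the error depends on the CPython version; the sketch only shows that the error surfaces as a SyntaxError.

```python
def is_rejected(source):
    """Return True if CPython refuses to compile `source` (hypothetical helper)."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return True
    return False


# An assignment to a call has the shape "expression = expression", yet it is
# refused, whereas an assignment to a plain name is accepted.
print(is_rejected("f() = 1"))
print(is_rejected("x = 1"))
```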

Anyway, no one is suggesting changing the grammar.

> Anyone actually doing their *own* parsing of something else *in*
> Python, would be better advised to reach for PLY
> (https://pypi.python.org/pypi/ply ). PLY is the parser underlying
> https://pypi.python.org/pypi/pycparser, and hence the highly regarded
> CFFI library, https://pypi.python.org/pypi/cffi
>
> Other notable parsing alternatives folks may want to look at include
> https://pypi.python.org/pypi/lrparsing and
> http://pythonhosted.org/pyparsing/ (both of which allow you to use
> Python code to define your grammar, rather than having to learn a
> formal grammar notation).
>
>
I looked at PLY and pyparsing, but I found it impossible to parse LaTeX
with them, because I couldn't tell the parser to consume the right number
of arguments given the name of the function.  When the parser sees a
function being defined, it learns how many arguments that function needs.
When it later sees a function call \a{1}{2}{3}, if "\a" takes 2 arguments,
then it should consume only 1 and 2 as arguments, and leave 3 as a regular
text token.  In other words, I should be able to tell the parser what to
expect, in code that lives on the rule edges.
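
A minimal sketch of the behaviour described above, written as a hand-rolled recursive-descent parser rather than with PLY or pyparsing. The names `ARITY`, `tokenize` and `parse` are hypothetical, and the arity table is filled in by hand here; a fuller version would update it whenever a definition is parsed.

```python
import re

# Hypothetical arity table: macro name -> number of brace arguments it takes.
ARITY = {"\\a": 2}

# A macro name, an escaped character, a brace, or a run of plain text.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|[{}]|[^\\{}]+")


def tokenize(text):
    return TOKEN_RE.findall(text)


def parse(tokens, pos=0, in_group=False):
    """Parse tokens from `pos`; return (nodes, next_pos)."""
    nodes = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "}":
            if in_group:
                return nodes, pos + 1
            raise SyntaxError("unmatched '}'")
        if tok == "{":
            group, pos = parse(tokens, pos + 1, in_group=True)
            nodes.append(("group", group))
        elif tok.startswith("\\"):
            pos += 1
            args = []
            # Consume exactly as many brace groups as the macro declares.
            for _ in range(ARITY.get(tok, 0)):
                if pos >= len(tokens) or tokens[pos] != "{":
                    raise SyntaxError("missing argument for " + tok)
                group, pos = parse(tokens, pos + 1, in_group=True)
                args.append(group)
            nodes.append(("call", tok, args))
        else:
            nodes.append(("text", tok))
            pos += 1
    if in_group:
        raise SyntaxError("unclosed '{'")
    return nodes, pos


tree, _ = parse(tokenize(r"\a{1}{2}{3}"))
print(tree)
```

With `\a` declared as taking 2 arguments, the call node holds the groups for 1 and 2, and `{3}` is left behind as an ordinary group that is not attached to the call.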

The parsing tools you listed work really well until you need to do
something like (1) the validation step that happens in Python, or (2)
figuring out exactly where the syntax error is (line and column number) or
(3) ensuring that whitespace separates some tokens even when it's not
required to disambiguate different parse trees.  I got the impression that
the authors wanted to make these tools simple for the simple cases, but
made them too simple, so they don't let you do everything in one pass.
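
For point (2), here is a small sketch of a hand-written tokenizer that carries line and column numbers with every token, so that an error can report exactly where it occurred. The token set, the `ParseError` class and the `tokenize` function are all assumptions made for illustration; this is not how PLY or pyparsing work internally.

```python
import re

# Hypothetical token set for a tiny expression language.
TOKEN_RE = re.compile(
    r"(?P<ws>\s+)|(?P<num>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+\-*/()=])"
)


class ParseError(Exception):
    def __init__(self, msg, line, col):
        super().__init__("%s at line %d, column %d" % (msg, line, col))
        self.line = line
        self.col = col


def tokenize(text):
    """Yield (kind, value, line, col) tuples; line and col are 1-based."""
    line, line_start, pos = 1, 0, 0
    while pos < len(text):
        col = pos - line_start + 1
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ParseError("unexpected character %r" % text[pos], line, col)
        if m.lastgroup == "ws":
            # Whitespace is skipped, but newlines inside it move the position.
            newlines = m.group().count("\n")
            if newlines:
                line += newlines
                line_start = m.start() + m.group().rindex("\n") + 1
        else:
            yield (m.lastgroup, m.group(), line, col)
        pos = m.end()


try:
    list(tokenize("x = 1\ny = $"))
except ParseError as exc:
    print(exc)
```

Because each token carries its own position, a parser built on top of this can attach the same line and column to errors it detects later, such as in a validation step.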

Best,

Neil

> Regards,
> Nick.
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>