[pypy-dev] Blog / Parser

Carl Friedrich Bolz cfbolz at gmx.de
Fri Jan 11 22:22:13 CET 2008


Hi Toby,

Toby Watson wrote:
 > I've just read the blog post, "Visualizing a Python tokenizer" and it
 > reminded me of this:
 >
 > "OMeta: an object oriented language for pattern matching"
 > http://www.cs.ucla.edu/~awarth/papers/dls07.pdf
 >
 > OMeta is an extension and generalisation of the idea of PEGs*. It
 > provides a nice way to describe a language both at the character level
 > (tokens), the grammar itself and productions into the AST. Finally the
 > grammars are extensible (possibly from within the language itself).

Yes, I know of OMeta and this paper. It contains some nice things, but
in my opinion there is also quite a bit of hype around it (and the paper
is at least missing a good "Prior work" section):

  - semantics-wise (apart from supporting left recursion) their parsers
    are no more powerful than simple recursive-descent parsers with
    backtracking

  - it's not clear that the memoization that packrat parsers do is a
    good idea for practical parsers in general. See this paper:
http://www.cs.mu.oz.au/research/mercury/information/papers.html#packrat-padl08

  - Prolog has supported matching on non-stream data structures for a
    long time _and_ is more powerful. It is also as efficient as packrat
    parsers if you use tabling (which admittedly is not a standard
    Prolog feature).

  - while I have not actually tried OMeta, I fear that debugging these
    programs when they behave unexpectedly is extremely annoying.
    Backtracking with first-match semantics can lead to unpleasant
    surprises, and it's not always easy to see what is going on.

So what is left is some nice syntax, the inheritance support (which you
can get rather easily with a manually written recursive-descent parser
too) and the left-recursion support.
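
To make the first-match pitfall concrete, here is a tiny sketch in
plain Python (my own toy combinators, nothing to do with OMeta's actual
API): with PEG-style ordered choice, the rule  A <- 'a' / 'ab'  followed
by end-of-input can never match the input "ab", because the choice
commits to the 'a' alternative and is never reconsidered.

```python
def literal(s):
    """Match the fixed string s at the current position."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def choice(*alternatives):
    """PEG ordered choice: commit to the first alternative that matches."""
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result   # first match wins; the rest are never tried
        return None
    return parse

def sequence(*parts):
    """Match all parts in order; fail (without retrying the choice) if any fails."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def end(text, pos):
    """Succeed only at end of input."""
    return pos if pos == len(text) else None

# The rule  A <- 'a' / 'ab'  followed by end-of-input:
rule = sequence(choice(literal('a'), literal('ab')), end)

print(rule('ab', 0))   # None: 'a' matched first, 'end' then failed,
                       # and the committed choice never tried 'ab'
print(rule('a', 0))    # 1: succeeds
```

With a CFG-style parser, "ab" would of course be accepted; here the
failure is silent, which is exactly the kind of surprise I mean.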

 > The implementation is discussed in "Packrat Parsers Can Support Left
 > Recursion" and there is some discussion of the performance there. 
http://www.vpri.org/pdf/packrat_TR-2007-002.pdf

I have only glanced at it; I guess I should read it in more detail.
FWIW, the left-recursion support is basically exactly what a tabling
Prolog does, just less general.
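
The core idea can be sketched in a few lines of Python (my own toy code,
not the VPRI implementation): seed the memo table with "failure" for the
left-recursive call, run the rule body, and keep feeding the result back
in until it stops growing. This is in spirit what tabling does too.

```python
def left_recursive(body):
    """Seed-growing support for a left-recursive rule.
    `body` gets the memoized rule itself as its first argument."""
    memo = {}                        # position -> best result so far
    def parse(text, pos):
        if pos in memo:
            return memo[pos]         # the left-recursive call hits the seed
        memo[pos] = None             # seed: assume failure at first
        while True:
            result = body(parse, text, pos)
            best = memo[pos]
            if result is None or (best is not None and result <= best):
                return best          # the parse stopped growing
            memo[pos] = result       # it grew: run the body again
    return parse

def num(text, pos):
    """Match a single digit."""
    return pos + 1 if pos < len(text) and text[pos].isdigit() else None

def expr_body(expr, text, pos):
    # expr <- expr '-' num / num   (left-recursive)
    left = expr(text, pos)
    if left is not None and left < len(text) and text[left] == '-':
        right = num(text, left + 1)
        if right is not None:
            return right
    return num(text, pos)

expr = left_recursive(expr_body)
print(expr('1-2-3', 0))   # 5: the whole input, parsed left-associatively
```

A naive recursive-descent parser would loop forever on this rule, since
expr calls itself before consuming any input.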

 > I wonder whether the same idea behind PyPy can be applied to the
 > grammar. Write a program in some language (a python version of OMeta
 > for instance) which is then transformed by the translator, compiler,
 > or JIT into something that runs fast.

Maybe, yes. FWIW, we support simple packrat parsers, see here:

http://codespeak.net/pypy/dist/pypy/doc/rlib.html#parsing

It's far from OMeta, just a fairly boring parser generator. It contains
good support for regular expressions, though, which is where the diagram
from the blog post comes from.

I also experimented a bit with something OMeta-like in PyPy, you can
look at the tests here:

http://codespeak.net/svn/pypy/dist/pypy/rlib/parsing/test/test_pypackrat.py

It supports left recursion and composability. I am not sure it is a
good idea, though. The resulting code is RPython and can be translated
to mostly efficient C.

 > What could be nice about this is bringing the tokenising and parsing
 > closer in spirit to the heart of PyPy, writing 'nicer' code, and
 > providing a (I think tantalising) way to try new syntax going forward.

We don't really use any advanced techniques for our Python parser,
because we want to be able to use CPython's grammar files, which contain
LL(1) grammars anyway, so why bother with anything advanced? If someone
wants to support user-modifiable grammars or something like that, they
can probably do so easily with PyPy. The PyPy core team usually keeps
out of such "language-designy" topics (although I guess I would be
interested at least in the implementation side of things if somebody
stepped forward to go in such a direction).
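
To illustrate why LL(1) needs nothing advanced, here is a sketch with a
made-up toy grammar (not CPython's actual Grammar file): one token of
lookahead always determines which production to use, so the parser never
backtracks at all.

```python
# Toy LL(1) grammar:
#   stmt -> 'pass' | 'return' expr
#   expr -> NAME
# Each function returns (node, next_index).

def parse_stmt(tokens, i):
    """Pick the production by looking at tokens[i] only - no backtracking."""
    if tokens[i] == 'pass':
        return ('pass',), i + 1
    if tokens[i] == 'return':
        node, j = parse_expr(tokens, i + 1)
        return ('return', node), j
    raise SyntaxError('unexpected token %r' % (tokens[i],))

def parse_expr(tokens, i):
    """expr -> NAME: accept a single identifier token."""
    if tokens[i].isidentifier():
        return ('name', tokens[i]), i + 1
    raise SyntaxError('expected a name, got %r' % (tokens[i],))

print(parse_stmt(['return', 'x'], 0))   # (('return', ('name', 'x')), 2)
print(parse_stmt(['pass'], 0))          # (('pass',), 1)
```

Since the first token decides everything, there is no memoization, no
choice point, and nothing for packrat machinery to speed up.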

 >
 > And there are things to play with on this page:
 > http://www.cs.ucla.edu/~awarth/ometa/ometa-js/
 >
 > * Parsing Expression Grammar
 >
 >
 > With regard to railroad diagrams (I think that's what they're called):
 >
 > There used to be a script that generated them - it's mentioned at the
 > top of the python grammar file, and here 
http://www.python.org/search/hypermail/python-1994q3/0294.html
 >
 > But I've seen discussion elsewhere that it has been lost :(
 >
 > How about this? 
http://www.informatik.uni-freiburg.de/~thiemann/haskell/ebnf2ps/README

ebnf2ps seems to use Haskell that modern Haskell compilers no longer
accept. I don't know any Haskell, but whacking at the source a bit
produced some sort of executable; I haven't tried to use it yet.

Cheers,

Carl Friedrich


