[Python-ideas] Python-ideas Digest, Vol 103, Issue 49

Tue Jun 9 16:29:36 CEST 2015

This isn't directed at Andrew in particular but the discussion in
general, because it wasn't clear to me from how everyone was using the
words that the distinction Andrew mentions was clear.  PEGs are a class
of grammars analogous to context-free grammars, regular grammars, or (to
choose a more obscure example) Boolean grammars
( http://users.utu.fi/aleokh/boolean/ ).  PEGs are probably not
comparable to CFGs, while Boolean grammars are a strict superset of
CFGs.  Like other grammars, PEGs admit multiple parsing algorithms.  As
Andrew says and as far as I know, OMeta uses a top-down
recursive-descent algorithm with backtracking for parsing PEGs, which is
why it can go exponential on some inputs.  Packrat parsing is the
algorithm that Bryan Ford published at the same time as he introduced
PEGs, and it can parse PEGs in linear time by using memoization instead
of backtracking.  However, it's not the only algorithm that can parse
PEGs, and while I don't know any work on this, it seems plausible that
algorithms for parsing general CFGs like GLL or Earley could be adapted
to parsing PEGs.  Likewise, memoization can be used to avoid
backtracking in a top-down recursive-descent parser for CFGs, though
it's highly unlikely that any algorithm could achieve linear time for
ambiguous CFGs.

> > I don't know about OMeta, but the Earley parsing algorithm is
> worst-cast cubic time "quadratic time for unambiguous grammars, and
> linear time for almost all LR(k) grammars".
> 
> I don't know why you'd want to use Earley for parsing a programming
> language. IIRC, it was the first algorithm that could handle rampant
> ambiguity in polynomial time, but that isn't relevant to parsing
> programming languages (especially one like Python, which was
> explicitly designed to be simple to parse), and it isn't relevant to
> natural languages if you're not still in the 1960s, except in learning
> the theory and history of parsing. GLR does much better in
> almost-unambiguous/almost-deterministic languages; CYK can be easily
> extended with weights (which propagate sensibly, so you can use them
> for a final judgment, or to heuristically prune alternatives as you
> go); Valiant is easier to reason about mathematically; etc. And that's
> just among the parsers in the same basic family as Earley.

Do you have a source for the assertion that Earley is slower than GLR?
I've heard many people say this, but I've never seen any formal
comparisons made since Masaru Tomita's 1985 book, "Efficient parsing for
natural language: A fast algorithm for practical systems."

As far as I know, this is emphatically not true for asymptotic
complexity.  In 1991 Joop Leo published “A General Context-free Parsing
Algorithm Running in Linear Time on Every LR(K) Grammar Without Using
Lookahead," a modification to Earley's algorithm that makes it run in
linear time on LR-regular grammars.  The LR-regular grammars include the
LR(k) grammars for all k and are in fact a strict superset of the
deterministic grammars.  Again, as far as I know, GLR parsers run in
linear times on grammars depending on the LR table they're using, which
means in most cases LR(1) or something similar.  There are many
variations of the GLR algorithm now, though, so it's possible there's
one I don't know about that doesn't have this limitation.

As for constant factors, it's possible that GLR is better than Earley.
I'm reluctant to assume that Tomita's findings still hold, though,
because hardware has changed radically since 1985 and because both the
original versions of both GLR and Earley had bugs.  While there are now
versions of both algorithms that fix the bugs, the fixes may have
changed the constant factors.

> In fact, even if I wanted to write an amazing parser library for
> Python (and I kind of do, but I don't know if I have the time), I
> still don't think I'd want to suggest it as a replacement for the
> parser in CPython. Writing all the backward-compat adapters and
> porting the Python parser over with all its quirks intact and building
> the tests to prove that it's performance and error handling were
> strictly better and so on wouldn't be nearly as much fun as other
> things I could do with it.

I'm working on a new parser library for Python at
https://github.com/ceridwen/combinators that I intend to have some of
the features that have been discussed here (it can be used for
scanner-less/lexer-less parsing, if one wants; a general CFG algorithm
rather than some subset thereof; linear-time parsing on a reasonable
subset of CFGs; integrated semantic actions; embedding in Python rather
than having a separate DSL like EBNF) and some others that haven't been.
The research alone has been a lot of work, and the implementation no
less so.  I'd barely call what I have at the moment pre-alpha.  I'm
exploring an implementation based on a newer general CFG parsing
algorithm, GLL
( https://www.royalholloway.ac.uk/computerscience/research/csle/gllparsers.aspx ), though I'd like to compare it to Earley on constant factors.

Cara