Re: [Python-ideas] Python-ideas Digest, Vol 103, Issue 49

This isn't directed at Andrew in particular but at the discussion in general, because from how everyone was using the words it wasn't clear to me that the distinction Andrew mentions had registered. PEGs are a class of grammars analogous to context-free grammars, regular grammars, or (to pick a more obscure example) Boolean grammars ( http://users.utu.fi/aleokh/boolean/ ). PEGs are probably incomparable to CFGs, in the sense that neither class contains the other, while Boolean grammars are a strict superset of CFGs.

Like other grammars, PEGs admit multiple parsing algorithms. As Andrew says, and as far as I know, OMeta parses PEGs with top-down recursive descent plus backtracking, which is why it can go exponential on some inputs. Packrat parsing is the algorithm Bryan Ford published at the same time he introduced PEGs; it parses PEGs in linear time by memoizing intermediate results so that backtracking never repeats work. It's not the only algorithm that can parse PEGs, however, and while I don't know of any work on this, it seems plausible that algorithms for parsing general CFGs, like GLL or Earley, could be adapted to PEGs. Likewise, memoization can be used to avoid exponential backtracking in a top-down recursive-descent parser for CFGs, though it's highly unlikely that any algorithm could achieve linear time on ambiguous CFGs.
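To make the memoization point concrete, here's a minimal sketch (my own toy code, not Ford's) of a packrat-style parser for the PEG rule Expr <- Num '+' Expr / Num, using functools.lru_cache as the memo table:

    from functools import lru_cache

    # Toy PEG:  Expr <- Num '+' Expr / Num
    # Each rule maps an input position to the position after its match,
    # or None on failure.  Memoizing on the position means every
    # (rule, position) pair is evaluated at most once, which is what
    # bounds the running time -- the essence of packrat parsing.
    def parses(text):
        @lru_cache(maxsize=None)
        def num(pos):
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            return end if end > pos else None

        @lru_cache(maxsize=None)
        def expr(pos):
            end = num(pos)
            if end is None:
                return None              # both alternatives need a Num
            if end < len(text) and text[end] == '+':
                rest = expr(end + 1)
                if rest is not None:
                    return rest          # first alternative: Num '+' Expr
            return end                   # ordered choice: fall back to Num

        return expr(0) == len(text)

    print(parses("1+2+3"))  # True
    print(parses("1+"))     # False

Delete the two lru_cache decorators and the same code is the naive backtracking recursive-descent parser; the memoization changes only the running time, not the language recognized.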
Do you have a source for the assertion that Earley is slower than GLR? I've heard many people say this, but I've never seen a formal comparison made since Masaru Tomita's 1985 book, "Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems." As far as I know, it is emphatically not true for asymptotic complexity. In 1991 Joop Leo published "A General Context-free Parsing Algorithm Running in Linear Time on Every LR(k) Grammar Without Using Lookahead," a modification of Earley's algorithm that makes it run in linear time on LR-regular grammars. The LR-regular grammars include the LR(k) grammars for every k and are in fact a strict superset of the deterministic grammars. GLR parsers, again as far as I know, run in linear time only on the grammars covered by the LR table they use, which in most cases means LR(1) or something similar. There are many variations on the GLR algorithm now, though, so it's possible there's one I don't know about without this limitation.

As for constant factors, it's possible that GLR beats Earley. I'm reluctant to assume Tomita's findings still hold, though, because hardware has changed radically since 1985 and because the original versions of both GLR and Earley had bugs (Earley's original algorithm mishandled nullable rules, and Tomita's original GLR failed on some grammars with epsilon rules). There are now versions of both algorithms that fix the bugs, but the fixes may have changed the constant factors.
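For anyone following along who hasn't seen Earley's algorithm, here's a minimal recognizer of my own, not Leo's optimized version, so it's worst-case cubic rather than linear on LR-regular grammars. The nullable-set precomputation is Aycock and Horspool's standard repair for the epsilon-rule bug in Earley's original formulation just mentioned:

    # A minimal Earley recognizer.  The grammar is a dict from nonterminal
    # to a list of right-hand sides (tuples of symbols); any symbol that
    # isn't a key is a terminal.  A state (head, body, dot, origin) means
    # body[:dot] has matched starting at chart position origin.
    def nullables(grammar):
        # Fixed point: a nonterminal is nullable if some alternative
        # consists entirely of nullable symbols.
        nullable, changed = set(), True
        while changed:
            changed = False
            for head, alts in grammar.items():
                if head not in nullable and any(
                        all(s in nullable for s in alt) for alt in alts):
                    nullable.add(head)
                    changed = True
        return nullable

    def recognizes(grammar, start, tokens):
        nullable = nullables(grammar)
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(start, body, 0, 0) for body in grammar[start]}
        for i in range(len(tokens) + 1):
            work = list(chart[i])
            while work:
                head, body, dot, origin = work.pop()
                if dot == len(body):
                    # Complete: advance states waiting on this nonterminal.
                    new = [(h, b, d + 1, o) for h, b, d, o in chart[origin]
                           if d < len(b) and b[d] == head]
                elif body[dot] in grammar:
                    # Predict: expand the nonterminal after the dot; the
                    # nullable check is the Aycock-Horspool epsilon fix.
                    new = [(body[dot], alt, 0, i) for alt in grammar[body[dot]]]
                    if body[dot] in nullable:
                        new.append((head, body, dot + 1, origin))
                else:
                    # Scan: match a terminal against the next token.
                    if i < len(tokens) and tokens[i] == body[dot]:
                        chart[i + 1].add((head, body, dot + 1, origin))
                    new = []
                for state in new:
                    if state not in chart[i]:
                        chart[i].add(state)
                        work.append(state)
        return any(h == start and d == len(b) and o == 0
                   for h, b, d, o in chart[-1])

    # An ambiguous grammar, which deterministic LR parsers reject outright:
    grammar = {'S': [('S', '+', 'S'), ('a',)]}
    print(recognizes(grammar, 'S', list('a+a+a')))  # True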
I'm working on a new parser library for Python at https://github.com/ceridwen/combinators which is intended to have some of the features that have been discussed here (scannerless/lexerless parsing, if one wants it; a general CFG algorithm rather than some subset thereof; linear-time parsing on a reasonable subset of CFGs; integrated semantic actions; embedding in Python rather than a separate DSL like EBNF) and some that haven't been. The research alone has been a lot of work, and the implementation no less so; I'd barely call what I have at the moment pre-alpha. I'm exploring an implementation based on a newer general CFG parsing algorithm, GLL ( https://www.royalholloway.ac.uk/computerscience/research/csle/gllparsers.asp... ), though I'd like to compare it to Earley on constant factors.
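To make "embedding in Python rather than a separate DSL" concrete, here's a toy sketch of what combinators written as ordinary Python expressions can look like. To be clear, this is illustrative only and is not the API of the library above, which barely exists yet:

    # A toy illustration of parser combinators embedded directly in
    # Python rather than written in a separate DSL like EBNF.  This is
    # NOT the API of the library above; it only shows the general style.
    class Parser:
        def __init__(self, fn):
            self.fn = fn  # fn(text, pos) -> (value, next_pos) or None

        def __call__(self, text, pos=0):
            return self.fn(text, pos)

        def __or__(self, other):   # choice: p | q
            return Parser(lambda text, pos:
                          self(text, pos) or other(text, pos))

        def __add__(self, other):  # sequence: p + q
            def seq(text, pos):
                first = self(text, pos)
                if first is None:
                    return None
                second = other(text, first[1])
                if second is None:
                    return None
                return ((first[0], second[0]), second[1])
            return Parser(seq)

    def lit(s):
        # Match the literal string s at the current position.
        return Parser(lambda text, pos:
                      (s, pos + len(s)) if text.startswith(s, pos) else None)

    greeting = lit("hello ") + (lit("world") | lit("there"))
    print(greeting("hello there"))  # (('hello ', 'there'), 11)

The point of the embedding is that grammar definitions are ordinary Python values, so they can be composed, abstracted over, and tested with ordinary Python tools instead of a separate grammar-file toolchain.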
Cara