On Mon, Jul 15, 2019 at 8:47 PM Andrew Barnert firstname.lastname@example.org wrote:
> On Jul 15, 2019, at 18:44, Nam Nguyen email@example.com wrote:
>> I have implemented a tiny (~200 SLOC) package at https://gitlab.com/nam-nguyen/parser_compynator that demonstrates that something like this is possible. There are several examples to give you a feel for it, as well as some early benchmark numbers to consider. This is far smaller than any of the Python parsing libraries I have looked at, yet more universal than many of them. I hope that it will convert the skeptics ;).
> For at least some of your use cases, I don’t think it’s a problem that it’s 70x slower than the custom parsers you’d be replacing. How often do you need to parse a million URLs in your inner loop? Also, if the function composition is really the performance hurdle, can you optimize that away relatively simply, just by building an explicit tree (expression-template style) and walking the tree in a __call__ method, rather than building an implicit tree of nested calls? (And that could be optimized further if needed, e.g. by turning the tree walk into a simple virtual machine where all of the fundamental operations are inlined into the loop, and maybe even accelerating that with C code.)
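[That expression-template idea might look roughly like the following. This is a hypothetical sketch, not the parser_compynator API: `Lit` and `Seq` are invented names, and the set-of-(position, value) result convention is assumed from the readme's description.]

```python
# Hypothetical sketch: instead of composing parsers as nested closures,
# build an explicit tree of node objects and walk it in __call__, so a
# parse is one tree walk rather than a deep stack of wrapped calls.

class Lit:
    """Leaf node: matches a literal string.

    Returns a set of (next_position, value) pairs; an empty set means
    the parse failed at this node.
    """
    def __init__(self, s):
        self.s = s

    def __call__(self, text, pos=0):
        if text.startswith(self.s, pos):
            return {(pos + len(self.s), self.s)}
        return set()

class Seq:
    """Interior node: matches its children one after another."""
    def __init__(self, *children):
        self.children = children

    def __call__(self, text, pos=0):
        results = {(pos, ())}
        for child in self.children:
            # Feed every partial result into the next child.
            results = {(new_pos, matched + (value,))
                       for (p, matched) in results
                       for (new_pos, value) in child(text, p)}
        return results

ab = Seq(Lit("a"), Lit("b"))
print(ab("abc"))  # {(2, ('a', 'b'))}
print(ab("xbc"))  # set()
```

A tree like this could later be compiled into a flat instruction list for a small virtual machine, which is the further optimization suggested above.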
> But I do think it’s a problem that there seems to be no way to usefully indicate failure to the caller, and I’m not sure that could be fixed as easily.
An empty set signifies the parse has failed. Perhaps I have misunderstood what you indicated here.
> Invalid inputs in your readme examples don’t fail, they successfully return an empty set.
Because the library supports ambiguity, it can return more than one parse result. The guarantee here is that if it returns an empty set, the parse has failed.
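[To illustrate the convention being described, here is a minimal sketch, not the library's actual API: a parser is assumed to be a function from (text, position) to a set of (next_position, value) pairs, with `lit` and `alt` as invented helper names.]

```python
# Illustrative sketch of "empty set means failure": ambiguity means the
# result set can hold several parses; failure means the set is empty.

def lit(s):
    """Parser that matches the literal string s."""
    def parse(text, pos=0):
        return {(pos + len(s), s)} if text.startswith(s, pos) else set()
    return parse

def alt(*parsers):
    """Parser that tries every alternative and keeps all successes."""
    def parse(text, pos=0):
        return {r for p in parsers for r in p(text, pos)}
    return parse

a_or_ab = alt(lit("a"), lit("ab"))
print(a_or_ab("ab"))   # two results: {(1, 'a'), (2, 'ab')}
print(a_or_ab("xy"))   # set() -- the parse failed
```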
> There also doesn’t seem to be any way to trigger a hard fail rather than a backtrack.
You can have a parser that raises an exception. None of the primitive parsers do that though.
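[A combinator along these lines could turn a soft failure into a hard one. This is a hypothetical sketch, not part of the library: `cut`, `lit`, and `ParseError` are invented names, and the set-based result convention is assumed.]

```python
# Hypothetical "hard fail" combinator: wrap a parser so that an empty
# result set (a soft failure that would normally backtrack) raises an
# exception instead.

class ParseError(ValueError):
    pass

def lit(s):
    def parse(text, pos=0):
        return {(pos + len(s), s)} if text.startswith(s, pos) else set()
    return parse

def cut(parser, message):
    """Raise ParseError instead of returning an empty set."""
    def parse(text, pos=0):
        results = parser(text, pos)
        if not results:
            raise ParseError(f"{message} at position {pos}")
        return results
    return parse

closing = cut(lit("]"), "expected ']'")
print(closing("]rest"))  # {(1, ']')}
try:
    closing("oops")
except ParseError as e:
    print(e)  # expected ']' at position 0
```

Note that the wrapper at least knows the source position, which partially addresses the "meaningless exception" concern raised below.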
> So I’m not sure how a real urlparse replacement could do the things the current one does, like raising a ValueError on https://abc.d[ef.ghi/ complaining that the netloc looks like an invalid IPv6 address. (Maybe you could def a function that raises a ValueError and attach it as a where somewhere in the parser tree? But even if that works, wouldn’t you get a meaningless exception that doesn’t have any information about where in the source text or where in the parse tree it came from or why it was raised, and, as your readme says, a stack trace full of garbage?)
urlparse right now raises ValueError('Invalid IPv6 URL'). It does not mention where in the source text the error comes from.
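[For reference, the current behavior is easy to demonstrate: urlparse raises the ValueError with no position information when the netloc contains an unmatched bracket.]

```python
# Current stdlib behavior: a netloc with '[' but no matching ']' is
# treated as a malformed IPv6 address, and no source position is given.
from urllib.parse import urlparse

try:
    urlparse("https://abc.d[ef.ghi/")
except ValueError as e:
    print(e)  # Invalid IPv6 URL
```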
> Can you add failure handling without breaking the “~200 LOC and easy to read” feature of the library, and without breaking the “easy to read once you grok parser combinators” feature of the parsers built with it?
This is a good request. I will have to play around with this idea more. What I think could be the most challenging task is to attribute failure to the appropriate rule(s) (e.g. expr expects term + term, but the input only has term +). I feel like some metadata about the grammar might be required here, and that might be too unwieldy to provide in a parser combinator formulation. Interestingly enough, regex doesn't have anything like this either.
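[One common technique for this in parser combinators, sketched below under assumed names (`Farthest`, `lit`, `seq` are all invented, not the library's API), is to record the furthest position any parser attempted. A failed parse can then report at least "error at or after position N", without grammar metadata.]

```python
# Furthest-failure tracking: every parser notes the deepest position it
# tried, so a failed parse can point at roughly where things went wrong.

class Farthest:
    """Mutable record of the deepest position any parser attempted."""
    def __init__(self):
        self.pos = 0

def lit(s, farthest):
    def parse(text, pos=0):
        farthest.pos = max(farthest.pos, pos)
        return {(pos + len(s), s)} if text.startswith(s, pos) else set()
    return parse

def seq(*parsers):
    def parse(text, pos=0):
        results = {(pos, ())}
        for p in parsers:
            results = {(np, vs + (v,))
                       for (q, vs) in results
                       for (np, v) in p(text, q)}
        return results
    return parse

farthest = Farthest()
term_plus_term = seq(lit("term", farthest),
                     lit("+", farthest),
                     lit("term", farthest))

# "term +" with the second term missing: the parse fails, but we know
# the failure happened at or after position 5.
assert term_plus_term("term+") == set()
print(f"parse failed; error at or after position {farthest.pos}")  # position 5
```

This doesn't name the offending rule, but pairing the position with the last rule that advanced `farthest` would be a small further step.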