[pypy-dev] Fwd: New Javascript parser in the works

Sat Apr 28 18:22:50 CEST 2007

Florian Schulze wrote:
 > On Sat, 28 Apr 2007 02:50:38 +0200, Leonardo Santagada
 > <santagada at gmail.com> wrote:
 >
 >
 >>> Now about semicolons, how should I deal with them? in the spec the
 >>> grammar doesn't deal with them and in the mozilla one I don't see
 >>> how they are doing it also. As we have set that as the parsing
 >>> module works today it is not possible to do automatic semicolon
 >>> insertion, can we do "forced semicolon presence" as seen on C and
 >>> Java? (some lightbulb just lightened up here, maybe I should look
 >>> for the grammar of any of those two languages)
 >
 > This is the biggest and hardest problem about js parsing. The spec does
 > define how to handle it, I'm not sure now how the grammar reflects that
 > though. Looking at C doesn't help, because there it always needs to be
 > present and can't be replaced with newlines. The problem with a js parser
 > is, that newlines aren't really whitespace, just like in python. But the
 > rules are weird, because newlines are only sometimes relevant, not
 > everytime. A js parser which doesn't handle this correctly is in my
 > opinion just wrong. You couldn't parse any real world javascript with it.

The part of the spec that describes the automatic semicolon insertion is
completely silly, in my opinion. It goes something like this:

     When, as the program is parsed from left to right, a token (called
     the offending token) is encountered that is not allowed by any
     production of the grammar, then a semicolon is automatically
     inserted before the offending token if one or more of the following
     conditions is true:

     ...

This is completely crazy, because it effectively forces you to write a
parser that works strictly left to right and that uses exactly one token
of lookahead. The second part is what makes packrat parsing fail: a
packrat parser uses arbitrarily many tokens of lookahead, so you cannot
really determine what an "offending token" is, since you cannot
distinguish it from a token that merely triggers normal backtracking.
I don't see how you can fix that, really.
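
To make the "offending token" rule concrete, here is a minimal sketch of
a hand-written left-to-right parser with one token of lookahead. It is
not a packrat parser and not our parsing module; the toy grammar
(identifiers, `=`, `++`, `;`) and the names `tokenize` and `parse` are
made up for illustration. It only handles the "preceded by a newline"
and "end of input" conditions, plus the rule that a postfix `++` has to
be on the same line as its operand.

```python
import re

# Hypothetical toy token set: identifiers, '++', '=' and ';'.
_TOKEN = re.compile(r"(\s*)(\+\+|[A-Za-z_]\w*|=|;)")
_IDENT = re.compile(r"[A-Za-z_]\w*")
EOF = "<eof>"


def tokenize(src):
    """Return (text, newline_before) pairs, ending with an EOF marker."""
    tokens, pos = [], 0
    while True:
        m = _TOKEN.match(src, pos)
        if m is None:
            break
        # Remember whether a line break separated this token from the last.
        tokens.append((m.group(2), "\n" in m.group(1)))
        pos = m.end()
    if src[pos:].strip():
        raise SyntaxError("unexpected character near offset %d" % pos)
    tokens.append((EOF, False))
    return tokens


def parse(src):
    """Parse the toy language into a list of statement strings."""
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1

    def ident():
        text, _ = peek()
        if not _IDENT.fullmatch(text):
            raise SyntaxError("expected identifier, got %r" % text)
        advance()
        return text

    def expr():
        if peek()[0] == "++":          # prefix increment
            advance()
            return "++" + ident()
        name = ident()
        text, newline_before = peek()
        if text == "=":
            advance()
            return name + " = " + expr()
        # Postfix '++' is only taken when it sits on the same line.
        if text == "++" and not newline_before:
            advance()
            return name + "++"
        return name

    statements = []
    while peek()[0] != EOF:
        statements.append(expr())
        text, newline_before = peek()
        if text == ";":
            advance()
        elif text == EOF or newline_before:
            pass  # offending token: behave as if ';' had been written
        else:
            raise SyntaxError("unexpected token %r" % text)
    return statements
```

The semicolon decision is taken at exactly one place, looking at exactly
one token, and the parser never backtracks past it. That is the property
a packrat parser does not give you.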

Now you have basically two choices: you can require all semicolons to be
present, which makes most code out there not parse.  The other one is
more of a brainstorm; it does not work as I describe it, but maybe
someone has an idea to get it to work: you could change the grammar to
be very lenient with semicolons (at least for a packrat parser this
might be easy), meaning that it will accept programs as valid that
existing Javascript engines will reject. Something like this:

     a = b c = d

would be valid. This opens its own set of problems such as:

     a = b
     ++ c

Which would most likely be parsed to be equivalent to:

     a = b++;
     c;

Whereas with the spec it is:

     a = b;
     ++c;

No clue how to fix that, yet.
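
For reference, the lenient idea and its problem can be reproduced with a
few lines. This is only a sketch under my own assumptions: the function
name `lenient_parse` and the toy grammar are made up, there is no error
handling, and newlines are thrown away as ordinary whitespace, which is
exactly what causes the wrong result.

```python
import re


def lenient_parse(src):
    """Toy parser where ';' is optional and newlines are plain whitespace."""
    # Characters outside the toy token set are silently skipped.
    tokens = re.findall(r"\+\+|[A-Za-z_]\w*|=|;", src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        if peek() == "++":             # prefix increment
            take()
            return "++" + take()
        name = take()
        if peek() == "=":
            take()
            return name + " = " + expr()
        if peek() == "++":             # greedy: grabs '++' across a newline
            take()
            return name + "++"
        return name

    statements = []
    while peek() is not None:
        statements.append(expr())
        if peek() == ";":              # semicolon accepted but never required
            take()
    return statements
```

With this, `lenient_parse("a = b c = d")` gives two statements as
intended, but `lenient_parse("a = b\n++ c")` gives `a = b++` followed by
`c`, not what the spec asks for.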

Cheers,

Carl Friedrich