Fwd: New Javascript parser in the works
Forgot to send to pypy-dev. I have now committed code that was supposed to do what I wanted with the tests, in revision 42384. But the code is ugly, my visitor somehow doesn't visit everything it should, and I am too sleepy to code. Am I doing things the wrong way, or is doing what I wanted with the tests really that hard?
Begin forwarded message:
From: Leonardo Santagada <santagada@gmail.com>
Date: 27 April 2007 19:24:15 GMT-03:00
To: Carl Friedrich Bolz <cfbolz@gmx.de>
Subject: Re: [pypy-dev] New Javascript parser in the works
On 27/04/2007, at 17:11, Carl Friedrich Bolz wrote:
Carl Friedrich Bolz wrote:
Just looking at the grammar made me note the following problem: [snip]
hm, now that I found this problem, which grammar are you using exactly? The one at
The one in the standard, as you suggested to me... it is at: http://www.ecma-international.org/publications/standards/Ecma-262.htm
http://www.mozilla.org/js/language/js20/formal/parser-grammar.html
gives the rules for addition correctly:
AdditiveExpression ==>
    MultiplicativeExpression
  | AdditiveExpression + MultiplicativeExpression
  | AdditiveExpression - MultiplicativeExpression
or did you just rewrite it incorrectly?
I both rewrote it incorrectly and there is also a problem: if I put MultiplicativeExpression first, as in the standard and in the Mozilla grammar, code like 5+4 gets interpreted as a single multiplicative expression containing 5. Somehow the parser doesn't consume all the input and still thinks it is valid.
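The 5+4 symptom is consistent with how ordered choice behaves in a PEG/packrat parser: if the shorter alternative is tried first and succeeds on 5, the longer alternative is never attempted. Below is a minimal hand-written sketch of that effect. It is not the PyPy parsing module; all function names are hypothetical, `number` stands in for MultiplicativeExpression, and the left-recursive rule is rewritten as right recursion (which changes associativity) because a naive recursive function cannot call itself at the same position.

```python
import re

NUMBER = re.compile(r"\d+")

def number(text, pos):
    """Match an integer literal at pos; return (node, new_pos) or None."""
    m = NUMBER.match(text, pos)
    return (("num", m.group()), m.end()) if m else None

def additive_short_first(text, pos):
    """additive <- number / number '+' additive   (short alternative first)."""
    # Ordered choice commits to the first alternative that succeeds. The
    # second alternative also starts with a number, so it can only be
    # reached when the first one fails, and then it fails as well.
    return number(text, pos)

def additive_long_first(text, pos):
    """additive <- number '+' additive / number   (long alternative first)."""
    left = number(text, pos)
    if left is None:
        return None
    node, after = left
    if after < len(text) and text[after] == "+":
        right = additive_long_first(text, after + 1)
        if right is not None:
            return (("add", node, right[0]), right[1])
    return left  # fall back to the short alternative

def parse(rule, text):
    """Run a rule from position 0 and insist that all input is consumed."""
    result = rule(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("could not parse all of %r" % text)
    return result[0]
```

With the short alternative first, `additive_short_first("5+4", 0)` stops after the `5`; it is only the end-of-input check in `parse` that turns this into an error rather than a silently accepted prefix. With the long alternative first, `parse(additive_long_first, "5+4")` yields an `add` node with two numbers.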
The other problem was that I didn't know how to fix it, so I just did what I could, expecting that once there was some code people would help me more (this one I got right :). The biggest problem with parsing is that no one is using it besides you... at least you help me a lot, so I'm not really complaining.
About tests, I have this problem: I want to create generator tests like some I already have, but I would like to have lots of lines of tests, maybe with comments, and maybe setting which start symbol to try to match. I would also like to be able to compare results with checks such as counting the nodes of a given type in the tree (e.g. 5+5 should have 2 numeric literals). The perfect thing would be to print a trace of the packrat parser, showing which rules it is executing. I really want a better test suite: a test that just fails for a reason I do not know, or that passes while doing things completely wrong, is not sufficient for me now... I will work on making this test tool.
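One way such a test tool could look, as a rough sketch only: a text table with one case per line (start symbol, source, expected node counts, optional `#` comments) plus a helper that counts node kinds in a tree. The `Node` class, the line format and every function name here are assumptions made for illustration; the real tree classes in the parser may look quite different, and the simple line format would break on JavaScript sources that themselves contain `#` or `=>`.

```python
from collections import Counter

class Node:
    """Hypothetical parse-tree node: a kind (rule name) plus children."""
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def count_kinds(node):
    """Count how many nodes of each kind occur in the tree rooted at node."""
    counts = Counter({node.kind: 1})
    for child in node.children:
        counts.update(count_kinds(child))
    return counts

def read_cases(text):
    """Read lines of the form 'StartSymbol: source => Kind=count, ...'."""
    cases = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        head, expected = line.rsplit("=>", 1)
        start, source = head.split(":", 1)
        wanted = {}
        for item in expected.split(","):
            kind, number = item.split("=")
            wanted[kind.strip()] = int(number)
        cases.append((start.strip(), source.strip(), wanted))
    return cases

def mismatches(tree, wanted):
    """Return {kind: (expected, found)} for every count that differs."""
    counts = count_kinds(tree)
    return {kind: (n, counts[kind])
            for kind, n in wanted.items() if counts[kind] != n}
```

A case line such as `AdditiveExpression: 5+5 => NumericLiteral=2` would then be fed to the parser with the given start symbol, and `mismatches(tree, wanted)` reports which node counts are off instead of a bare pass or fail.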
Now about semicolons: how should I deal with them? In the spec the grammar doesn't deal with them, and in the Mozilla one I don't see how they are doing it either. Since we have established that, as the parsing module works today, it is not possible to do automatic semicolon insertion, can we do "forced semicolon presence" as seen in C and Java? (A lightbulb just lit up here: maybe I should look at the grammar of either of those two languages.)
Cheers,
Carl Friedrich
-- Leonardo Santagada santagada@gmail.com
On Sat, 28 Apr 2007 02:50:38 +0200, Leonardo Santagada <santagada@gmail.com> wrote:
Now about semicolons: how should I deal with them? In the spec the grammar doesn't deal with them, and in the Mozilla one I don't see how they are doing it either. Since we have established that, as the parsing module works today, it is not possible to do automatic semicolon insertion, can we do "forced semicolon presence" as seen in C and Java? (A lightbulb just lit up here: maybe I should look at the grammar of either of those two languages.)
This is the biggest and hardest problem in js parsing. The spec does define how to handle it, though I'm not sure how the grammar reflects that. Looking at C doesn't help, because there the semicolon always needs to be present and can't be replaced with newlines. The problem for a js parser is that newlines aren't really whitespace, just like in Python. But the rules are weird, because newlines are only sometimes relevant, not every time. A js parser which doesn't handle this correctly is, in my opinion, just wrong. You couldn't parse any real-world JavaScript with it.
Regards, Florian Schulze
On 28/04/2007, at 09:47, Florian Schulze wrote:
This is the biggest and hardest problem in js parsing. The spec does define how to handle it, though I'm not sure how the grammar reflects that. Looking at C doesn't help, because there the semicolon always needs to be present and can't be replaced with newlines. The problem for a js parser is that newlines aren't really whitespace, just like in Python. But the rules are weird, because newlines are only sometimes relevant, not every time. A js parser which doesn't handle this correctly is, in my opinion, just wrong. You couldn't parse any real-world JavaScript with it.
Internet Explorer doesn't do ASI (automatic semicolon insertion; from now on I will call it ASI) properly, so by your account it is wrong... If I remember correctly, all modern JavaScript frameworks use ES (explicit semicolons) because of differences between real-world parsers. What we can do is provide a tool to put the ; into your legacy code, but that is something people should be doing anyway, so it is not a biggie for me. So I am going with ES for now... when cfbolz and I get some time together to work on it, we can try to follow the spec exactly. And the formal grammar doesn't reflect this at all: besides being wrong (there is an erratum on the Mozilla site that fixes it a bit) and having some parts missing, it completely ignores ASI and presents it only as textual information (a very, very informal specification). I'm going to focus on making a real test suite for my js parser... if it doesn't drive me crazy I will be back here.
Regards, Florian Schulze
-- Leonardo Santagada santagada@gmail.com
Florian Schulze wrote:
On Sat, 28 Apr 2007 02:50:38 +0200, Leonardo Santagada <santagada@gmail.com> wrote:
Now about semicolons: how should I deal with them? In the spec the grammar doesn't deal with them, and in the Mozilla one I don't see how they are doing it either. Since we have established that, as the parsing module works today, it is not possible to do automatic semicolon insertion, can we do "forced semicolon presence" as seen in C and Java? (A lightbulb just lit up here: maybe I should look at the grammar of either of those two languages.)
This is the biggest and hardest problem in js parsing. The spec does define how to handle it, though I'm not sure how the grammar reflects that. Looking at C doesn't help, because there the semicolon always needs to be present and can't be replaced with newlines. The problem for a js parser is that newlines aren't really whitespace, just like in Python. But the rules are weird, because newlines are only sometimes relevant, not every time. A js parser which doesn't handle this correctly is, in my opinion, just wrong. You couldn't parse any real-world JavaScript with it.
The part of the spec that describes the automatic semicolon insertion is completely silly, in my opinion. It goes something like this:
    When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true: ...
This is completely crazy, because it effectively forces you to write a parser using a left-to-right parsing technique, and also a parser that works with exactly one token of lookahead. The second part is what makes packrat parsing fail: since it uses arbitrarily many tokens of lookahead, you cannot really determine what an "offending token" is, because you cannot distinguish it from normal backtracking. I don't see how you can fix that, really.
Now you have basically two choices. You can force all semicolons to be present, which makes most code out there not parse. The other one is more brainstormy; it does not work as I describe it, but maybe someone has an idea to get it to work: you could change the grammar to be very lenient with semicolons (at least for a packrat parser this might be easy), meaning that it will accept programs as valid that existing Javascript engines will reject. Something like this:
    a = b c = d
would be valid. This opens its own set of problems, such as:
    a = b ++ c
Which would most likely be parsed to be equivalent to:
    a = b++; c;
Whereas with the spec it is:
    a = b; ++c;
No clue how to fix that, yet.
Cheers, Carl Friedrich
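To make the `++` case from the message above concrete, here is a toy sketch, assuming the input has a line break between `b` and `++` as in the example given in the spec's own section on automatic semicolon insertion. It is not the spec's algorithm and not a parser: it handles only identifiers, `=`, `++` and `;`, and the function names are hypothetical. The one idea it illustrates is that each token remembers whether a line break preceded it, which is enough to decide whether `++` attaches to the operand before it or starts a new statement.

```python
import re

TOKEN = re.compile(r"\s*(\+\+|[A-Za-z_]\w*|=|;)")

def tokenize(source):
    """Return (text, newline_before) pairs for the toy token set."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError("unexpected input at position %d" % pos)
        skipped = source[pos:m.start(1)]          # whitespace before the token
        tokens.append((m.group(1), "\n" in skipped))
        pos = m.end()
    return tokens

def with_semicolons(source):
    """Insert ';' where a line break separates two toy statements."""
    out = []
    ends_operand = False   # True if the previous token could end a statement
    for text, newline_before in tokenize(source):
        if text == "++":
            if ends_operand and not newline_before:
                out.append(text)                  # postfix: b++
                continue
            if ends_operand:
                out.append(";")                   # line break: ++ is a prefix
            out.append(text)
            ends_operand = False
        elif text in ("=", ";"):
            out.append(text)
            ends_operand = False
        else:                                     # identifier
            if ends_operand and newline_before:
                out.append(";")
            out.append(text)
            ends_operand = True
    if ends_operand:
        out.append(";")
    return " ".join(out)
```

Under these assumptions, `with_semicolons("a = b\n++c")` gives `a = b ; ++ c ;`, matching the spec's reading, while `with_semicolons("a = b++\nc")` gives `a = b ++ ; c ;`. A grammar that simply ignores line breaks cannot tell these two inputs apart.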
Florian Schulze wrote:
On Sat, 28 Apr 2007 02:50:38 +0200, Leonardo Santagada <santagada@gmail.com> wrote:
Now about semicolons: how should I deal with them? In the spec the grammar doesn't deal with them, and in the Mozilla one I don't see how they are doing it either. Since we have established that, as the parsing module works today, it is not possible to do automatic semicolon insertion, can we do "forced semicolon presence" as seen in C and Java? (A lightbulb just lit up here: maybe I should look at the grammar of either of those two languages.)
This is the biggest and hardest problem in js parsing. The spec does define how to handle it, though I'm not sure how the grammar reflects that. Looking at C doesn't help, because there the semicolon always needs to be present and can't be replaced with newlines. The problem for a js parser is that newlines aren't really whitespace, just like in Python. But the rules are weird, because newlines are only sometimes relevant, not every time. A js parser which doesn't handle this correctly is, in my opinion, just wrong. You couldn't parse any real-world JavaScript with it.
The part of the spec that describes the automatic semicolon insertion is completely silly, in my opinion. It goes like this:
    When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true: ...
This is completely crazy, because it effectively forces you to write a parser using a left-to-right parsing technique with exactly one token of lookahead, by hand (because a lot of parsing frameworks don't allow you to do the customization that is necessary). The lookahead part is what makes packrat parsing fail: since it uses arbitrarily many tokens of lookahead, you cannot really determine what an "offending token" is, because you cannot distinguish it from normal backtracking. I don't see how you can fix that, really.
Now you have basically two choices. You can force the user to insert all semicolons, which makes most code out there not parse. The other one is more brainstormy; it does not work as I describe it, but maybe someone has an idea to get it to work: you could change the grammar to be very lenient with semicolons (at least for a packrat parser this might be easy), meaning that it will accept programs as valid that existing Javascript engines will reject. Something like this:
    a = b c = d
would be valid. This opens its own set of problems, such as:
    a = b ++ c
Which would most likely be parsed to be equivalent to:
    a = b++; c;
Whereas with the spec it is:
    a = b; ++c;
No clue how to fix that, yet. Maybe you could do something similar to what Python does: insert newline tokens into the token stream and remove those between matched pairs of parentheses.
Cheers, Carl Friedrich
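The Python-like idea at the end of the message above could be sketched roughly as follows: the tokenizer emits an explicit newline token at each line break, except inside matched parentheses, and the grammar can then treat that token as a possible statement terminator. This is only an illustration under assumptions; the token set is tiny, the `NEWLINE` marker and function name are hypothetical, and it does not address the cases where JavaScript continues a statement across a line break outside parentheses.

```python
import re

NEWLINE = "<NEWLINE>"   # hypothetical marker; cannot clash with an identifier
TOKEN = re.compile(r"[ \t]*(\n|\+\+|[A-Za-z_]\w*|\d+|[=+\-*/;(),])")

def tokens_with_newlines(source):
    """Tokenize, keeping line breaks as tokens unless inside parentheses."""
    tokens, depth, pos = [], 0, 0
    source = source.rstrip(" \t")
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            raise ValueError("unexpected input at position %d" % pos)
        text = m.group(1)
        pos = m.end()
        if text == "(":
            depth += 1
        elif text == ")":
            depth = max(depth - 1, 0)
        if text == "\n":
            if depth > 0:
                continue                  # line break inside ( ... ): drop it
            if not tokens or tokens[-1] == NEWLINE:
                continue                  # skip blank lines
            tokens.append(NEWLINE)
        else:
            tokens.append(text)
    return tokens
```

For example, `tokens_with_newlines("f(a,\n  b)\n")` keeps only the final line break, since the first one falls between the parentheses, whereas in `"a = b\nc = d\n"` both line breaks survive as `NEWLINE` tokens that a statement rule could accept in place of `;`.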
participants (3)
- Carl Friedrich Bolz
- Florian Schulze
- Leonardo Santagada