Fwd: New Javascript parser in the works

Forgot to send to pypy-dev. And now I have committed code that was supposed to do what I wanted with the tests, in revision 42384. But the code is ugly, my visitor somehow doesn't visit every node it should, and I am too sleepy to code. Am I doing things the wrong way, or is doing what I wanted with the tests really that hard?

Begin forwarded message:
-- Leonardo Santagada santagada@gmail.com

On Sat, 28 Apr 2007 02:50:38 +0200, Leonardo Santagada <santagada@gmail.com> wrote:
This is the biggest and hardest problem in JS parsing. The spec does define how to handle it; I'm not sure right now how the grammar reflects that, though. Looking at C doesn't help, because there the semicolon always needs to be present and can't be replaced with newlines. The problem with a JS parser is that newlines aren't really whitespace, just like in Python. But the rules are weird, because newlines are only sometimes relevant, not every time. A JS parser which doesn't handle this correctly is, in my opinion, just wrong. You couldn't parse any real-world JavaScript with it.

Regards, Florian Schulze

On 28/04/2007, at 09:47, Florian Schulze wrote:
Internet Explorer doesn't do ASI properly (automatic semicolon insertion; from now on I will call it ASI), so by your account it is wrong... If I remember correctly, all modern JavaScript frameworks use ES (explicit semicolons) because of differences between real-world parsers. What we can do is provide a tool to put the ; into your legacy code, but that is something people should be doing anyway, so it is not a biggie for me.

So I am going with ES for now... when cfbolz and I get some time together to work on it, we can try to follow the spec exactly.

And the formal grammar doesn't reflect this at all. Besides being wrong (there is an erratum on the Mozilla site that fixes it a bit) and having some parts missing, it completely ignores ASI, and the spec presents it only as textual information (a very, very informal specification).

I'm going to focus on making a real test suite for my JS parser... if it doesn't drive me crazy I will be back here.
-- Leonardo Santagada santagada@gmail.com

Florian Schulze wrote:
The part of the spec that describes the automatic semicolon insertion is completely silly, in my opinion. It goes something like this:

    When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true: ...

This is completely crazy, because it effectively forces you to write a parser using a left-to-right parsing technique, and also a parser that works with exactly one token of lookahead. The second part is what makes packrat parsing fail: a packrat parser uses arbitrarily many tokens of lookahead, so you cannot really determine what an "offending token" is, since you cannot distinguish it from normal backtracking. I don't see how you can fix that, really.

Now you have basically two choices. You can force all semicolons to be inserted explicitly, which makes most code out there not parse. The other one is more of a brainstorming idea; it does not work as I describe it, but maybe someone has an idea to get it to work: you could change the grammar to be very lenient with semicolons (at least for a packrat parser this might be easy), meaning that it will accept programs as valid that existing JavaScript engines will reject. Something like this:

    a = b c = d

would be valid. This opens its own set of problems, such as:

    a = b
    ++
    c

which would most likely be parsed to be equivalent to:

    a = b++; c;

whereas with the spec it is:

    a = b; ++c;

No clue how to fix that, yet.

Cheers, Carl Friedrich
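To make the `++` problem above concrete, here is a minimal Python sketch. It is not the spec's algorithm and not the PyPy parser: the toy grammar (a statement is an optional `name =` followed by `++name`, `name` or `name++`) and the names `tokenize`, `split_statements` and `honour_newlines` are all assumptions made up for illustration. With `honour_newlines=False` it behaves like the lenient grammar, which ignores line breaks; with `honour_newlines=True` it refuses to attach a postfix `++` across a line break, which is the behaviour the spec asks for in this one case.

```python
import re

# Toy token set: names, "=", "++" and ";" only (an assumption for this sketch).
TOKEN_RE = re.compile(r"\s*(\+\+|[A-Za-z_]\w*|=|;)")


def tokenize(src):
    """Return (text, newline_before) pairs for the toy token set."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN_RE.match(src, pos)
        if match is None:
            raise SyntaxError("unexpected character near offset %d" % pos)
        # Remember whether a line break separates this token from the previous one.
        newline_before = "\n" in src[pos:match.start(1)]
        tokens.append((match.group(1), newline_before))
        pos = match.end()
    return tokens


def split_statements(tokens, honour_newlines):
    """Split toy tokens into statements; semicolons between them are optional."""
    statements, i, n = [], 0, len(tokens)
    while i < n:
        if tokens[i][0] == ";":  # explicit semicolons are simply skipped
            i += 1
            continue
        parts = []
        # Optional "name =" prefix.
        if i + 1 < n and tokens[i][0].isidentifier() and tokens[i + 1][0] == "=":
            parts += [tokens[i][0], "="]
            i += 2
        if i + 1 < n and tokens[i][0] == "++" and tokens[i + 1][0].isidentifier():
            parts.append("++" + tokens[i + 1][0])  # prefix increment
            i += 2
        elif i < n and tokens[i][0].isidentifier():
            operand = tokens[i][0]
            i += 1
            # Postfix "++": the lenient grammar always takes it; the
            # newline-aware variant refuses if a line break precedes it.
            if i < n and tokens[i][0] == "++" and not (honour_newlines and tokens[i][1]):
                operand += "++"
                i += 1
            parts.append(operand)
        else:
            raise SyntaxError("unexpected token at index %d" % i)
        statements.append(" ".join(parts))
    return statements


tokens = tokenize("a = b\n++\nc")
print(split_statements(tokens, honour_newlines=False))  # ['a = b++', 'c']
print(split_statements(tokens, honour_newlines=True))   # ['a = b', '++c']
```

The only difference between the two modes is the single newline check on the postfix `++`, which suggests that a lenient grammar needs at least the line-break information attached to each token to have a chance of matching the spec here.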

Florian Schulze wrote:
The part of the spec that describes the automatic semicolon insertion is completely silly, in my opinion. It goes like this:

    When, as the program is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true: ...

This is completely crazy, because it effectively forces you to write a parser using a left-to-right parsing technique with exactly one token of lookahead, by hand (because a lot of parsing frameworks don't allow you to do the customization that is necessary). The lookahead part is what makes packrat parsing fail: a packrat parser uses arbitrarily many tokens of lookahead, so you cannot really determine what an "offending token" is, since you cannot distinguish it from normal backtracking. I don't see how you can fix that, really.

Now you have basically two choices. You can force the user to insert all semicolons, which makes most code out there not parse. The other one is more of a brainstorming idea; it does not work as I describe it, but maybe someone has an idea to get it to work: you could change the grammar to be very lenient with semicolons (at least for a packrat parser this might be easy), meaning that it will accept programs as valid that existing JavaScript engines will reject. Something like this:

    a = b c = d

would be valid. This opens its own set of problems, such as:

    a = b
    ++
    c

which would most likely be parsed to be equivalent to:

    a = b++; c;

whereas with the spec it is:

    a = b; ++c;

No clue how to fix that, yet. Maybe you could do something similar to what Python does: insert newline tokens into the token stream and remove those between matched pairs of parentheses.

Cheers, Carl Friedrich
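The Python-like idea at the end of the message can be sketched roughly as follows. This is only an illustration under assumptions: the token set, the `<NEWLINE>` marker and the function name `tokenize_with_newlines` are hypothetical, only `(` and `)` are tracked, and line endings are assumed to be plain `\n`. The tokenizer emits a newline marker at the end of each logical line and drops line breaks that occur inside matched parentheses, so a parser could then treat the marker as a candidate statement terminator.

```python
import re

NEWLINE = "<NEWLINE>"  # marker that cannot collide with a real token

# Toy token set for this sketch: line breaks, "++", names, integers, punctuation.
TOKEN_RE = re.compile(r"[ \t]*(\n|\+\+|[A-Za-z_]\w*|\d+|[=+\-*/(),;])")


def tokenize_with_newlines(src):
    """Tokenize src, keeping line breaks only outside parentheses."""
    src = src.rstrip(" \t")
    tokens, depth, pos = [], 0, 0
    while pos < len(src):
        match = TOKEN_RE.match(src, pos)
        if match is None:
            raise SyntaxError("unexpected character near offset %d" % pos)
        text = match.group(1)
        pos = match.end()
        if text == "\n":
            # Inside parentheses a line break is plain whitespace. Outside,
            # emit one marker per run of line breaks, and none at the start.
            if depth == 0 and tokens and tokens[-1] != NEWLINE:
                tokens.append(NEWLINE)
            continue
        if text == "(":
            depth += 1
        elif text == ")" and depth > 0:
            depth -= 1
        tokens.append(text)
    return tokens


print(tokenize_with_newlines("f(a,\n  b)\nc = d\n"))
# ['f', '(', 'a', ',', 'b', ')', '<NEWLINE>', 'c', '=', 'd', '<NEWLINE>']
```

This alone does not reproduce the spec's rules, since a line break outside parentheses does not always end a JavaScript statement; it only moves the line-break information into the token stream where the grammar can see it.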

participants (3)
- Carl Friedrich Bolz
- Florian Schulze
- Leonardo Santagada