cssselect tokenizer and parser

Hi, I’ve been reviewing the parser in cssselect and fixing a lot of bugs, as you can see in recent commits[1]. I’m about to make more significant changes to the tokenizer and the parser to fix the underlying problem behind lxml bug #19 [2] and many related bugs. In short, the tokenizer omits most whitespace, including some that represent a descendant combinator. To compensate, the parser just assumes a descendant combinator (sometimes incorrectly) whenever it can not find something else that it expected. So I’m going to touch most of this code, but I’m still a bit uneasy with it. There are many corner cases where it does not matches the grammar in the spec[3]. In particular, the behavior in arguments for functional pseudo-classes it a bit strange, if not plain wrong. I would like to write a tokenizer+parser more closely based on the grammar. Now, I just happen to have in tinycss[4] a CSS tokenizer that already handles all the gory details of backslash-escaping, parsing numbers, etc. I actually have two implementations of it: one in pure Python for compatibility, and one in Cython for speed. (The tokenizer performance is critical when parsing whole stylesheets; perhaps less so for comparatively small selectors.) However, tinycss currently only supports Python 2.6+. So, what do you think of this whole situation? Should we re-use the tinycss tokenizer? Duplicate its code? Try and fix the buggy one? [1] https://github.com/SimonSapin/cssselect/commits/master [2] https://github.com/lxml/lxml/pull/19/files [3] http://www.w3.org/TR/selectors/#w3cselgrammar [4] http://packages.python.org/tinycss/ Regards, -- Simon Sapin

Simon Sapin, 17.04.2012 11:43:
I’ve been reviewing the parser in cssselect and fixing a lot of bugs, as you can see in recent commits[1].
Cool.
I’m about to make more significant changes to the tokenizer and the parser to fix the underlying problem behind lxml bug #19 [2] and many related bugs. In short, the tokenizer omits most whitespace, including some that represent a descendant combinator. To compensate, the parser just assumes a descendant combinator (sometimes incorrectly) whenever it can not find something else that it expected.
So I’m going to touch most of this code, but I’m still a bit uneasy with it. There are many corner cases where it does not matches the grammar in the spec[3]. In particular, the behavior in arguments for functional pseudo-classes it a bit strange, if not plain wrong.
Any fixed bug is one bug less.
I would like to write a tokenizer+parser more closely based on the grammar. Now, I just happen to have in tinycss[4] a CSS tokenizer that already handles all the gory details of backslash-escaping, parsing numbers, etc.
So, you are proposing to replace a part of the parser in cssselect with a part of tinycss? If it makes the tool better, why not. I don't see much use in writing a fix for something when working code is already there. Basically, as long as it works (more) correctly and comes under the same terms of use (read: license), I don't see a problem.
I actually have two implementations of it: one in pure Python for compatibility, and one in Cython for speed. (The tokenizer performance is critical when parsing whole stylesheets; perhaps less so for comparatively small selectors.)
I took a quick look at the Cython based parser. You might get another little boost by typing "css_source" as unicode. Also, is there a reason why you repeatedly throw regexes at the source, instead of using a single regex with multiple groups?
However, tinycss currently only supports Python 2.6+.
I guess the main (or only?) reason for that is the unicode_literals future import, right? I know, it doesn't make the code more beautiful, but you could use constants instead of literals, and wrap them in the obvious u() function that decodes them from the source code encoding in Py2. Apart from that, I can't really see anything obvious that wouldn't work in 2.4. Stefan

Le 17/04/2012 20:43, Stefan Behnel a écrit :
Simon Sapin, 17.04.2012 11:43:
I would like to write a tokenizer+parser more closely based on the grammar. Now, I just happen to have in tinycss[4] a CSS tokenizer that already handles all the gory details of backslash-escaping, parsing numbers, etc.
So, you are proposing to replace a part of the parser in cssselect with a part of tinycss? If it makes the tool better, why not. I don't see much use in writing a fix for something when working code is already there.
Basically, as long as it works (more) correctly and comes under the same terms of use (read: license), I don't see a problem.
The license is the same (BSD). Now that the descendant selector bug are fixed (unless I missed something) the remaining issues that I see are: 1. The current tokenizer for Symbol uses something like the '\w' regex, while a CSS IDENT token can contain any non-ASCII character (including U+00A0 no-break space, for example), can have backslash-escapes but can not start with a digit. 2. Unicode white space (like U+00A0) counts as white space (either ignored or a descendant combinator) but should not (related to 1) 3. 2n+1 or similar strings (arguments to :nth-child()) are tokenized as Symbol objects, and are then accepted by the parser as element types, class names, IDs, etc. I *think* that any valid (for CSS) selector that only uses ASCII without backslash-escapes should be fine now, so maybe this is not really a problem ...
I actually have two implementations of it: one in pure Python for compatibility, and one in Cython for speed. (The tokenizer performance is critical when parsing whole stylesheets; perhaps less so for comparatively small selectors.)
I took a quick look at the Cython based parser. You might get another little boost by typing "css_source" as unicode.
I’ll try that.
Also, is there a reason why you repeatedly throw regexes at the source, instead of using a single regex with multiple groups?
Everything in a single big regex, and then see which groups are None to detect the token types? I hadn’t though about doing that, but I’ll try it. There would still be one call to .match() for each token, right? The spec has one regex for each token type, and the initial implementation was a very straight forward translation of that. (Although I don’t use Yacc.)
However, tinycss currently only supports Python 2.6+.
I guess the main (or only?) reason for that is the unicode_literals future import, right? I know, it doesn't make the code more beautiful, but you could use constants instead of literals, and wrap them in the obvious u() function that decodes them from the source code encoding in Py2. Apart from that, I can't really see anything obvious that wouldn't work in 2.4.
Yes, it could be ported. There just hasn’t been any interest in doing so before. -- Simon Sapin

On 17 April 2012 10:43, Simon Sapin <simon.sapin@kozea.fr> wrote:
Hi,
I’ve been reviewing the parser in cssselect and fixing a lot of bugs, as you can see in recent commits[1].
I’m about to make more significant changes to the tokenizer and the parser to fix the underlying problem behind lxml bug #19 [2] and many related bugs. In short, the tokenizer omits most whitespace, including some that represent a descendant combinator. To compensate, the parser just assumes a descendant combinator (sometimes incorrectly) whenever it can not find something else that it expected.
So I’m going to touch most of this code, but I’m still a bit uneasy with it. There are many corner cases where it does not matches the grammar in the spec[3]. In particular, the behavior in arguments for functional pseudo-classes it a bit strange, if not plain wrong.
I would like to write a tokenizer+parser more closely based on the grammar. Now, I just happen to have in tinycss[4] a CSS tokenizer that already handles all the gory details of backslash-escaping, parsing numbers, etc. I actually have two implementations of it: one in pure Python for compatibility, and one in Cython for speed. (The tokenizer performance is critical when parsing whole stylesheets; perhaps less so for comparatively small selectors.) However, tinycss currently only supports Python 2.6+.
In that case, would it make sense to just fold the cssselect functionality into tinycss itself? Admittedly that is more code to make 2.4 compatible. Laurence

Le 17/04/2012 22:13, Laurence Rowe a écrit :
In that case, would it make sense to just fold the cssselect functionality into tinycss itself? Admittedly that is more code to make 2.4 compatible.
Yes, merging the two projects is an option. Especially if they mutually depend on each other. I’ll keep the somewhat-broken tokenizer for now and see how much of a problem it is in practice. (I had been using the "old" lxml.cssselect for some time already after all.) -- Simon Sapin
participants (3)
-
Laurence Rowe
-
Simon Sapin
-
Stefan Behnel