Mailman 3 Suggestions for cssselect - lxml - The Python XML Toolkit

Suggestions for cssselect

Simon Sapin

March 26, 2012

7:40 a.m.

Hi,

I would like to make some changes in cssselect (and provide patches):

1. Change :link, :visited, :target, :hover, :active and :focus to never match (silently, or maybe with a warning?) instead of raising. They could translate to XPath [false]

2. Make :checked HTML-specific. (The implementation already is.) This would involve having a document_language parameter which would default to HTML in HTMLElement.cssselect and plain XML otherwise. :checked would raise on non-HTML documents. Having this parameter could eventually allow a language-specific (as in HTML vs. XML) implementation of :lang().

3. Implement :enabled and :disabled similarly to :checked. Assuming 2, they would also be HTML-specific.

(The docs mention :unchecked and :indeterminate but there is no such thing in the code or the selectors3 spec.)

What do you think?

Regards,
--
Simon Sapin

Stefan Behnel

March 2012

7:22 a.m.

Simon Sapin, 26.03.2012 16:40:

...

I would like to make some changes in cssselect (and provide patches):

1. Change :link, :visited, :target, :hover, :active and :focus to never match (silently, or maybe with a warning?) instead of raising. They could translate to XPath [false]

I think it would be good to make this configurable. A keyword argument like "ignore_link_classes" or "ignore_html_status_classes" could work here. Or maybe let the user pass the specific pseudo-classes as an argument "matching_html_status_classes"? By default, they would raise an exception in the parser (as they do now, right?). Then, the user can either pass True/False to set all of them to match or to not match, or pass a set/tuple/list of specific names that should match, thus automatically setting the remaining ones to not match.

...

2. Make :checked HTML-specific. (The implementation already is.) This would involve having a document_language parameter which would default to HTML in HTMLElement.cssselect and plain XML otherwise. :checked would raise on non-HTML documents. Having this parameter could eventually allow a language-specific (as in HTML vs. XML) implementation of :lang().

"checked" basically just maps to an attribute value, but even the spec is definitely rather HTML-focused. That makes it hopeless to implement this "correctly" in a "generic" way.

I like the "document_language" parameter, because it would also allow supporting other XML languages in the future. OTOH, we could be even more extensible by allowing users to pass in an arbitrary XPath condition for a given pseudo-class, as a plain string. Example:

    pseudo_classes = dict(hover='contains(., "active")')
    CSSSelect("a:hover", class_expressions=pseudo_classes)
    -> //a[contains(., "active")]

In that context, the "document_language" option would simply select a specific configuration for these expressions. And we could also support rejecting a specific pseudo-class by passing a singleton value, e.g.

    pseudo_classes = dict(hover=cssselect.REJECT)

and that would raise an exception in the parser. The REJECT could also be an arbitrary function that would either return an expression string or raise an error. Something like that...

...

3. Implement :enabled and :disabled similarly to :checked. Assuming 2, they would also be HTML-specific.

Yes, same case.
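
The class_expressions idea from the reply to point 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: REJECT, RejectedPseudoClass, pseudo_class_condition and simple_selector_to_xpath are hypothetical names, nothing here is an existing cssselect API, and the toy translation only handles "element:pseudo" selectors.

```python
# Hypothetical sentinel standing in for the proposed cssselect.REJECT.
REJECT = object()


class RejectedPseudoClass(ValueError):
    """Raised when a pseudo-class is unknown or explicitly rejected."""


def pseudo_class_condition(name, class_expressions):
    """Return the XPath condition configured for a pseudo-class name."""
    try:
        expression = class_expressions[name]
    except KeyError:
        raise RejectedPseudoClass("unknown pseudo-class :%s" % name)
    if expression is REJECT:
        raise RejectedPseudoClass("pseudo-class :%s is rejected" % name)
    if callable(expression):
        # A callable may build the expression string or raise by itself.
        expression = expression(name)
    return expression


def simple_selector_to_xpath(selector, class_expressions):
    """Translate 'element:pseudo' (at most one pseudo-class) to XPath."""
    element, _, pseudo = selector.partition(":")
    xpath = "//" + (element or "*")
    if pseudo:
        xpath += "[%s]" % pseudo_class_condition(pseudo, class_expressions)
    return xpath


pseudo_classes = dict(hover='contains(., "active")', visited=REJECT)
print(simple_selector_to_xpath("a:hover", pseudo_classes))
```

With this shape, a "document_language" option would just pick one predefined mapping instead of the user passing a dict by hand.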

...

(The docs mention :unchecked and :indeterminate but there is no such thing in the code or the selectors3 spec.)

At least ":indeterminate" is in the spec:

http://www.w3.org/TR/selectors/#UIstates

And ":unchecked" makes sense if you allow ":checked". AFAICT, CSS selectors don't support the concept of an operator negation.

Stefan

Simon Sapin

9:32 a.m.

Quick reactions to a few points. The rest is interesting, but requires more attention than I have right now. I’ll give it more thought.

On 31/03/2012 16:22, Stefan Behnel wrote:

...

At least ":indeterminate" is in the spec:

http://www.w3.org/TR/selectors/#UIstates

I guess it was in a draft and then removed. The current rec has a section for it, but it says:

"A future version of this specification may introduce an :indeterminate pseudo-class that applies to such elements."

...

And ":unchecked" makes sense if you allow ":checked". AFAICT, CSS selectors don't support the concept of an operator negation.

Level 3 has :not(...), and it is implemented in cssselect.

http://www.w3.org/TR/selectors/#negation

:not(:checked) should work, although it also selects "uncheckable" elements, not just "checkable" elements that are currently not checked.

Regards,
--
Simon Sapin

Laurence Rowe

5:21 p.m.

On 31 March 2012 15:22, Stefan Behnel <stefan_ml@behnel.de> wrote:

...

I too agree that the whole css_to_xpath translation needs to be configurable. I made an attempt at this in https://github.com/lrowe/lxml/tree/cssselect-match-pattern (which also incorporates case insensitive :contains() using regex) using an options dict that got passed down through the various functions. It felt a little messy, but I think it would probably work well enough. With that it would be possible to optionally use exslt's str:tokenize for faster @class parsing or pass hints that there are <xsl:key> indexes for id/class/tagname to really speed things up when running in XSLT. (I then got a little sidetracked attempting to refactor it all to use ElementTree intermediate representations of the CSS and XPath expressions, but then you need to write the transform between the two representations and I dropped down the rabbit hole of writing a Python/ElementTree XSLT alternative...) Laurence

Simon Sapin

3:46 p.m.

On 26/03/2012 16:40, Simon Sapin wrote:

...

Actually, :link for HTML should translate to CSS a[href]

Stefan Behnel

10:30 p.m.

Simon Sapin, 01.04.2012 00:46:

...

Hmm, good idea. As should :visited, right? The :target class is a bit trickier. It might require a tag with an ID or on a[@name], but that sounds like it would hit way too many 'targets'. We may allow passing an actual URL into the evaluation and prepare for that with an XPath variable. Stefan

Simon Sapin

April 2012

8:43 a.m.

Hi,

I’m taking things a bit out-of-order here, sorry about that.

Overridable / configurable translation to XPath
===============================================

Currently, the translation to XPath is tied to the selector parsing as it happens in xpath() methods of the parsed objects. I suggest that the first thing to do is to separate the implementation at least for the various pseudo-classes into a "translation" class/object.

For example, instead of (roughly):

    method = '_xpath_' + self.ident.replace('-', '_')
    method = getattr(self, method)

Pseudo.xpath() would contain:

    method = 'pseudo_class_' + self.ident.replace('-', '_')
    method = getattr(translation, method)

... and similar changes for functional pseudo-classes, maybe combinators, etc.

There would be a default, but the user could provide a different translation that overrides or adds pseudo-classes.

This mechanism could also enable different translations for different document languages. For example, we could decide that links do not exist in "generic" XML (and :link and :visited would never match) but links in HTML are a elements with an href attribute.

A separated and overridable translation also allows the user to make different choices for everything else we’re discussing here. I think this is important as it means that more complex use cases can get by without forking or monkey-patching cssselect, whatever we decide here.

Links
=====

:link is actually for "links that have not yet been visited", so it is mutually exclusive with :visited. ":link, :visited" matches all links. In HTML, that is equivalent to "a[href]".

I think that "nothing is visited" is a sane default. The pseudo-classes would then translate (CSS to XPath) as:

    :link → a[@href]
    :visited → [false] (or equivalent).

If we want to really implement :visited, there could be a user-provided function/callable that takes a URL and returns True for visited, False for not visited. Conveniently, that callable could be some_set.__contains__ or similar. If such a callable is provided, the translations would be:

    :link → a[@href and not url_is_visited(@href)]
    :visited → a[@href and url_is_visited(@href)]

Namespaces for functions vs. elements
=====================================

By the way, is it possible to pass a "one time" user function for XPath? (ie. visible from only one compiled XPath object.) etree.FunctionNamespace looks like it registers functions globally.

(Visited links could arguably be the same globally, but I was also thinking of using functions for implementing :lang(), which is also based on external information like HTTP headers.)

Also, is the prefix->URI mapping the same for elements and XPath functions? If I want to correctly implement @namespace, any prefix could be used in selectors. I don’t want these to overlap with function namespaces.

Rejecting implemented selectors
===============================

The spec is clear that unsupported selectors (say, the level 4 :matches() that we have not implemented yet) are invalid, and the whole (comma-separated) group of selectors should be invalid. Raising an exception for this (as it is done currently) is good.

But once we *do* have an implementation for some pseudo-class, is there a reason to opt out of it and make it invalid again? Stefan suggested a REJECT marker, but I don’t understand the use case for this.

Also, I think that never matching is a sane fallback for many pseudo-classes. (Eg: unless otherwise specified, there is no link and nothing is hovered.) But I’m not sure that making any of these always match makes sense. Is there a use case?

HTML
====

:checked has a precise definition in the spec in terms of HTML, but that is only an example. The actual definition is more general than that (anything that can be toggled "on" by the user.) We could say that there is no such element in "generic" XML, so :checked could never match (but still be valid). Again, overridable translation could have a different implementation for both, and allow yet another one for another document language.

Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml? lxml would still use it to provide convenience methods like HTMLElement.cssselect.

Regards,
--
Simon Sapin

Stefan Behnel

7:36 a.m.

Hi Simon, Simon Sapin, 02.04.2012 17:43:

...

Overridable / configurable translation to XPath
===============================================

Currently, the translation to XPath is tied to the selector parsing as it happens in xpath() methods of the parsed objects. I suggest that the first thing to do is to separate the implementation at least for the various pseudo-classes into a "translation" class/object.

For example, instead of (roughly):

    method = '_xpath_' + self.ident.replace('-', '_')
    method = getattr(self, method)

Pseudo.xpath() would contain:

    method = 'pseudo_class_' + self.ident.replace('-', '_')
    method = getattr(translation, method)

... and similar changes for functional pseudo-classes, maybe combinators, etc.

...

There would be a default, but the user could provide a different translation that overrides or adds pseudo-classes.

This mechanism could also enable different translations for different document languages. For example, we could decide that links do not exist in "generic" XML (and :link and :visited would never match) but links in HTML are a elements with an href attribute.

A separated and overridable translation also allows the user to make different choices for everything else we’re discussing here. I think this is important as it means that more complex use cases can get by without forking or monkey-patching cssselect, whatever we decide here.

Basically, this enables plain old subtyping to hook into the translation. Totally makes sense to me.
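
A minimal sketch of what hooking in by subtyping might look like. The class names GenericTranslator and HTMLTranslator, the xpath_pseudo method, and the returned XPath conditions are assumptions made for illustration; only the 'pseudo_class_' name-based dispatch comes from the message above.

```python
class GenericTranslator(object):
    """Hypothetical default translation for "generic" XML documents."""

    def xpath_pseudo(self, ident):
        # Same name-based dispatch as in the proposal, but the method is
        # looked up on the translator, not on the parsed selector object.
        method_name = 'pseudo_class_' + ident.replace('-', '_')
        method = getattr(self, method_name, None)
        if method is None:
            raise ValueError('The pseudo-class :%s is unknown' % ident)
        return method()

    def pseudo_class_first_child(self):
        return 'count(preceding-sibling::*) = 0'

    def pseudo_class_link(self):
        # Assumption: generic XML has no links, so this never matches.
        return 'false()'

    def pseudo_class_visited(self):
        # "Nothing is visited" as the default.
        return 'false()'


class HTMLTranslator(GenericTranslator):
    """Hypothetical HTML flavour: links are a elements with an href."""

    def pseudo_class_link(self):
        return 'self::a and @href'


print(HTMLTranslator().xpath_pseudo('link'))
print(GenericTranslator().xpath_pseudo('first-child'))
```

A user needing different behaviour would subclass one of these and override or add pseudo_class_* methods, without touching the parser.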

...

Links
=====

:link is actually for "links that have not yet been visited", so it is mutually exclusive with :visited. ":link, :visited" matches all links. In HTML, that is equivalent to "a[href]".

I think that "nothing is visited" is a sane default. The pseudo-classes would then translate (CSS to XPath) as:

    :link → a[@href]
    :visited → [false] (or equivalent).

If we want to really implement :visited, there could be a user-provided function/callable that takes a URL and returns True for visited, False for not visited. Conveniently, that callable could be some_set.__contains__ or similar. If such a callable is provided, the translations would be:

    :link → a[@href and not url_is_visited(@href)]
    :visited → a[@href and url_is_visited(@href)]

Sure. That could just be a method in the translator that would either return None (by default) for a given pseudo class, or the prefixed name of a user provided function when overridden. The method that calls it would then either generate "[false]" or the expression above, and users could also override that outer method directly to get a totally different behaviour when they need it.
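
The two-level hook described here might look like the sketch below. All names (LinkTranslator, visited_function_name, the "_css" prefix) are hypothetical, and the generated strings follow the a[...] shorthand used in this thread. Note that in real XPath, negation is the not() function and the never-matching predicate is false().

```python
class LinkTranslator(object):
    """Hypothetical translator part that handles :link and :visited."""

    def visited_function_name(self):
        # Inner hook: return the prefixed name of a user-provided XPath
        # function, or None when no such function is available.
        return None

    def pseudo_class_link(self):
        # Outer method: users could also override this one directly.
        function = self.visited_function_name()
        if function is None:
            return 'a[@href]'
        return 'a[@href and not(%s(@href))]' % function

    def pseudo_class_visited(self):
        function = self.visited_function_name()
        if function is None:
            # "Nothing is visited": a step that never matches.
            return '*[false()]'
        return 'a[@href and %s(@href)]' % function


class HistoryAwareTranslator(LinkTranslator):
    def visited_function_name(self):
        # Assumption: the caller registered this function under "_css".
        return '_css:url_is_visited'


print(LinkTranslator().pseudo_class_visited())
print(HistoryAwareTranslator().pseudo_class_link())
```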

...

Namespaces for functions vs. elements
=====================================

By the way, is it possible to pass a "one time" user function for XPath? (ie. visible from only one compiled XPath object.) etree.FunctionNamespace looks like it registers function globally.

Yes, that's the idea. However, you can pass functions into the XPath constructor. Assuming you won't need this for thousands of different functions, you can just create a couple of XPath instances with different configurations.

...

(Visited links could arguably be the same globally, but I was also thinking of using functions for implementing :lang(), which is also based on external information like HTTP headers.)

You can pass variables into the evaluator and use an XPath expression like this:

    match_urls = XPath('//a[contains(@href, $pattern)]')
    results = match_urls(element, pattern="'://lxml.de/'")

The same works for user provided functions, obviously.

...

Also, is the prefix->URI mapping the same for elements and XPath functions? If I want to correctly implement @namespace, any prefix could be used in selectors. I don’t want these to overlap with function namespaces.

There could be a callback method in the translator that returns the mapping, or that looks up a prefix for a given namespace URI. Overriding that would allow users to modify the mapping in case of collisions. Apart from that, I'd just use prefix names that start with an underscore, maybe even "__lxml_ABC". That should be rare enough out there.
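
As an illustration of the collision handling suggested here, a prefix lookup could pick the first underscore-prefixed name that the selector's own prefixes do not use. The function name and the "__lxml_css" base are assumptions for this sketch, not anything lxml or cssselect provides.

```python
def pick_function_prefix(used_prefixes, base='__lxml_css'):
    """Return a prefix for XPath functions that is not in used_prefixes."""
    candidate = base
    counter = 0
    while candidate in used_prefixes:
        # Extremely unlikely in practice, but stay correct on collision.
        counter += 1
        candidate = '%s%d' % (base, counter)
    return candidate


# Prefixes declared through @namespace rules in a stylesheet, for example.
print(pick_function_prefix(set(['html', 'svg'])))
```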

...

Rejecting implemented selectors
===============================

The spec is clear that unsupported selectors (say, the level 4 :matches() that we have not implemented yet) are invalid, and the whole (comma-separated) group of selectors should be invalid. Raising an exception for this (as it is done currently) is good.

+1, helps with future compatibility and makes it clear to users what is currently supported.

...

But once we *do* have an implementation for some pseudo-class, is there a reason to opt out of it and make it invalid again? Stefan suggested a REJECT marker, but I don’t understand the use case for this.

That was just an idea that would allow users to make sure a given marker never matches. It's a lot more generic to use a dedicated translator class.

...

Also, I think that never matching is a sane fallback for many pseudo-classes.

Definitely for those that do not have one obvious meaning.

...

(Eg: unless otherwise specified, there is no link and nothing is hovered.) But I’m not sure that making any of these always match makes sense. Is there a use case?

To be answered by others.

...

HTML
====

:checked has a precise definition in the spec in terms of HTML, but that is only an example. The actual definition is more general than that (anything that can be toggled "on" by the user.) We could say that there is no such element in "generic" XML, so :checked could never match (but still be valid). Again, overridable translation could have a different implementation for both, and allow yet another one for another document language.

Absolutely.

...

Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml?

Makes sense. It seems to be the one part of lxml that attracts serious external interest lately, and I can definitely see it being useful outside of lxml itself. Are you volunteering to take it over? Note that backwards compatibility is quite important, but it seems you are aware of that.

...

lxml would still use it to provide convenience methods like HTMLElement.cssselect.

Yes, it would just be a conditional import away and otherwise raise an exception at call time when the dependency is missing. Stefan
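
The "conditional import, raise at call time" pattern mentioned above can be sketched generically. OptionalDependency is a hypothetical helper, not lxml code; it only shows that the import failure is swallowed at load time and reported when the feature is actually used.

```python
import importlib


class OptionalDependency(object):
    """Import a module if available; complain only when it is used."""

    def __init__(self, module_name):
        self.module_name = module_name
        try:
            self.module = importlib.import_module(module_name)
        except ImportError:
            # Do not fail here: the host package must stay importable.
            self.module = None

    def require(self):
        if self.module is None:
            raise ImportError(
                '%s is required for this feature' % self.module_name)
        return self.module


# At module level, next to the other imports:
_cssselect = OptionalDependency('cssselect')


def css_support_available():
    """Tell callers whether the convenience method can work."""
    return _cssselect.module is not None
```

A convenience method such as HTMLElement.cssselect would then call _cssselect.require() first, so a missing package surfaces as an ImportError at call time only.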

Simon Sapin

11:58 a.m.

On 06/04/2012 16:36, Stefan Behnel wrote:

...

Yes. My full time job currently consists mostly in developing WeasyPrint. I made tinycss as a part of *that*, and I can also allocate some time for cssselect. I also think I understand cssselect’s code fairly well.

If you think this is the way to go, I can make a cssselect project on Github and PyPI that is (at first) the same as the current lxml.cssselect, only at a different import name. (I’ll try to preserve the commit history.) I can give push access there to whoever is interested.

Then I’ll separate the XPath translation into a class as described in the previous message.

From there we can discuss more precisely what to do and how.

How does this all sound?

As to backward-compatibility: I think that adding support for selectors that were previously not supported is not a problem. But more generally, we should decide what is:

* Part of the public API, has backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

--
Simon Sapin

Stefan Behnel

2:46 a.m.

Simon Sapin, 08.04.2012 20:58:

...

On 06/04/2012 16:36, Stefan Behnel wrote:

...
...
Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml?

Makes sense. It seems to be the one part of lxml that attracts serious external interest lately, and I can definitely see it being useful outside of lxml itself.

Are you volunteering to take it over? Note that backwards compatibility is quite important, but it seems you are aware of that.

Yes. My full time job currently consists mostly in developing WeasyPrint. I made tinycss as a part of *that*, and I can also allocate some time for cssselect. I also think I understand cssselect’s code fairly well.

If you think this is the way to go, I can make a cssselect project on Github and PyPI

Please do. "cssselect" sounds like a good enough name to me. There's also the experimental fork by Laurence Rowe: http://pypi.python.org/pypi/experimental.cssselect You can find my comments on his changes in lxml's github issues list.

...

that is (at first) the same as the current lxml.cssselect, only at a different import name. (I’ll try to preserve the commit history.)

Hg, at least, has the convert extension, which allows selectively cloning repositories and moving files around while doing so. I'd expect git to have something similar. Ask back if you need help with this.

...

Then I’ll separate the XPath translation into a class as described in the previous message.

From there we can discuss more precisely what to do and how.

How does this all sound?

Sounds good to me.

...

As to backward-compatibility: I think that adding support for selectors that were previously not supported is not a problem.

Sure.

...

But more generally, we should decide what is:

* Part of the public API, has backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

In what way does it rely on them? I'm more concerned with the "obviously" public API and the conversion semantics. It would be bad to break a currently working CSS selector expression without a good reason, for example. Personally, I think the implementation of the parser is subject to change at any time, e.g. adding a new argument to the methods to pass down a context object is fine. The new way to extend the generation will be the right way to do it in the future, so any hacks that currently try to hook into it in undocumented ways are IMHO not worth keeping alive if that causes hassle. It will be easier to do the same things with the new architecture (and if it's not, it's the new architecture that should be extended, instead of keeping quirks from the old one). That being said, I will happily tell everyone who complains to point their gun at you instead of me once you've taken over maintainership. :-) Stefan

Simon Sapin

4:28 a.m.

On 10/04/2012 11:46, Stefan Behnel wrote:

...

There's also the experimental fork by Laurence Rowe:

http://pypi.python.org/pypi/experimental.cssselect

You can find my comments on his changes in lxml's github issues list.

I’ll look at it once I have something equivalent to lxml’s current master with passing tests.

...

...
...
But more generally, we should decide what is:

* Part of the public API, has backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

In what way does it rely on them?

It manipulates the parsed objects (Pseudo, CombinedSelector, etc.) to:

* Calculate the specificity
* Split out pseudo-elements

But I could move both of these into cssselect and avoid the question.
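
For readers unfamiliar with the first item, specificity in Selectors Level 3 is a triple (ids, classes/attributes/pseudo-classes, element names/pseudo-elements). The sketch below computes it over a deliberately simplified, hypothetical representation (a flat list of (kind, name) pairs); it is not how tinycss or cssselect represent parsed selectors, and it does not handle :not().

```python
def specificity(components):
    """Return the (a, b, c) specificity of a flat list of components.

    Each component is a (kind, name) pair, where kind is one of 'id',
    'class', 'attribute', 'pseudo-class', 'element', 'pseudo-element'.
    """
    ids = classes = elements = 0
    for kind, name in components:
        if kind == 'id':
            ids += 1
        elif kind in ('class', 'attribute', 'pseudo-class'):
            classes += 1
        elif kind in ('element', 'pseudo-element'):
            # The universal selector does not count.
            if name != '*':
                elements += 1
        else:
            raise ValueError('unknown component kind: %r' % (kind,))
    return (ids, classes, elements)


# li.red.level
print(specificity([('element', 'li'), ('class', 'red'), ('class', 'level')]))
```

Tuples compare element by element, so sorting rules by this value gives the cascade order for selectors of equal origin and importance.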

...

I'm more concerned with the "obviously" public API and the conversion semantics. It would be bad to break a currently working CSS selector expression without a good reason, for example.

Personally, I think the implementation of the parser is subject to change at any time, e.g. adding a new argument to the methods to pass down a context object is fine. The new way to extend the generation will be the right way to do it in the future, so any hacks that currently try to hook into it in undocumented ways are IMHO not worth keeping alive if that causes hassle. It will be easier to do the same things with the new architecture (and if it's not, it's the new architecture that should be extended, instead of keeping quirks from the old one).

Agreed.

...

That being said, I will happily tell everyone who complains to point their gun at you instead of me once you've taken over maintainership. :-)

Fair enough :) -- Simon

Simon Sapin

4:55 a.m.

New subject: cssselect: Python 2.4

I started something here but it is very much in progress: https://github.com/SimonSapin/cssselect Quick question: how important is Python 2.4 support for cssselect? 2.6 makes many things easier, especially for supporting 3.x with the same code base. Regards, -- Simon

Stefan Behnel

5:16 a.m.

New subject: cssselect: Python 2.4

Simon Sapin, 11.04.2012 13:55:

...

It worked before - why break it now?

...

2.6 makes many things easier, especially for supporting 3.x with the same code base.

That argument seems to stick in people's heads way too easily. Sure, it's /easier/, but it's not /difficult/ to stay compatible with older Python versions. There is usually a bit of boilerplate code involved, but once that's there, most things come pretty much for free. For example, you won't run into byte/unicode issues because everything is always going to be Unicode or plain ASCII characters. And since you won't use non-ASCII characters in the source code, you won't have a syntax issue with the 'u' prefix. Getting at exception objects is a bit more involved, and you can't use the "with" statement. Well. Both do not really make up the bulk of the code. In any case, going straight for 2.6 is way overshooting it. Even the latest Django release still supports Py2.5. Stefan
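
The two pieces of boilerplate mentioned above (getting at exception objects, and doing without "with") can be written so that the same source is valid on old 2.x versions and on 3.x. This is a small sketch with made-up function names, not code from cssselect.

```python
import sys


def parse_int(text):
    """Return (value, None) on success or (None, error message)."""
    try:
        return int(text), None
    except ValueError:
        # "except ValueError as e" needs Python 2.6+, and the older
        # "except ValueError, e" is a syntax error on 3.x, so fetch the
        # exception object from sys.exc_info() instead.
        error = sys.exc_info()[1]
        return None, str(error)


def read_first_line(path):
    """Read one line without the "with" statement."""
    handle = open(path)
    try:
        return handle.readline()
    finally:
        # try/finally does what "with open(...)" would do on newer versions.
        handle.close()


print(parse_int('42'))
```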

Simon Sapin

5:53 a.m.

New subject: cssselect: Python 2.4

On 11/04/2012 14:16, Stefan Behnel wrote:

...

Ok. I generally don’t bother with 2.5 or 2.4 for new projects, but as you said, it’s already there. I’ll make sure that cssselect’s tests pass in Python 2.4 and up. -- Simon

Stefan Behnel

March 2012

7:22 a.m.

Simon Sapin, 26.03.2012 16:40:

...

I would like to make some changes in cssselect (and provide patches):

1. Change :link, :visited, :target, :hover, :active and :focus to never match (silently, or maybe with a warning?) instead of raising. They could translate to XPath [false]

...

2. Make :checked HTML-specific. (The implementation already is.) This would involve having a document_language parameter which would default to HTML in HTMLElement.cssselect and plain XML otherwise. :checked would raise on non-HTML documents. Having this parameter could eventually allow a language-specific (as in HTML vs. XML) implementation of :lang().

...

3. Implement :enabled and :disabled similarly to :checked. Assuming 2, they would also be HTML-specific.

Yes, same case.

...

(The docs mention :unchecked and :indeterminate but there is no such thing in the code or the selectors3 spec.)

Simon Sapin

9:32 a.m.

Quick reactions to a few points. The rest is interesting, but requires more attention than I have right now. I’ll give it more thought. Le 31/03/2012 16:22, Stefan Behnel a écrit :

...

At least ":indeterminate" is in the spec:

http://www.w3.org/TR/selectors/#UIstates

...

And ":unchecked" makes sense if you allow ":checked". AFAICT, CSS selectors don't support the concept of an operator negation.

Laurence Rowe

5:21 p.m.

On 31 March 2012 15:22, Stefan Behnel <stefan_ml@behnel.de> wrote:

...

Simon Sapin

3:46 p.m.

Le 26/03/2012 16:40, Simon Sapin a écrit :

...

Actually, :link for HTML should translate to CSS a[href]

Stefan Behnel

10:30 p.m.

Simon Sapin, 01.04.2012 00:46:

...

Simon Sapin

April 2012

8:43 a.m.

Stefan Behnel

April 2012

7:36 a.m.

Hi Simon, Simon Sapin, 02.04.2012 17:43:

...

Overridable / configurable translation to XPath ===============================================

Currently, the translation to XPath is tied to the selector parsing as it happens in xpath() methods of the parsed objects. I suggest that the first thing to do is to separate the implementation at least for the various pseudo-classes into a "translation" class/object.

For example, instead of (roughly):

method = '_xpath_' + self.ident.replace('-', '_') method = getattr(self, method)

Pseudo.xpath() would contain:

method = 'pseudo_class_' + self.ident.replace('-', '_') method = getattr(translation, method)

... and similar changes for functional pseudo-classes, maybe combinators, etc.

...

There would be a default, but the user could provide a different translation that overrides or adds pseudo-classes.

This mechanism could also enable different translations for different document languages. For example, we could decide that links do not exist in "generic" XML (and :link and :visited would never match) but links in HTML are a elements with an href attribute.

A separated and overridable translation also allows the user to make different choices for everything else we’re discussing here. I think this is important as it means that more complex use cases can get by without forking or monkey-patching cssselect, whatever we decide here.

Basically, this enables plain old subtyping to hook into the translation. Totally makes sense to me.

...

Links =====

:link is actually for "links that have not yet been visited", so it is mutually exclusive with :visited. ":link, :visited" matches all links. In HTML, that is equivalent to "a[href]".

I think that "nothing is visited" is a sane default. The pseudo-classes would then translate (CSS to XPath) as:

:link → a[@href] :visited → [false] (or equivalent).

If we want to really implement :visited, there could be a user-provided function/callable that takes an URL and returns True for visited, False for not visited. Conveniently, that callable could be some_set.__contains__ or similar. If such a callable is provided, the translations would be:

:link → a[@href and not url_is_visited(@href)] :visited → a[@href and url_is_visited(@href)]

...

Namespaces for functions vs. elements =====================================

By the way, is it possible to pass a "one time" user function for XPath? (ie. visible from only one compiled XPath object.) etree.FunctionNamespace looks like it registers function globally.

...

(Visited links could arguably be the same globally, but I was also thinking of using functions for implementing :lang(), which is also based on external information like HTTP headers.)

...

Also, is the prefix->URI mapping the same for elements and XPath functions? If I want to correctly implement @namespace, any prefix could be used in selectors. I don’t want these to overlap with function namespaces.

...

Rejecting implemented selectors ===============================

The spec is clear that unsupported selectors (say, the level 4 :matches() that we have not implemented yet) are invalid, and the whole (comma-separated) group of selectors should be invalid. Raising an exception for this (as it is done currently) is good.

+1, helps with future compatibility and makes it clear to users what is currently supported.

...

But once we *do* have an implementation for some pseudo-class, is there a reason to opt out of it and make it invalid again? Stephan suggested a REJECT marker, but I don’t understand the use case for this.

That was just an idea that would allow users to make sure a given marker never matches. It's a lot more generic to use a dedicated translator class.

...

Also, I think that never matching is a sane fallback for many pseudo-classes.

Definitely for those that do not have one obvious meaning.

...

(Eg: unless otherwise specified, there is no link and nothing is hovered.) But I’m not sure that making any of these always match makes sense. Is there a use case?

To be answered by others.

...

HTML ====

:checked has a precise definition in the spec in terms of HTML, but that is only an example. The actual definition is more general than that (anything that can be toggled "on" by the user.) We could say that there is no such element in "generic" XML, so :checked could never match (but still be valid). Again, overridable translation could have a different implementation for both, and allow yet another one for a another document language.

Absolutely.
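
As an illustration of the overridable translation discussed above, here is a minimal Python sketch. The class names, the `condition_for` dispatcher, the `xpath_<name>_pseudo` naming scheme and the exact XPath strings are all assumptions made for this example and are not the existing lxml.cssselect API. It shows a generic translator where the state-dependent pseudo-classes are valid but never match, an HTML subclass that overrides :checked, and an exception for anything that is not implemented at all.

```python
class GenericTranslatorSketch(object):
    """Hypothetical translator for "generic" XML documents."""

    # Pseudo-classes that are valid but never match unless a subclass
    # provides a more specific translation.
    NEVER_MATCHING = frozenset(
        ["link", "visited", "target", "hover", "active", "focus", "checked"])

    def condition_for(self, name):
        """Return an XPath 1.0 condition for the pseudo-class `name`."""
        # Look for an overriding method such as xpath_checked_pseudo().
        handler = getattr(
            self, "xpath_%s_pseudo" % name.replace("-", "_"), None)
        if handler is not None:
            return handler()
        if name in self.NEVER_MATCHING:
            return "false()"  # valid selector, but matches nothing
        # Not implemented at all: the selector is invalid.
        raise ValueError("Unsupported pseudo-class :%s" % name)


class HTMLTranslatorSketch(GenericTranslatorSketch):
    """Hypothetical translator for HTML documents."""

    def xpath_checked_pseudo(self):
        # Assumed HTML meaning of :checked: selected options and
        # checked checkboxes or radio buttons.
        return ("(@selected and name(.) = 'option') or "
                "(@checked and name(.) = 'input' and "
                "(@type = 'checkbox' or @type = 'radio'))")
```

With this layout, supporting another document language would mean adding another subclass rather than passing flags to a single translator.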

...

Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml?

...

lxml would still use it to provide convenience methods like HTMLElement.cssselect.

Yes, it would just be a conditional import away and otherwise raise an exception at call time when the dependency is missing.

Stefan
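
The conditional import described above could look roughly like the following sketch. The helper names (`optional_import`, `require`, `cssselect_method`) are hypothetical, and the package name `cssselect` as well as its `CSSSelector` class are assumptions based on the plan that the standalone project would start out identical to lxml.cssselect.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# "cssselect" is the assumed name of the future standalone package.
_cssselect = optional_import("cssselect")


def require(module, module_name):
    """Return `module`, or raise at call time if the import failed."""
    if module is None:
        raise ImportError(
            "%s is required for this feature but is not installed"
            % module_name)
    return module


def cssselect_method(element, expression):
    """Hypothetical stand-in for a method such as HTMLElement.cssselect."""
    backend = require(_cssselect, "cssselect")
    # Assumption: the package keeps the CSSSelector class of lxml.cssselect.
    return backend.CSSSelector(expression)(element)
```

Importing the host module never fails this way; only code that actually calls the convenience method sees the ImportError.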

Simon Sapin

11:58 a.m.

Le 06/04/2012 16:36, Stefan Behnel a écrit :

...

Stefan Behnel

2:46 a.m.

Simon Sapin, 08.04.2012 20:58:

...

Le 06/04/2012 16:36, Stefan Behnel a écrit :

...
...
Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml?

Makes sense. It seems to be the one part of lxml that attracts serious external interest lately, and I can definitely see it being useful outside of lxml itself.

Are you volunteering to take it over? Note that backwards compatibility is quite important, but it seems you are aware of that.

Yes. My full time job currently consists mostly in developing WeasyPrint. I made tinycss as a part of *that*, and I can also allocate some time for cssselect. I also think I understand cssselect’s code fairly well.

If you think this is the way to go, I can make a cssselect project on Github and PyPI

...

that is (at first) the same as the current lxml.cssselect, only at a different import name. (I’ll try to preserve the commit history.)

...

Then I’ll separate the XPath translation into a class as described in the previous message.

From there we can discuss more precisely what to do and how.

How does this all sound?

Sounds good to me.

...

As to backward-compatibility: I think that adding support for selectors that were previously not supported is not a problem.

Sure.

...

But more generally, we should decide what is:

* Part of the public API, has backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

Simon Sapin

4:28 a.m.

Le 10/04/2012 11:46, Stefan Behnel a écrit :

...

There's also the experimental fork by Laurence Rowe:

http://pypi.python.org/pypi/experimental.cssselect

You can find my comments on his changes in lxml's github issues list.

I’ll look at it once I have something equivalent to lxml’s current master with passing tests.

...

...
...
But more generally, we should decide what is:

* Part of the public API, has backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

In what way does it rely on them?

It manipulates the parsed objects (Pseudo, CombinedSelector, etc.) to:

* Calculate the specificity
* Split out pseudo-elements

But I could move both of these into cssselect and avoid the question.
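
To make the specificity calculation mentioned above concrete, here is a small self-contained sketch. The node classes below are simplified stand-ins invented for this example (they only loosely mirror names such as Pseudo and CombinedSelector) and are not the actual parsed objects of lxml.cssselect or the code in tinycss. The counting follows the Selectors Level 3 rule: (number of ID selectors, number of class and pseudo-class selectors, number of type selectors), with the universal selector contributing nothing. Attribute selectors, pseudo-elements and :not() are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class Element:
    name: Optional[str]  # None stands for the universal selector "*"


@dataclass
class Id:
    selector: "Node"
    id: str


@dataclass
class Class:
    selector: "Node"
    class_name: str


@dataclass
class Pseudo:
    selector: "Node"
    ident: str


@dataclass
class Combined:
    left: "Node"
    combinator: str
    right: "Node"


Node = Union[Element, Id, Class, Pseudo, Combined]


def specificity(node: Node) -> Tuple[int, int, int]:
    """Return the (ids, classes and pseudo-classes, types) triple."""
    if isinstance(node, Element):
        return (0, 0, 0 if node.name is None else 1)
    if isinstance(node, Combined):
        # Combinators themselves do not count; add up both sides.
        a1, b1, c1 = specificity(node.left)
        a2, b2, c2 = specificity(node.right)
        return (a1 + a2, b1 + b2, c1 + c2)
    a, b, c = specificity(node.selector)
    if isinstance(node, Id):
        return (a + 1, b, c)
    return (a, b + 1, c)  # Class or Pseudo


# div#main > a.external:hover
tree = Combined(
    Id(Element("div"), "main"),
    ">",
    Pseudo(Class(Element("a"), "external"), "hover"),
)
print(specificity(tree))  # (1, 2, 2)
```

Because the triples are plain tuples, comparing two selectors by specificity is just an ordinary tuple comparison.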

...

I'm more concerned with the "obviously" public API and the conversion semantics. It would be bad to break a currently working CSS selector expression without a good reason, for example.

Personally, I think the implementation of the parser is subject to change at any time, e.g. adding a new argument to the methods to pass down a context object is fine. The new way to extend the generation will be the right way to do it in the future, so any hacks that currently try to hook into it in undocumented ways are IMHO not worth keeping alive if that causes hassle. It will be easier to do the same things with the new architecture (and if it's not, it's the new architecture that should be extended, instead of keeping quirks from the old one).

Agreed.

...

That being said, I will happily tell everyone who complains to point their gun at you instead of me once you've taken over maintainership. :-)

Fair enough :) -- Simon

Simon Sapin

4:55 a.m.

New subject: cssselect: Python 2.4

Stefan Behnel

5:16 a.m.

New subject: cssselect: Python 2.4

Simon Sapin, 11.04.2012 13:55:

...

It worked before - why break it now?

...

2.6 makes many things easier, especially for supporting 3.x with the same code base.

Simon Sapin

April 2012

12:53 p.m.

New subject: cssselect: Python 2.4

Le 11/04/2012 14:16, Stefan Behnel a écrit :

...

Ok. I generally don’t bother with 2.5 or 2.4 for new projects, but as you said, it’s already there. I’ll make sure that cssselect’s tests pass in Python 2.4 and up. -- Simon

13 comments

3 participants

participants (3)

Laurence Rowe
Simon Sapin
Stefan Behnel