Hi,

I would like to make some changes in cssselect (and provide patches):

1. Change :link, :visited, :target, :hover, :active and :focus to never match (silently, or maybe with a warning?) instead of raising. They could translate to XPath [false]

2. Make :checked HTML-specific. (The implementation already is.) This would involve having a document_language parameter which would default to HTML in HTMLElement.cssselect and plain XML otherwise. :checked would raise on non-HTML documents. Having this parameter could eventually allow a language-specific (as in HTML vs. XML) implementation of :lang().

3. Implement :enabled and :disabled similarly to :checked. Assuming 2, they would also be HTML-specific.

(The docs mention :unchecked and :indeterminate but there is no such thing in the code or the selectors3 spec.)

What do you think?

Regards,
-- 
Simon Sapin
Simon Sapin, 26.03.2012 16:40:
I think it would be good to make this configurable. A keyword argument like "ignore_link_classes" or "ignore_html_status_classes" could work here. Or maybe let the user pass the specific pseudo-classes as an argument "matching_html_status_classes"? By default, they would raise an exception in the parser (as they do now, right?). Then, the user can either pass True/False to set all of them to match or to not match, or pass a set/tuple/list of specific names that should match, thus automatically setting the remaining ones to not match.
"checked" basically just maps to an attribute value, but even the spec definitely is rather HTML focussed. That makes it hopeless to implement this "correctly" in a "generic" way.

I like the "document_language" parameter, because it would also allow supporting other XML languages in the future. OTOH, we could be even more extensible by allowing users to pass in an arbitrary XPath condition for a given pseudo-class, as a plain string. Example:

    pseudo_classes = dict(hover='contains(., "active")')
    CSSSelect("a:hover", class_expressions=pseudo_classes)
    -> //a[contains(., "active")]

In that context, the "document_language" option would simply select a specific configuration for these expressions.

And we could also support rejecting a specific pseudo-class by passing a singleton value, e.g.

    pseudo_classes = dict(hover=cssselect.REJECT)

and that would raise an exception in the parser. The REJECT could also be an arbitrary function that would either return an expression string or raise an error. Something like that...
3. Implement :enabled and :disabled similarly to :checked. Assuming 2, they would also be HTML-specific.
Yes, same case.
(The docs mention :unchecked and :indeterminate but there is no such thing in the code or the selectors3 spec.)
At least ":indeterminate" is in the spec:

http://www.w3.org/TR/selectors/#UIstates

And ":unchecked" makes sense if you allow ":checked". AFAICT, CSS selectors don't support the concept of a negation operator.

Stefan
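The class_expressions idea from this message can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `translate_pseudo_class` and the module-level `REJECT` sentinel are hypothetical names, not part of lxml or cssselect, and the sketch handles a single element name with a single pseudo-class rather than a full selector.

```python
# Hypothetical sentinel: "this pseudo-class must be rejected".
REJECT = object()


def translate_pseudo_class(element, ident, class_expressions):
    """Translate 'element:ident' to XPath using user-provided conditions.

    class_expressions maps a pseudo-class name to an XPath condition
    string, to REJECT, or to a callable that returns a condition string
    (or raises). Names missing from the mapping are rejected.
    """
    expression = class_expressions.get(ident, REJECT)
    if callable(expression):
        # A callable decides for itself: return a condition or raise.
        expression = expression(ident)
    if expression is REJECT:
        raise ValueError('The pseudo-class :%s is not supported' % ident)
    return '//%s[%s]' % (element, expression)


pseudo_classes = dict(hover='contains(., "active")')
print(translate_pseudo_class('a', 'hover', pseudo_classes))
```

With the mapping from the message, the call reproduces the `//a[contains(., "active")]` result quoted above; passing `dict(hover=REJECT)` raises instead.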
Quick reactions to a few points. The rest is interesting, but requires more attention than I have right now. I’ll give it more thought.

On 31/03/2012 16:22, Stefan Behnel wrote:
At least ":indeterminate" is in the spec:
I guess it was in a draft and then removed. The current rec has a section for it, but it says: "A future version of this specification may introduce an :indeterminate pseudo-class that applies to such elements."
And ":unchecked" makes sense if you allow ":checked". AFAICT, CSS selectors don't support the concept of an operator negation.
Level 3 has :not(...), and it is implemented in cssselect.

http://www.w3.org/TR/selectors/#negation

:not(:checked) should work, although it also selects "uncheckable" elements, not just "checkable" elements that are currently not checked.

Regards,
-- 
Simon Sapin
On 31 March 2012 15:22, Stefan Behnel <stefan_ml@behnel.de> wrote:
I too agree that the whole css_to_xpath translation needs to be configurable. I made an attempt at this in https://github.com/lrowe/lxml/tree/cssselect-match-pattern (which also incorporates case-insensitive :contains() using regex) using an options dict that got passed down through the various functions. It felt a little messy, but I think it would probably work well enough.

With that it would be possible to optionally use EXSLT's str:tokenize for faster @class parsing or pass hints that there are <xsl:key> indexes for id/class/tagname to really speed things up when running in XSLT.

(I then got a little sidetracked attempting to refactor it all to use ElementTree intermediate representations of the CSS and XPath expressions, but then you need to write the transform between the two representations and I dropped down the rabbit hole of writing a Python/ElementTree XSLT alternative...)

Laurence
Simon Sapin, 01.04.2012 00:46:
Hmm, good idea. As should :visited, right?

The :target class is a bit trickier. It could match any tag with an ID, or a[@name], but that sounds like it would hit way too many 'targets'. We may allow passing an actual URL into the evaluation and prepare for that with an XPath variable.

Stefan
Hi,

I’m taking things a bit out of order here, sorry about that.

Overridable / configurable translation to XPath
===============================================

Currently, the translation to XPath is tied to the selector parsing, as it happens in the xpath() methods of the parsed objects. I suggest that the first thing to do is to separate the implementation, at least for the various pseudo-classes, into a "translation" class/object. For example, instead of (roughly):

    method = '_xpath_' + self.ident.replace('-', '_')
    method = getattr(self, method)

Pseudo.xpath() would contain:

    method = 'pseudo_class_' + self.ident.replace('-', '_')
    method = getattr(translation, method)

... and similar changes for functional pseudo-classes, maybe combinators, etc. There would be a default, but the user could provide a different translation that overrides or adds pseudo-classes.

This mechanism could also enable different translations for different document languages. For example, we could decide that links do not exist in "generic" XML (and :link and :visited would never match) but links in HTML are <a> elements with an href attribute.

A separated and overridable translation also allows the user to make different choices for everything else we’re discussing here. I think this is important as it means that more complex use cases can get by without forking or monkey-patching cssselect, whatever we decide here.

Links
=====

:link is actually for "links that have not yet been visited", so it is mutually exclusive with :visited. ":link, :visited" matches all links. In HTML, that is equivalent to "a[href]".

I think that "nothing is visited" is a sane default. The pseudo-classes would then translate (CSS to XPath) as:

    :link    → a[@href]
    :visited → [false] (or equivalent)

If we want to really implement :visited, there could be a user-provided function/callable that takes a URL and returns True for visited, False for not visited. Conveniently, that callable could be some_set.__contains__ or similar.

If such a callable is provided, the translations would be:

    :link    → a[@href and not url_is_visited(@href)]
    :visited → a[@href and url_is_visited(@href)]

Namespaces for functions vs. elements
=====================================

By the way, is it possible to pass a "one time" user function for XPath? (i.e. visible from only one compiled XPath object.) etree.FunctionNamespace looks like it registers functions globally.

(Visited links could arguably be the same globally, but I was also thinking of using functions for implementing :lang(), which is also based on external information like HTTP headers.)

Also, is the prefix->URI mapping the same for elements and XPath functions? If I want to correctly implement @namespace, any prefix could be used in selectors. I don’t want these to overlap with function namespaces.

Rejecting implemented selectors
===============================

The spec is clear that unsupported selectors (say, the level 4 :matches() that we have not implemented yet) are invalid, and the whole (comma-separated) group of selectors should be invalid. Raising an exception for this (as it is done currently) is good.

But once we *do* have an implementation for some pseudo-class, is there a reason to opt out of it and make it invalid again? Stefan suggested a REJECT marker, but I don’t understand the use case for this.

Also, I think that never matching is a sane fallback for many pseudo-classes. (E.g.: unless otherwise specified, there is no link and nothing is hovered.) But I’m not sure that making any of these always match makes sense. Is there a use case?

HTML
====

:checked has a precise definition in the spec in terms of HTML, but that is only an example. The actual definition is more general than that (anything that can be toggled "on" by the user.)

We could say that there is no such element in "generic" XML, so :checked could never match (but still be valid).

Again, overridable translation could have a different implementation for both, and allow yet another one for another document language.

Extracting cssselect from lxml
==============================

In tinycss (my new CSS parser, a replacement of cssutils for WeasyPrint), I implemented the selector specificity and extracting of pseudo-elements based on parsed cssselect objects:

https://github.com/SimonSapin/tinycss/blob/55b26cd22f/tinycss/selectors3.py#...

This means that the whole module depends on lxml and thus does not run on PyPy, although the parts I really use are pure-Python. (Of course Selector.match would always depend on lxml, but that is more of a nice bonus.)

How would you feel about making cssselect an independent project, outside of lxml? lxml would still use it to provide convenience methods like HTMLElement.cssselect.

Regards,
-- 
Simon Sapin
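The dispatch described in this message (building a `pseudo_class_...` method name and looking it up on a separate translation object) might look roughly like the sketch below. The class names `SketchTranslator` and `HTMLSketchTranslator`, the exception type, and the returned condition strings are assumptions made for illustration; they are not the code that existed in lxml at the time. The conditions use the XPath 1.0 spelling `false()` for "never matches".

```python
class UnsupportedPseudoClass(Exception):
    """Raised for pseudo-classes that have no translation at all."""


class SketchTranslator(object):
    """Hypothetical default for 'generic' XML: no links, nothing checked."""

    def pseudo_class_condition(self, ident):
        # Same lookup as in the message: 'first-child' would be handled
        # by a method named pseudo_class_first_child, if one exists.
        name = 'pseudo_class_' + ident.replace('-', '_')
        method = getattr(self, name, None)
        if method is None:
            raise UnsupportedPseudoClass(':%s is not supported' % ident)
        return method()

    def never_matches(self):
        return 'false()'

    # Valid selectors that simply select nothing in generic XML.
    pseudo_class_link = never_matches
    pseudo_class_visited = never_matches
    pseudo_class_checked = never_matches


class HTMLSketchTranslator(SketchTranslator):
    """Hypothetical HTML flavour: plain subclassing overrides one rule."""

    def pseudo_class_link(self):
        # "Nothing is visited" default: every a[href] is an unvisited link.
        return 'self::a and @href'


for translator in (SketchTranslator(), HTMLSketchTranslator()):
    print('*[%s]' % translator.pseudo_class_condition('link'))
```

A user who needs different behaviour would subclass one of these and override or add `pseudo_class_*` methods, which is the "no forking or monkey-patching" property the message argues for.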
Hi Simon, Simon Sapin, 02.04.2012 17:43:
+1
Basically, this enables plain old subtyping to hook into the translation. Totally makes sense to me.
Sure. That could just be a method in the translator that would either return None (by default) for a given pseudo class, or the prefixed name of a user provided function when overridden. The method that calls it would then either generate "[false]" or the expression above, and users could also override that outer method directly to get a totally different behaviour when they need it.
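A minimal sketch of that two-level hook, assuming hypothetical names throughout (`visited_function_name`, `HistoryTranslator`, and the `_hist` prefix are invented for the example): the inner method returns either None or the prefixed name of a user-provided XPath function, and the outer methods turn that into a condition. Note that XPath 1.0 spells negation as the function `not(...)`.

```python
class Translator(object):
    """Hypothetical translator with an overridable 'visited' hook."""

    def visited_function_name(self):
        # Default: no user-provided function, so nothing is ever visited.
        return None

    def pseudo_class_link(self):
        name = self.visited_function_name()
        if name is None:
            return 'self::a and @href'
        return 'self::a and @href and not(%s(@href))' % name

    def pseudo_class_visited(self):
        name = self.visited_function_name()
        if name is None:
            return 'false()'
        return 'self::a and @href and %s(@href)' % name


class HistoryTranslator(Translator):
    """Overrides only the inner hook; the outer methods stay as they are."""

    def visited_function_name(self):
        # The prefix must be bound to a function namespace by the caller.
        return '_hist:url_is_visited'


print(Translator().pseudo_class_visited())
print(HistoryTranslator().pseudo_class_visited())
```

Users who want something entirely different can still override `pseudo_class_link` or `pseudo_class_visited` directly.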
Yes, that's the idea. However, you can pass functions into the XPath constructor. Assuming you won't need this for thousands of different functions, you can just create a couple of XPath instances with different configurations.
You can pass variables into the evaluator and use an XPath expression like this:

    from lxml.etree import XPath
    match_urls = XPath('//a[contains(@href, $pattern)]')
    # the variable value is passed as a plain string, without XPath quoting
    results = match_urls(element, pattern='://lxml.de/')

The same works for user provided functions, obviously.
There could be a callback method in the translator that returns the mapping, or that looks up a prefix for a given namespace URI. Overriding that would allow users to modify the mapping in case of collisions. Apart from that, I'd just use prefix names that start with an underscore, maybe even "__lxml_ABC". That should be rare enough out there.
+1, helps with future compatibility and makes it clear to users what is currently supported.
That was just an idea that would allow users to make sure a given marker never matches. It's a lot more generic to use a dedicated translator class.
Also, I think that never matching is a sane fallback for many pseudo-classes.
Definitely for those that do not have one obvious meaning.
To be answered by others.
Absolutely.
Makes sense. It seems to be the one part of lxml that attracts serious external interest lately, and I can definitely see it being useful outside of lxml itself. Are you volunteering to take it over? Note that backwards compatibility is quite important, but it seems you are aware of that.
lxml would still use it to provide convenience methods like HTMLElement.cssselect.
Yes, it would just be a conditional import away and otherwise raise an exception at call time when the dependency is missing. Stefan
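The "conditional import, error at call time" pattern could be sketched as follows. The helper names `load_optional` and `make_cssselect` are hypothetical, and the call to `HTMLTranslator().css_to_xpath()` assumes the API of the standalone cssselect package as it was later released, which this thread had not yet defined.

```python
import importlib


def load_optional(name):
    """Return the named module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def make_cssselect(backend):
    """Build a convenience function around an optional backend module."""
    def cssselect(expr):
        if backend is None:
            # Fail only when the feature is used, not when importing.
            raise ImportError('cssselect does not seem to be installed')
        # Assumed API of the later standalone package (see lead-in).
        return backend.HTMLTranslator().css_to_xpath(expr)
    return cssselect


# Importing this module never fails, even without the dependency.
cssselect = make_cssselect(load_optional('cssselect'))
```

Code that never calls `cssselect()` keeps working without the dependency, and code that does call it gets a clear error message.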
On 06/04/2012 16:36, Stefan Behnel wrote:
Yes. My full-time job currently consists mostly of developing WeasyPrint. I made tinycss as a part of *that*, and I can also allocate some time for cssselect. I also think I understand cssselect’s code fairly well.

If you think this is the way to go, I can make a cssselect project on GitHub and PyPI that is (at first) the same as the current lxml.cssselect, only at a different import name. (I’ll try to preserve the commit history.) I can give push access there to whoever is interested.

Then I’ll separate the XPath translation into a class as described in the previous message. From there we can discuss more precisely what to do and how.

How does this all sound?

As to backward-compatibility: I think that adding support for selectors that were previously not supported is not a problem. But more generally, we should decide what is:

* Part of the public API, with backward-compatibility promises
* or an implementation detail that can change.

For example, tinycss relies on the undocumented parsed selector objects. Is it okay to change/break these?

-- 
Simon Sapin
Simon Sapin, 08.04.2012 20:58:
Please do. "cssselect" sounds like a good enough name to me. There's also the experimental fork by Laurence Rowe: http://pypi.python.org/pypi/experimental.cssselect You can find my comments on his changes in lxml's github issues list.
Hg, at least, has the convert extension, which allows selectively cloning repositories and moving files around while doing so. I'd expect git to have something similar. Ask back if you need help with this.
Sounds good to me.
As to backward-compatibility: I think that adding support for selectors that were previously not supported is not a problem.
Sure.
In what way does it rely on them? I'm more concerned with the "obviously" public API and the conversion semantics. It would be bad to break a currently working CSS selector expression without a good reason, for example. Personally, I think the implementation of the parser is subject to change at any time, e.g. adding a new argument to the methods to pass down a context object is fine. The new way to extend the generation will be the right way to do it in the future, so any hacks that currently try to hook into it in undocumented ways are IMHO not worth keeping alive if that causes hassle. It will be easier to do the same things with the new architecture (and if it's not, it's the new architecture that should be extended, instead of keeping quirks from the old one). That being said, I will happily tell everyone who complains to point their gun at you instead of me once you've taken over maintainership. :-) Stefan
On 10/04/2012 11:46, Stefan Behnel wrote:
I’ll look at it once I have something equivalent to lxml’s current master with passing tests.
It manipulates the parsed objects (Pseudo, CombinedSelector, etc.) to:

* Calculate the specificity
* Split out pseudo-elements

But I could move both of these into cssselect and avoid the question.
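As a rough illustration of those two operations, here is a sketch that works on a deliberately simplified, hypothetical representation: a compound selector is a list of `(kind, name)` tuples, not the real Pseudo or CombinedSelector objects. The counting follows the Selectors Level 3 rule of (IDs; classes, attributes and pseudo-classes; type selectors and pseudo-elements).

```python
def specificity(parts):
    """Return the (a, b, c) specificity of a list of (kind, name) parts."""
    a = b = c = 0
    for kind, _name in parts:
        if kind == 'id':
            a += 1
        elif kind in ('class', 'attribute', 'pseudo-class'):
            b += 1
        elif kind in ('element', 'pseudo-element'):
            c += 1
        # Anything else (e.g. the universal selector) counts for nothing.
    return (a, b, c)


def split_pseudo_element(parts):
    """Return (parts without pseudo-elements, pseudo-element name or None)."""
    pseudo = [name for kind, name in parts if kind == 'pseudo-element']
    rest = [part for part in parts if part[0] != 'pseudo-element']
    return rest, (pseudo[0] if pseudo else None)


# li.red.level, the example from the specification: specificity (0, 2, 1)
selector = [('element', 'li'), ('class', 'red'), ('class', 'level')]
print(specificity(selector))
```

Tuples compare lexicographically in Python, so the returned values can be sorted or compared directly when ordering rules by specificity.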
Agreed.
That being said, I will happily tell everyone who complains to point their gun at you instead of me once you've taken over maintainership. :-)
Fair enough :) -- Simon
I started something here but it is very much in progress: https://github.com/SimonSapin/cssselect Quick question: how important is Python 2.4 support for cssselect? 2.6 makes many things easier, especially for supporting 3.x with the same code base. Regards, -- Simon
Simon Sapin, 11.04.2012 13:55:
It worked before - why break it now?
2.6 makes many things easier, especially for supporting 3.x with the same code base.
That argument seems to stick in people's heads way too easily. Sure, it's /easier/, but it's not /difficult/ to stay compatible with older Python versions. There is usually a bit of boilerplate code involved, but once that's there, most things come pretty much for free. For example, you won't run into byte/unicode issues because everything is always going to be Unicode or plain ASCII characters. And since you won't use non-ASCII characters in the source code, you won't have a syntax issue with the 'u' prefix. Getting at exception objects is a bit more involved, and you can't use the "with" statement. Well. Both do not really make up the bulk of the code. In any case, going straight for 2.6 is way overshooting it. Even the latest Django release still supports Py2.5. Stefan
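The two pieces of boilerplate mentioned here can be sketched as below; the function names are made up for the example. `sys.exc_info()` gets at the exception object without the `except ... as` syntax (added in Python 2.6), and a plain try/finally stands in for the "with" statement.

```python
import sys


def parse_int(text):
    """Parse an integer, reporting failures without 'except ... as'."""
    try:
        return int(text)
    except ValueError:
        # 'except ValueError, e' is a syntax error on 3.x and
        # 'except ValueError as e' is one before 2.6, so fetch the
        # exception object in a way that every version accepts.
        error = sys.exc_info()[1]
        return 'invalid: %s' % type(error).__name__


def read_all(stream):
    """Read a stream and always close it, without the 'with' statement."""
    try:
        return stream.read()
    finally:
        stream.close()


print(parse_int('42'))
print(parse_int('forty-two'))
```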
participants (3)
- Laurence Rowe
- Simon Sapin
- Stefan Behnel