SelectorSyntaxError: elements with name E in any namespace (*|E)
Hi, I love lxml, thanks for the hard work. I just tried "sel = CSSSelector('*|Content')" with the latest lxml release (2.3.5) and got the exception at the bottom of this message. I checked the CSS specification and it discusses selectors of this format [1][2]:
ns|E  elements with name E in namespace ns
*|E   elements with name E in any namespace, including those without a namespace
|E    elements with name E without a namespace
E     if no default namespace has been declared for selectors, this is
      equivalent to *|E. Otherwise it is equivalent to ns|E where ns is the
      default namespace.
I also checked the lxml bug tracker but didn't find this issue of elements in any namespace mentioned. Is this something that could be supported by lxml? Should I open an issue in the bug tracker for it? Also, I found this thread on the mailing list which discusses support for elements in any namespace [3] and proposes a patch.

[1] http://www.w3.org/TR/css3-selectors/#typenmsp
[2] http://www.w3.org/TR/css3-namespace/#css-qnames
[3] http://thread.gmane.org/gmane.comp.python.lxml.devel/4937
Traceback (most recent call last):
  File "example.py", line 35, in <module>
    sel = CSSSelector('*|Content')
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 51, in __init__
    path = css_to_xpath(css)
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 537, in css_to_xpath
    css_expr = parse(css_expr)
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 662, in parse
    return parse_selector_group(stream)
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 677, in parse_selector_group
    result.append(parse_selector(stream))
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 688, in parse_selector
    result = parse_simple_selector(stream)
  File "/home/nottheoilrig/lxml-2.3.5/src/lxml/cssselect.py", line 724, in parse_simple_selector
    "Expected symbol, got '%s'" % next)
lxml.cssselect.SelectorSyntaxError: Expected symbol, got '*' at [Token(u'*', 0), Token(u'|', 1), Symbol(u'Content', 2)] -> None
On 06/08/2012 08:19, Jack Bates wrote:
I just tried "sel = CSSSelector('*|Content')" with the latest lxml release (2.3.5) and got the exception at the bottom of this message
Hi Jack,

The parser in lxml.cssselect 2.3 and earlier is broken in many interesting ways. I recently took over maintenance of the project, made it a separate PyPI package and fixed a lot of stuff: http://packages.python.org/cssselect/

In lxml 3.0, lxml.cssselect becomes a thin wrapper around the new cssselect. You can either use the alpha from last week, or use lxml 2.3 with a more verbose syntax:

import cssselect
import lxml.etree

translator = cssselect.GenericTranslator()  # or HTMLTranslator()
sel = lxml.etree.XPath(translator.css_to_xpath('*|Content'))

Now, in the new cssselect the *syntax* for your selector is supported, but I’m not sure that the XPath translation is correct. Any suggestion in this area is welcome. Related issue: https://github.com/SimonSapin/cssselect/issues/9

As a side note, the "translating to XPath" approach to CSS selectors seems easy at first, but when trying to be spec-compliant you quickly find corner cases that are not obvious at all. In fact there are valid selectors for which I’m not sure that a correct XPath translation even exists: https://github.com/SimonSapin/cssselect/issues/12

I started playing with a different approach: straightforward Python code to compile selectors into (element -> bool) callables that use the lxml API. It’s currently slower than cssselect (which profits from libxml2’s optimized XPath engine) but much simpler and probably more correct. Also very much work-in-progress: I literally typed 'git init' yesterday. (Suggestions welcome on this one too!) https://github.com/SimonSapin/lselect

Have fun,
--
Simon Sapin
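For illustration, here is roughly what that workaround looks like end to end. The sample document and its element names are made up, and whether both the namespaced and un-namespaced elements actually match depends on the XPath translation which, as noted above, may not be correct yet:

import cssselect
import lxml.etree

translator = cssselect.GenericTranslator()
sel = lxml.etree.XPath(translator.css_to_xpath('*|Content'))

# A made-up document mixing an un-namespaced and a namespaced element.
doc = lxml.etree.fromstring(
    '<root xmlns:a="http://example.com/a"><Content/><a:Content/></root>')

# The compiled XPath object is simply called with the tree.
for el in sel(doc):
    print(el.tag)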
Simon Sapin, 06.08.2012 09:15:
I started playing with a different approach: straightforward Python code to compile selectors into (element -> bool) callables that use the lxml API. It’s currently slower than cssselect (which profits from libxml2’s optimized XPath engine) but much simpler and probably more correct. Also very much work-in-progress: I literally typed 'git init' yesterday. (Suggestions welcome on this one too!)
Interesting. Are you doing bottom-up evaluation here? It looks like it's designed for testing a given element, i.e. each element separately if you are looking for matches in a subtree. Is that really an important use case?

If you can manage to reverse the selection, you should get a huge speedup in tree searches by translating descendant selectors into "parent.iter(tag)" (which is faster than the equivalent XPath expression in lxml), instead of instantiating each element and testing its tag, which is horribly slow. Similar for iterchildren(), iterancestors() etc. That's a breach of purity, sure, but an important optimisation.

Take a look at the _elementpath.py module in lxml, which has a partial 'XPath' implementation based on generators. Having the same thing for CSS selectors would be great.

Stefan
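To make the suggested optimisation concrete, here is a rough sketch for a plain descendant selector such as "section p". The document and tag names are made up, and this is not lselect or lxml code:

import lxml.etree

doc = lxml.etree.fromstring(
    '<doc><section><p/><div><p/></div></section><p/></doc>')

# Bottom-up: instantiate every candidate element and test its ancestors.
slow = [p for p in doc.iter('p')
        if any(anc.tag == 'section' for anc in p.iterancestors())]

# Top-down with iter(tag): find the left-hand element first, then let
# libxml2 walk its subtree for the right-hand tag.  (Nested 'section'
# elements would need de-duplication, which is omitted here.)
fast = [p for section in doc.iter('section') for p in section.iter('p')]

assert set(slow) == set(fast)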
On 06/08/2012 09:41, Stefan Behnel wrote:
Interesting. Are you doing bottom-up evaluation here?
Yes. I read a few times that web browsers match selectors right to left:

http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-lef...
https://developer.mozilla.org/en/Writing_Efficient_CSS

I think they also loop on elements, and then find selectors that match, i.e. the reversed nesting of a typical cssselect usage: for each selector, find elements that match. This is just an experiment in this direction.

It also seems to make sense for the descendant combinator, since an element often has many more descendants than ancestors. (Although this claim is based on nothing scientific.)
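A rough sketch of that reversed nesting, with hypothetical compiled selectors standing in for what lselect would produce (the document and class names are made up):

import lxml.etree

doc = lxml.etree.fromstring('<html><body><p class="note">hi</p></body></html>')

# Hypothetical compiled selectors: (selector text, element -> bool callable).
selectors = [
    ('p',     lambda el: el.tag == 'p'),
    ('.note', lambda el: 'note' in el.get('class', '').split()),
]

# Element-first matching: loop over elements, then find the selectors
# that match each one.
for el in doc.iter():
    matched = [text for text, match in selectors if match(el)]
    if matched:
        print(el.tag, matched)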
It looks like it's designed for testing a given element, i.e. each element separately if you are looking for matches in a subtree. Is that really an important use case?
Not really. My main use case in WeasyPrint is to apply stylesheets to a whole document. I want (style rule, element) pairs but don’t care in which order they come.
If you can manage to reverse the selection, you should get a huge speedup in tree searches by translating descendant selectors into "parent.iter(tag)" (which is faster than the equivalent XPath expression in lxml), instead of instantiating each element and testing its tag, which is horribly slow. Similar for iterchildren(), iterancestors() etc. That's a breach of purity, sure, but an important optimisation. Take a look at the _elementpath.py module in lxml, which has a partial 'XPath' implementation based on generators. Having the same thing for CSS selectors would be great.
Interesting. Of course this only works with element type selectors after a combinator, but this is a common case.

The problem is that it only works with "fully qualified" element types such as |E or NS|E. When there is no default namespace declared in the stylesheet (that is, most of the time) the E selector actually means *|E.

Can I use ancestor.iter(tag) with some tag object that means "this local name in any namespace"? Currently I use this bool expression. It’s probably not fast at all:

return 'el.tag.rsplit("}", 1)[-1] == %r' % tag

--
Simon Sapin
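For reference, the check behind that generated expression, written out as a plain predicate. This is a sketch of the idea, not lselect's actual code:

def local_name_matches(el, name):
    # '{http://ns}tag' -> 'tag'; an un-namespaced tag is left unchanged by
    # rsplit, so this behaves like the CSS *|E case: any namespace, or none.
    tag = el.tag
    return isinstance(tag, str) and tag.rsplit('}', 1)[-1] == name

# e.g. [el for el in tree.iter() if local_name_matches(el, 'Content')]
# Comments and processing instructions, whose .tag is not a string, are skipped.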
Simon Sapin, 06.08.2012 10:19:
On 06/08/2012 09:41, Stefan Behnel wrote:
Interesting. Are you doing bottom-up evaluation here?
Yes. I read a few times that web browsers match selectors right to left:
http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-lef... https://developer.mozilla.org/en/Writing_Efficient_CSS
I think they also loop on elements, and then find selectors that match, i.e. the reversed nesting of a typical cssselect usage: for each selector, find elements that match.
Right. Totally different use case.
This is just an experiment in this direction. It also seems to make sense for the descendant combinator since an element often has many more descendants than ancestors. (Although this claim is based on nothing scientific.)
Still, it's faster to search for matching descendants in O(n) than to walk over all elements in O(n) and test all (or many) of their ancestors, which is closer to something like O(n*log(n)).
It looks like it's designed for testing a given element, i.e. each element separately if you are looking for matches in a subtree. Is that really an important use case?
Not really. My main use case in WeasyPrint is to apply stylesheets to a whole document. I want (style rule, element) pairs but don’t care in which order they come.
Hmm, in that case, you also have to take precedence rules between the separate style sections into account - those aren't trivial. Anyway, since styles are inherited by entire subtrees, it might really be better to walk the tree and to collect styles along the path, than to apply each style section separately to the entire tree. You might also want to do a bit of hashing, at least for the easy (and common) cases where selectors end with a tag name.
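A sketch of that tree walk with styles collected along the path. The rule representation is a placeholder and, unlike real CSS, every property is inherited here, purely to keep the example short:

def compute_styles(element, parent_style, rules, results):
    # Start from the inherited values, then apply the declarations of every
    # rule whose selector matches this element (cascade order omitted).
    style = dict(parent_style)
    for matches, declarations in rules:
        if matches(element):
            style.update(declarations)
    results[element] = style
    for child in element:
        compute_styles(child, style, rules, results)

# Usage sketch:
#   rules = [(lambda el: el.tag == 'p', {'color': 'red'})]
#   results = {}
#   compute_styles(tree.getroot(), {}, rules, results)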
If you can manage to reverse the selection, you should get a huge speedup in tree searches by translating descendant selectors into "parent.iter(tag)" (which is faster than the equivalent XPath expression in lxml), instead of instantiating each element and testing its tag, which is horribly slow. Similar for iterchildren(), iterancestors() etc. That's a breach of purity, sure, but an important optimisation. Take a look at the _elementpath.py module in lxml, which has a partial 'XPath' implementation based on generators. Having the same thing for CSS selectors would be great.
Interesting. Of course this only works with element type selectors after a combinator, but this is a common case.
The problem is that it only works with "fully qualified" element types such as |E or NS|E. When there is no default namespace declared in the stylesheet (that is, most of the time) the E selector actually means *|E.
Can I use ancestor.iter(tag) with some tag object that means "this local name in any namespace"? Currently I use this bool expression. It’s probably not fast at all:
return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
You can search for "{*}localname".

Stefan
On 06/08/2012 10:43, Stefan Behnel wrote:
Simon Sapin, 06.08.2012 10:19:
On 06/08/2012 09:41, Stefan Behnel wrote:
Interesting. Are you doing bottom-up evaluation here?
Yes. I read a few times that web browsers match selectors right to left:
http://stackoverflow.com/questions/5797014/css-selectors-parsed-right-to-lef... https://developer.mozilla.org/en/Writing_Efficient_CSS
I think they also loop on elements, and then find selectors that match, i.e. the reversed nesting of a typical cssselect usage: for each selector, find elements that match.
Right. Totally different use case.
The use case differs in that browsers want to do progressive rendering or minimal updates when the DOM changes dynamically, while PDF conversion is static. Other than that, this is mostly an implementation detail.
This is just an experiment in this direction. It also seems to make sense for the descendant combinator since an element often has many more descendants than ancestors. (Although this claim is based on nothing scientific.)
Still, it's faster to search for matching descendants in O(n) than to walk over all elements in O(n) and test all (or many) of their ancestors, which is closer to something like O(n*log(n)).
There are also m selectors, but I guess this does not change the big-Os.
My main use case in WeasyPrint is to apply stylesheets to a whole document. I want (style rule, element) pairs but don’t care in which order they come.
Hmm, in that case, you also have to take precedence rules between the separate style sections into account - those aren't trivial.
The cascade is not trivial but not very hard either. I already spent time on this and figured it out. The rule precedence is by origin/priority, selector specificity, then source order. The latter can be encoded as an integer if needed, that’s not a problem.
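One possible encoding of that precedence, as a sketch: a tuple sorts lexicographically, so origin/importance, then specificity, then source order decide the winner. The weights below follow the CSS 2.1 cascade but are illustrative, not WeasyPrint's actual values:

# Normal and !important declarations from the three origins, in ascending
# order of precedence; a higher weight wins.
ORIGIN_WEIGHTS = {
    ('user agent', False): 0,   # (origin, is_important)
    ('user', False): 1,
    ('author', False): 2,
    ('author', True): 3,
    ('user', True): 4,
}

def precedence(origin, important, specificity, source_order):
    # specificity is the usual (ids, classes, types) triple of the selector;
    # source_order is the index of the rule in its stylesheet.
    return (ORIGIN_WEIGHTS[origin, important], specificity, source_order)

# The declaration with the highest precedence tuple wins, e.g.:
#   winner = max(candidates, key=lambda d: precedence(*d))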
Anyway, since styles are inherited by entire subtrees, it might really be better to walk the tree and to collect styles along the path, than to apply each style section separately to the entire tree.
Yes, by looping on elements first and then selectors, I could do the cascade and inheritance in one pass (instead of two currently). That would help with memory usage but I’m not sure about speed.
You might also want to do a bit of hashing, at least for the easy (and common) cases where selectors end with a tag name.
I don’t understand, hashing what?
return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
You can search for "{*}localname".
Great! That’s what I was missing. That would be for the .iter*() methods. Can I do a similar test on a single element?

--
Simon Sapin
Simon Sapin, 06.08.2012 11:25:
On 06/08/2012 10:43, Stefan Behnel wrote:
You might also want to do a bit of hashing, at least for the easy (and common) cases where selectors end with a tag name.
I don’t understand, hashing what?
I was suggesting to build up a dict of tag->selectors mappings that directly list the potentially matching selectors for a given tag name. Plus a shorter list of selectors that do not end in a simple tag name match but e.g. a plain class attribute or whatever. That would reduce the number of selector tests that you need to run on each element while walking the tree.

Stefan
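A sketch of that pre-grouping. The .rightmost_tag attribute is hypothetical: the tag name that the last simple selector requires, or None if it only tests classes, attributes and so on:

from collections import defaultdict

def group_by_tag(selectors):
    by_tag = defaultdict(list)   # tag name -> selectors ending in that tag
    rest = []                    # selectors with no tag name to key on
    for sel in selectors:
        if sel.rightmost_tag is not None:
            by_tag[sel.rightmost_tag].append(sel)
        else:
            rest.append(sel)
    return by_tag, rest

# While walking the tree, only by_tag.get(el.tag, []) + rest need to be
# tested against each element, instead of the full selector list.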
On 06/08/2012 10:43, Stefan Behnel wrote:
return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
You can search for "{*}localname".
Unfortunately that doesn’t seem to work:

>>> t = lxml.etree.fromstring(
...     '<r xmlns:a="aa" xmlns:b="bb"><a:e/><b:e/></r>')
>>> list(t.iter('{aa}*'))
[<Element {aa}e at 0x273d500>]
>>> list(t.iter('{*}e'))
[]  # I expected [{aa}e, {bb}e]

Same on lxml 2.3.5 and 3.0.alpha1.

--
Simon Sapin
Simon Sapin, 06.08.2012 13:51:
On 06/08/2012 10:43, Stefan Behnel wrote:
return 'el.tag.rsplit("}", 1)[-1] == %r' % tag
You can search for "{*}localname".
Unfortunately that doesn’t seem to work:
>>> t = lxml.etree.fromstring(
...     '<r xmlns:a="aa" xmlns:b="bb"><a:e/><b:e/></r>')
>>> list(t.iter('{aa}*'))
[<Element {aa}e at 0x273d500>]
>>> list(t.iter('{*}e'))
[]  # I expected [{aa}e, {bb}e]
Same on lxml 2.3.5 and 3.0.alpha1
Hmm, right. Sorry, I thought that worked.

Looking at the source, it's not a quick fix, though (which is most likely the reason it's not implemented ...). The code is in lxml.etree.pyx class _MultiTagMatcher, and apihelpers.pxi function _mapTagsToQnameMatchArray(), in case someone wants to give it a try.

Stefan
On 06/08/2012 18:54, Stefan Behnel wrote:
The code is in lxml.etree.pyx class _MultiTagMatcher, and apihelpers.pxi function _mapTagsToQnameMatchArray(), in case someone wants to give it a try.
If the patch is done soonish, can I expect to have it in lxml?

--
Simon Sapin
Simon Sapin, 06.08.2012 19:02:
On 06/08/2012 18:54, Stefan Behnel wrote:
The code is in lxml.etree.pyx class _MultiTagMatcher, and apihelpers.pxi function _mapTagsToQnameMatchArray(), in case someone wants to give it a try.
If the patch is done soonish, can I expect to have it in lxml?
Sure - depending on the patch. ;)

Stefan
On 06/08/2012 18:54, Stefan Behnel wrote:
Hmm, right. Sorry, I thought that worked.
Looking at the source, it's not a quick fix, though (which is most likely the reason it's not implemented ...). The code is in lxml.etree.pyx class _MultiTagMatcher, and apihelpers.pxi function _mapTagsToQnameMatchArray(), in case someone wants to give it a try.
Let’s talk data structure before attacking the code. From my reading of the code in 3.0.alpha1:

Strings for the 'tag' argument to iter*() methods are converted in _MultiTagMatcher first to a list of (href, name) pairs of Python strings and then to an array of qname:

cdef struct qname:
    const_xmlChar* c_name
    python.PyObject* href

Later, _tagMatchesExactly takes a node and a qname, and returns a boolean int. A NULL pointer in qname.c_name is a wildcard that matches any local-name. A NULL for qname.href, however, only matches elements that do not have a namespace.

What I would (ideally) need for lselect is to match the namespace and local name independently:

* No namespace
* A given namespace URL
* Any namespace, including no namespace

* A given local name
* Any local name

For a total of 6 combinations.

qname.c_name already has two possible states, but we need an additional state for qname.href. We could either add a flag to the struct, or use a special non-string value as a marker. In lselect I use the "any" built-in function as a marker, but doing a global lookup in _tagMatchesExactly is probably bad for speed. None could be the marker, but I find it confusing to have a NULL pointer and a pointer to None mean completely different things.

Back to the higher level: strings in the 'tag' argument. '{}*', '{ns}*', '{*}*', '{}e' and '{ns}e' all work nicely in alpha1, only '{*}e' needs to be fixed. This is only a guess, but I’d say that no existing code relies on the current (wrong) behavior for '{*}e'.

(Side note: 'e' currently means '{}e' while '*' means '{*}*'. The inconsistency is unfortunate but fixing it would probably break existing code.)

Thoughts?

--
Simon Sapin
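A Python-level sketch of that matching matrix, just to make the six combinations concrete; the real implementation would live in the Cython qname code described above:

ANY = object()  # wildcard marker; None below means "no namespace"

def tag_matches(el, href, name):
    # href: None (no namespace), a namespace URI, or ANY
    # name: a local name, or ANY
    tag = el.tag
    if tag.startswith('{'):
        el_href, el_name = tag[1:].split('}', 1)
        el_href = el_href or None  # '{}e' behaves like plain 'e' in lxml
    else:
        el_href, el_name = None, tag
    return ((href is ANY or href == el_href) and
            (name is ANY or name == el_name))

# For example:
#   tag_matches(el, None, 'e')  ->  |e    no namespace, local name e
#   tag_matches(el, 'ns', ANY)  ->  ns|*  namespace ns, any local name
#   tag_matches(el, ANY, 'e')   ->  *|e   any namespace (or none), local name e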
Simon Sapin, 10.08.2012 12:23:
On 06/08/2012 18:54, Stefan Behnel wrote:
Looking at the source, it's not a quick fix, though (which is most likely the reason it's not implemented ...). The code is in lxml.etree.pyx class _MultiTagMatcher, and apihelpers.pxi function _mapTagsToQnameMatchArray(), in case someone wants to give it a try.
Let’s talk data structure before attacking the code. From my reading of the code in 3.0.alpha1:
Strings for the 'tag' argument to iter*() methods are converted in _MultiTagMatcher first to a list of (href, name) pairs of Python strings and then to an array of qname:
cdef struct qname:
    const_xmlChar* c_name
    python.PyObject* href
Later, _tagMatchesExactly takes a node and a qname, and returns a boolean int. A NULL pointer in qname.c_name is a wildcard that matches any local-name. A NULL for qname.href however only matches elements that do not have a namespace.
What I would (ideally) need for lselect is to match the namespace and local name independently:
* No namespace
* A given namespace URL
* Any namespace, including no namespace
* A given local name
* Any local name
For a total of 6 combinations.
qname.c_name already has two possible states, but we need an additional state for qname.href. We could either add a flag to the struct, or use a special non-string value as a marker.
In lselect I use the "any" built-in function as a marker, but doing a global lookup in _tagMatchesExactly is probably bad for speed. None could be the marker, but I find it confusing to have a NULL pointer and a pointer to None mean completely different things.
You can express three states with the current data structures by using NULL, b'' (empty bytes string) and b'non-empty-utf8-string' for wildcard, empty and fixed value respectively. The empty C string is easy and fast to test with cstring[0] == '\0' (after testing cstring for NULL). The _MultiTagMatcher is private and new in 3.0, so feel free to fix it up as you see fit.
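In Python terms, the proposed encoding maps roughly to the following. This is only a sketch of the semantics; the real test happens on C strings inside _tagMatchesExactly:

def href_slot_matches(stored_href, el_href):
    # stored_href encodes three states:
    #   None    -> wildcard: any namespace, or none       (C: NULL)
    #   b''     -> element must have no namespace         (C: empty string)
    #   b'uri'  -> element namespace must equal that URI  (C: non-empty string)
    if stored_href is None:
        return True
    if stored_href == b'':
        return not el_href
    return stored_href == el_href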
Back to the higher level: strings in the 'tag' argument. '{}*', '{ns}*', '{*}*', '{}e' and '{ns}e' all work nicely in alpha1, only '{*}e' needs to be fixed.
This is only a guess, but I’d say that no existing code relies on the current (wrong) behavior for '{*}e'.
Agreed. I'm not even sure if these wildcards are properly documented anywhere.
(Side note: 'e' currently means '{}e' while '*' means '{*}*'. The inconsistency is unfortunate but fixing it would probably break existing code.)
Right.

Stefan
On 10/08/2012 20:29, Stefan Behnel wrote:
You can express three states with the current data structures by using NULL, b'' (empty bytes string) and b'non-empty-utf8-string' for wildcard, empty and fixed value respectively. The empty C string is easy and fast to test with cstring[0] == '\0' (after testing cstring for NULL).
Sounds good.

Are the empty string and NULL always equivalent for the namespace of a node in libxml2/lxml? Currently _tagMatchesExactly seems to be checking for both.

--
Simon Sapin
Simon Sapin, 10.08.2012 20:48:
On 10/08/2012 20:29, Stefan Behnel wrote:
You can express three states with the current data structures by using NULL, b'' (empty bytes string) and b'non-empty-utf8-string' for wildcard, empty and fixed value respectively. The empty C string is easy and fast to test with cstring[0] == '\0' (after testing cstring for NULL).
Sounds good.
Are the empty string and NULL always equivalent for the namespace of a node in libxml2/lxml? Currently _tagMatchesExactly seems to be checking for both.
Nope. The empty string is the empty string. What you are looking at (I think) is just one form of an equality test.

Stefan
On 10/08/2012 20:54, Stefan Behnel wrote:
Nope. The empty string is the empty string. What you are looking at (I think) is just one form of an equality test.
I was looking at line 958 of apihelpers.pxi, in _tagMatchesExactly:

return c_node_href is NULL or c_node_href[0] == '\0'

Both are treated the same.

I tried parsing a few variations of <a xmlns="">; all raised XMLSyntaxError. Also, Element('{}a').tag gives just 'a', and QName(Element('{}a')).namespace is None. Apparently it is not possible to build an lxml element with an empty namespace that is different from no namespace.

--
Simon Sapin
On 10/08/2012 20:29, Stefan Behnel wrote:
cdef struct qname:
    const_xmlChar* c_name
    python.PyObject* href
You can express three states with the current data structures by using NULL, b'' (empty bytes string) and b'non-empty-utf8-string' for wildcard, empty and fixed value respectively. The empty C string is easy and fast to test with cstring[0] == '\0' (after testing cstring for NULL).
href is a Python string. Can I just use c_qname.href == '' in Cython?
Simon Sapin, 11.08.2012 15:01:
On 10/08/2012 20:29, Stefan Behnel wrote:
cdef struct qname:
    const_xmlChar* c_name
    python.PyObject* href
You can express three states with the current data structures by using NULL, b'' (empty bytes string) and b'non-empty-utf8-string' for wildcard, empty and fixed value respectively. The empty C string is easy and fast to test with cstring[0] == '\0' (after testing cstring for NULL).
href is a Python string. Can I just use c_qname.href == '' in Cython?
It's only a Python (bytes) string to you. For Cython, it's really declared as a pointer, i.e. it won't do refcounting on it and no object operations either. You can cast it back into an object like this:

<bytes>c_qname.href

and then do with it whatever you like. However, the fastest way to test it for being the empty string (knowing that it's really a bytes object and nothing else) is as follows:

python.__cstr(c_qname.href)[0] == c'\0'

"__cstr()" basically unfolds into PyBytes_AS_STRING(), but is declared on a PyObject* instead of an object. So you get the pointer to its bytes buffer back and compare its first char value to the C string end marker.

There's this code in the current matcher, which uses libxml2's xmlStrcmp() to compare two C string values:

if tree.xmlStrcmp(
        c_href, <const_xmlChar*>python.__cstr(c_qname.href)) == 0:

Stefan
Ok, the patch is in: https://github.com/lxml/lxml/pull/66

I tried to document in the change log all the public methods/functions that are affected. Apparently some stuff in iterparse.pxi also uses _MultiTagMatcher, but I’m not sure how that translates into the public API.

--
Simon Sapin
On 06/08/12 12:15 AM, Simon Sapin wrote:
On 06/08/2012 08:19, Jack Bates wrote:
I just tried "sel = CSSSelector('*|Content')" with the latest lxml release (2.3.5) and got the exception at the bottom of this message
Hi Jack,
The parser in lxml.cssselect 2.3 and earlier is broken in many interesting ways. I recently took over maintenance of the project, made it a separate PyPI package and fixed a lot of stuff:
http://packages.python.org/cssselect/
In lxml 3.0, lxml.cssselect becomes a thin wrapper around the new cssselect. You can either use the alpha from last week, or use lxml 2.3 with a more verbose syntax:
import cssselect
import lxml.etree

translator = cssselect.GenericTranslator()  # or HTMLTranslator()
sel = lxml.etree.XPath(translator.css_to_xpath('*|Content'))
Now, in the new cssselect the *syntax* for your selector is supported, but I’m not sure that the XPath translation is correct. Any suggestion in this area is welcome.
Related issue: https://github.com/SimonSapin/cssselect/issues/9
As a side note, the "translating to XPath" approach to CSS selectors seems easy at first but when trying to be spec-compliant you quickly find corner cases that are not obvious at all. In fact there are valid selectors for which I’m not sure that a correct XPath translation even exists:
https://github.com/SimonSapin/cssselect/issues/12
I started playing with a different approach: straightforward Python code to compile selectors into (element -> bool) callables that use the lxml API. It’s currently slower than cssselect (which profits from libxml2’s optimized XPath engine) but much simpler and probably more correct. Also very much work-in-progress: I literally typed 'git init' yesterday. (Suggestions welcome on this one too!)
Thanks very much for explaining the state of affairs and for all the work I see you doing on cssselect and lselect. cssselect is very handy, thanks for your improvements. I look forward to trying lselect.
On 10/08/2012 12:34, Jack Bates wrote:
I look forward to trying lselect
It’s not quite ready yet, but I plan to make a 0.1 release Soon® with an API, tests and documentation. At that point it should be correct but maybe not fast. The next step will be to make it fast :)

--
Simon Sapin