XPath: Allowed characters in element names
Hi,

Is there something in the lxml API that can tell me whether a particular Unicode string is a valid element/attribute name or prefix in an XPath expression? Obviously some characters like U+005B '[' have a special meaning in XPath, but others like U+00D7 MULTIPLICATION SIGN are not allowed for no apparent reason.

Trying to make an Element object with that name and catching the ValueError is almost what I want, except that Element also accepts the {namespace_URI}local_name syntax.

Following links from the XPath spec, I find that all of these are NCName tokens:

http://www.w3.org/TR/REC-xml-names/#NT-NCName

So I guess I will encode this grammar as a regexp. It should work but it won’t be pretty, with 17 or so character ranges.

Questions:

1. Does the set of names allowed by the implementation match the grammar linked above? (I guess I should check the libxml2 source for this.)
2. Is there a better way than an ugly regexp?
3. When I get an invalid name, I can replace e.g. 'foo' with something like '*[name() = "foo"]'. But if elements can never have this name anyway, is '*[0]' (an expression that never matches) equivalent?

Context: I want to make sure that XPath expressions built by cssselect are valid and correct. With the recently fixed backslash escapes, it is really easy to find a selector that currently translates to an invalid XPath expression, or even to a valid but incorrect one. (Code injection in XPath is probably not as big a security problem as it can be in SQL, but still.)

Thanks,
--
Simon Sapin
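
For reference, the regexp approach mentioned above could look roughly like the sketch below. It assumes the NameStartChar/NameChar ranges from XML 1.0 (Fifth Edition) with the colon removed, which is how NCName is defined there; libxml2 may follow the older Letter/Digit/CombiningChar productions, so the accepted set might not match exactly (that is question 1). The names NCNAME_RE and is_ncname are made up for this example and are not part of lxml or cssselect.

```python
import re

# NameStartChar from XML 1.0 (Fifth Edition), minus ':' (NCName forbids colons).
_NAME_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"  # skips U+00D7 and U+00F7
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# NameChar adds '-', '.', digits, U+00B7 and two more ranges.
_NAME_CHAR = _NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Hypothetical helper names, for illustration only.
NCNAME_RE = re.compile("[%s][%s]*" % (_NAME_START, _NAME_CHAR))


def is_ncname(name):
    """Return True if the whole string matches the NCName grammar above."""
    return NCNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["foo", "a-b.c1", "\u00D7", "a[b", "ns:tag", "1a"]:
        print(repr(candidate), is_ncname(candidate))
```

This needs Python 3.4 or later for re.fullmatch. A name that fails the check would then take the '*[name() = "..."]' fallback from question 3.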