[lxml-dev] Tag name validation and HTML
The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1]. This is an issue for html5lib [2] as we are keen to keep support for building lxml trees from HTML input, something which is currently possible with lxml 1.3.

In an only tangentially related question, is there a recommended way of creating a custom tag type, preferably using the same code for ElementTree and lxml.etree? In particular html5lib needs to create a notional document root element whilst parsing. So far, we have been using an ordinary Element with a .tag that cannot be produced by parsing any input, e.g. root.tag="<DOCUMENT_ROOT>", but this doesn't feel very elegant.

[1] http://www.whatwg.org/specs/web-apps/current-work/#tag-name0
[2] http://code.google.com/p/html5lib/

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
-- Tom Stoppard, Rosencrantz and Guildenstern are Dead
James Graham wrote:
Is there a recommended way of creating a custom tag type, preferably using the same code for ElementTree and lxml.etree?
Both lxml.etree and ElementTree have support for (something like) this, but not in the same way.

In ET, you can pass an "element_factory" argument to the TreeBuilder.

http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.Ele...

In lxml.etree, you can define an Element-Lookup for a parser.

http://codespeak.net/lxml/element_classes.html

As both approaches work at the parser level, it should be possible (though not too easy) to write some glue code that sets up a parser for either library, and then use the parser in the rest of the code without modification.

Note that in lxml.etree, the decision about which element class to use for a given node is not taken inside the parser, but at element access time. Hence the different approaches (and the extensive support in lxml).
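As a rough illustration of the glue code mentioned above, here is a minimal sketch. The class and function names (ETCustomElement, LxmlCustomElement, make_et_parser, make_lxml_parser, parse_with) are made up for this example, and it assumes that the plain default class lookup is enough on the lxml side. lxml is treated as optional, so the ElementTree half runs on its own.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as lxml_etree  # optional dependency
except ImportError:
    lxml_etree = None


class ETCustomElement(ET.Element):
    """Hypothetical Element subclass for the ElementTree side."""


def make_et_parser():
    # ElementTree: the TreeBuilder calls element_factory(tag, attrs)
    # for every element it creates.
    def factory(tag, attrs):
        return ETCustomElement(tag, attrs or {})
    return ET.XMLParser(target=ET.TreeBuilder(element_factory=factory))


def make_lxml_parser():
    # lxml.etree: the class is chosen by a lookup object set on the parser.
    if lxml_etree is None:
        raise RuntimeError("lxml is not installed")

    class LxmlCustomElement(lxml_etree.ElementBase):
        """Hypothetical Element subclass for the lxml side."""

    parser = lxml_etree.XMLParser()
    parser.set_element_class_lookup(
        lxml_etree.ElementDefaultClassLookup(element=LxmlCustomElement))
    return parser


def parse_with(parser, markup):
    # Both parser types offer the feed()/close() interface.
    parser.feed(markup)
    return parser.close()


root = parse_with(make_et_parser(), "<a><b/></a>")
print(type(root).__name__, type(root[0]).__name__)
```

Only the two make_*_parser() functions know which library is in use; the rest of the code would call parse_with() with whichever parser it was given.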
In particular html5lib needs to create a notional document root element whilst parsing.
This is a pretty specific problem. You can solve it in lxml.etree in two ways.

If the root node has a specific name, you can use the CustomElementClassLookup scheme (so this won't work if you can't control the name of the root node).

http://codespeak.net/lxml/element_classes.html#custom-element-class-lookup

If the only way to decide about the class is to check for a parent, you can use the tree based lookup and check "getparent()" for None.

http://codespeak.net/lxml/element_classes.html#tree-based-element-class-look...

I don't think ET can take this decision at all from the element_factory above, but then, you can always replace the root Element /after/ parsing, so I don't think you would even need that machinery here.
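To make the last two options concrete, here is a minimal sketch. All names in it (ETDocumentRoot, promote_root, LxmlDocumentRoot, RootLookup, make_lxml_root_parser) are invented for the example. The ElementTree half swaps the root for a subclass instance after parsing; the lxml half uses the tree based lookup and checks getparent() for None. lxml is treated as optional.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as lxml_etree  # optional dependency
except ImportError:
    lxml_etree = None


class ETDocumentRoot(ET.Element):
    """Hypothetical marker class for the document root (ElementTree)."""


def promote_root(root):
    # ElementTree: build a replacement root after parsing and move the
    # attributes, text and children over to it.
    new_root = ETDocumentRoot(root.tag, dict(root.attrib))
    new_root.text = root.text
    new_root.tail = root.tail
    new_root.extend(list(root))
    return new_root


def make_lxml_root_parser():
    # lxml.etree: tree based lookup; only the parentless node gets the
    # special class, everything else falls back to the default.
    if lxml_etree is None:
        raise RuntimeError("lxml is not installed")

    class LxmlDocumentRoot(lxml_etree.ElementBase):
        """Hypothetical marker class for the document root (lxml)."""

    class RootLookup(lxml_etree.PythonElementClassLookup):
        def lookup(self, document, element):
            if element.getparent() is None:
                return LxmlDocumentRoot
            return None  # defer to the fallback lookup

    parser = lxml_etree.XMLParser()
    parser.set_element_class_lookup(RootLookup())
    return parser


parsed = ET.fromstring('<html lang="en"><body/></html>')
doc_root = promote_root(parsed)
print(type(doc_root).__name__, doc_root.tag, len(doc_root))
```

Note that the ElementTree variant does not touch the tag name, so the serialised document stays the same.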
So far, we have been using an ordinary Element with a .tag that cannot be produced by parsing any input e.g. root.tag="<DOCUMENT_ROOT>" but this doesn't feel very elegant.
Hmmm, but this changes the document, right? Could you explain a little what that node is supposed to do differently from normal nodes? In particular, why can't a tree wrapper do what you want?

Stefan
James Graham wrote:
The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1].
This is actually not as simple as it might seem. The Element factory cannot distinguish between XML and HTML tags, so it cannot switch off validation for a particular tag. So the conservative solution would be to actually follow the HTML5 spec, as it is a superset of the XML spec, an extremely broad one even. But then there's not much left that you could honestly call validation. Also, I would still want to restrict ":" in tag names, as this has been a source of problems way too often. So that would just leave spaces and any of ":/>" as invalid characters in tag names.

BTW, the spec you reference is actually a parser spec. Obviously, allowing "<" or "&" at the API level isn't a good idea either, so we end up defining our own way of validating tag names that would be somewhere between the XML spec and the HTML spec. And it would still allow you to write broken XML without noticing...
This is an issue for html5lib [2] as we are keen to keep support for building lxml trees from HTML input, something which is currently possible with lxml 1.3.
Extensive support for HTML is definitely a goal of lxml, so if the current behaviour breaks the HTML spec, it must change. But I'll have to see how. Any comments appreciated.

Stefan
Stefan Behnel wrote:
James Graham wrote:
The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1].
This is actually not as simple as it might seem. The Element factory cannot distinguish between XML and HTML tags, so it cannot switch off validation for a particular tag. So the conservative solution would be to actually follow the HTML5 spec, as it is a superset of the XML spec, an extremely broad one even. But then there's not much left that you could honestly call validation. Also, I would still want to restrict ":" in tag names, as this has been a source of problems way too often. So that would just leave spaces and any of ":/>" as invalid characters in tag names.
BTW, the spec you reference is actually a parser spec. Obviously, allowing "<" or "&" at the API level isn't a good idea either, so we end up defining our own way of validating tag names that would be somewhere between the XML spec and the HTML spec. And it would still allow you to write broken XML without noticing...
This patch might make for a good starter. Comments appreciated.

Stefan

Index: src/lxml/apihelpers.pxi
===================================================================
--- src/lxml/apihelpers.pxi (Revision 46892)
+++ src/lxml/apihelpers.pxi (Arbeitskopie)
@@ -791,7 +791,23 @@
     return _xmlNameIsValid(_cstr(name_utf8))
 
 cdef int _xmlNameIsValid(char* c_name):
-    return tree.xmlValidateNCName(c_name, 0) == 0
+    #return tree.xmlValidateNCName(c_name, 0) == 0
+    if c_name is NULL or c_name[0] == c'\0':
+        return 0
+    while c_name[0] != c'\0':
+        if c_name[0] == c':' or \
+           c_name[0] == c'&' or \
+           c_name[0] == c'<' or \
+           c_name[0] == c'>' or \
+           c_name[0] == c'/' or \
+           c_name[0] == c'\x09' or \
+           c_name[0] == c'\x0A' or \
+           c_name[0] == c'\x0B' or \
+           c_name[0] == c'\x0C' or \
+           c_name[0] == c'\x20':
+            return 0
+        c_name = c_name + 1
+    return 1
 
 cdef int _tagValidOrRaise(tag_utf) except -1:
     if not _pyXmlNameIsValid(tag_utf):
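For readers who want to try the rule without building lxml, the check in the patch can be mirrored in plain Python. This is only a sketch of the same character test; the names FORBIDDEN_TAG_CHARS and tag_name_is_valid are made up here and are not part of lxml.

```python
# Characters rejected by the patch: ':', '&', '<', '>', '/', plus
# tab, line feed, vertical tab, form feed and space.
FORBIDDEN_TAG_CHARS = frozenset(":&<>/\x09\x0a\x0b\x0c\x20")


def tag_name_is_valid(name):
    """Return True if `name` is non-empty and has no forbidden character."""
    if not name:
        return False
    return not any(ch in FORBIDDEN_TAG_CHARS for ch in name)


for candidate in ["p", "foo-bar", "o:p", "a b", "<DOCUMENT_ROOT>", ""]:
    print(repr(candidate), tag_name_is_valid(candidate))
```

Under this rule "o:p" and "<DOCUMENT_ROOT>" are still rejected, while many names that are not valid XML names are accepted.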
Stefan Behnel wrote:
James Graham wrote:
The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1].
This is actually not as simple as it might seem. The Element factory cannot distinguish between XML and HTML tags, so it cannot switch off validation for a particular tag. So the conservative solution would be to actually follow the HTML5 spec, as it is a superset of the XML spec, an extremely broad one even. But then there's not much left that you could honestly call validation. Also, I would still want to restrict ":" in tag names, as this has been a source of problems way too often. So that would just leave spaces and any of ":/>" as invalid characters in tag names.
The : thing is difficult because HTML UAs are expected to deal with : in the tag name and there is content in the wild that depends on this being accepted; MS Office produces "HTML" containing tags like <o:p>, for example. Since I, and I guess others too, want to use lxml to process random content that may have colons in the tag names, hard failure for this case is a problem. To make matters worse it is possible that the HTML spec will change in the future to introduce some sort of namespacing feature which may or may not use colons.

Given all of this I would prefer it if it were possible to have an HTML-specific mode with much more liberal rules than the XML mode. This could then be adapted to support any namespacing features HTML grows in the future. For example, if one could do something like

    import lxml.html
    lxml.html.Element("o:p")

where lxml.html.Element would be just like lxml.etree.Element but without XML-specific validity checks. I guess there might be serious practical difficulties with that exact solution, but I think the general idea of being able to flag an element as following HTML rules or XML rules would be more user-friendly than having a set of rules that neither matches the XML nor the HTML model correctly.

--
"Mixed up signals Bullet train People snuffed out in the brutal rain"
--Conner Oberst
Hi,

James Graham wrote:
The : thing is difficult because HTML UAs are expected to deal with : in the tag name and there is content in the wild that depends on this being accepted; MS Office produces "HTML" containing tags like <o:p>, for example. Since I, and I guess others too, want to use lxml to process random content that may have colons in the tag names, hard failure for this case is a problem. To make matters worse it is possible that the HTML spec will change in the future to introduce some sort of namespacing feature which may or may not use colons.
Ok, so I understand that HTML tags must be treated differently from XML tags.
Given all of this I would prefer it if it were possible to have an HTML-specific mode with much more liberal rules than the XML mode. This could then be adapted to support any namespacing features HTML grows in the future. For example, if one could do something like
    import lxml.html
    lxml.html.Element("o:p")
where lxml.html.Element would be just like lxml.etree.Element but without XML-specific validity checks.
This absolutely makes sense to me. I'll have to look into the details of an implementation though, since tag name validation is currently done in lxml.etree.Element, which is simply reused by the Python-implemented lxml.html. So we'd have to provide some kind of Python-level API for this.

Stefan
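To illustrate the general idea of flagging an element as following HTML or XML rules, here is a small sketch built on the standard library's ElementTree rather than lxml. Everything in it is hypothetical: make_element and its html_mode flag are invented for the example, and the XML check is a deliberately simplified, ASCII-only approximation of an NCName, not the real XML name rules and not what lxml does.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only stand-in for the XML NCName rule (an assumption
# made for this sketch; real XML names allow many more characters).
_ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def make_element(tag, html_mode=False):
    """Create an Element, validating the tag name only in XML mode."""
    if not html_mode and _ASCII_NCNAME.match(tag) is None:
        raise ValueError("Invalid tag name %r" % (tag,))
    return ET.Element(tag)


office_para = make_element("o:p", html_mode=True)  # accepted in HTML mode
plain_para = make_element("p")                     # accepted in XML mode

try:
    make_element("o:p")                            # rejected in XML mode
    xml_mode_rejected = False
except ValueError:
    xml_mode_rejected = True

print(office_para.tag, plain_para.tag, xml_mode_rejected)
```

A per-call flag keeps both rule sets intact instead of blending them into one compromise rule; whether it would be exposed as a flag, a separate factory, or something else is left open here.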
participants (2)
- James Graham
- Stefan Behnel