Stefan Behnel wrote:
James Graham wrote:
The development branch of lxml 2 appears to restrict the characters that may appear in a tag name. Whilst this may be appropriate for XML, it does not match the behavior of all common HTML UAs and, as such, does not match the current draft of the HTML 5 spec [1].
This is actually not as simple as it might seem. The Element factory cannot distinguish between XML and HTML tags, so it cannot switch off validation for a particular tag. The conservative solution would therefore be to follow the HTML5 spec, since what it allows in tag names is a superset of what the XML spec allows, and an extremely broad one at that. But then there is not much left that you could honestly call validation. Also, I would still want to restrict ":" in tag names, as this has been a source of problems way too often. So that would just leave spaces and any of ":/>" as invalid characters in tag names.
BTW, the spec you reference is actually a parser spec. Obviously, allowing "<" or "&" at the API level isn't a good idea either, so we end up defining our own way of validating tag names that would be somewhere between the XML spec and the HTML spec. And it would still allow you to write broken XML without noticing...
This patch might make for a good starter. Comments appreciated.

Stefan

Index: src/lxml/apihelpers.pxi
===================================================================
--- src/lxml/apihelpers.pxi (Revision 46892)
+++ src/lxml/apihelpers.pxi (Arbeitskopie)
@@ -791,7 +791,23 @@
     return _xmlNameIsValid(_cstr(name_utf8))
 
 cdef int _xmlNameIsValid(char* c_name):
-    return tree.xmlValidateNCName(c_name, 0) == 0
+    #return tree.xmlValidateNCName(c_name, 0) == 0
+    if c_name is NULL or c_name[0] == c'\0':
+        return 0
+    while c_name[0] != c'\0':
+        if c_name[0] == c':' or \
+           c_name[0] == c'&' or \
+           c_name[0] == c'<' or \
+           c_name[0] == c'>' or \
+           c_name[0] == c'/' or \
+           c_name[0] == c'\x09' or \
+           c_name[0] == c'\x0A' or \
+           c_name[0] == c'\x0B' or \
+           c_name[0] == c'\x0C' or \
+           c_name[0] == c'\x20':
+            return 0
+        c_name = c_name + 1
+    return 1
 
 cdef int _tagValidOrRaise(tag_utf) except -1:
     if not _pyXmlNameIsValid(tag_utf):
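
To try out the proposed rule without rebuilding lxml, the check in the patch can be mirrored in plain Python. This is only a minimal sketch: the names `INVALID_TAG_CHARS` and `tag_name_is_valid` are hypothetical and not part of lxml, and the function works on `str` rather than on the UTF-8 `char*` the patch receives. Since every rejected character is ASCII, the result should be the same for both. Like the patch, it does not reject carriage return (`\x0D`).

```python
# Hypothetical pure-Python mirror of the patched _xmlNameIsValid();
# the names below are illustrative and do not exist in lxml.
INVALID_TAG_CHARS = frozenset(":&<>/\x09\x0a\x0b\x0c\x20")


def tag_name_is_valid(name: str) -> bool:
    """Return True if name is non-empty and has no rejected character."""
    if not name:
        # The patch returns 0 for NULL and for the empty string.
        return False
    # Scan every character, as the while loop in the patch does.
    return not any(ch in INVALID_TAG_CHARS for ch in name)


if __name__ == "__main__":
    # "1abc" and "a=b" are not valid XML names, yet they pass this check.
    for candidate in ["div", "foo:bar", "a b", "1abc", "a=b", ""]:
        print(repr(candidate), tag_name_is_valid(candidate))
```

The last two non-empty candidates illustrate the caveat raised above: a name such as "1abc" or "a=b" passes the relaxed check although it is not a valid XML name, so broken XML can still be produced without any error being raised.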