[lxml-dev] Re: Tag name validation and HTML
Hi,
The : thing is difficult because HTML UAs are expected to deal with : in the tag name and there is content in the wild that depends on this being accepted; MS Office produces "HTML" containing tags like <o:p>, for example. Since I, and I guess others too, want to use lxml to process random content that may have colons in the tag names, hard failure for this case is a problem. To make matters worse it is possible that the HTML spec will change in the future to introduce some sort of namespacing feature which may or may not use colons.
You'd get errors when parsing such stuff with the XML parser:
>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5
but not with the HTML parser:
>>> etree.HTML
<built-in function HTML>
>>> etree.HTML("""<o:p>foo</o:p>""")
<Element html at 2c8030>
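The two behaviours can be put side by side in a short script. This is only a sketch: it assumes a reasonably recent lxml, and the exact error message and the shape of the resulting HTML tree depend on the underlying libxml2 version.

```python
from lxml import etree

snippet = "<o:p>foo</o:p>"

# The XML parser is namespace-aware, so the undeclared prefix "o" is a hard error.
try:
    etree.fromstring(snippet)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# The HTML parser is forgiving: it returns an <html> root instead of failing.
html_root = etree.HTML(snippet)

print("XML parser: ", xml_error)
print("HTML parser:", html_root.tag)
```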
So there is a distinction between HTML and XML at the parser level, but not at the API level, e.g. when creating elements. For my use case, I must *rely* on producing valid XML through the API, so making things more liberal could break my system. That's because I need to pickle (i.e. serialize) tree content and reparse it somewhere else. If the API allows me to produce invalid XML, some data receiver will choke on my data.
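To make the round-trip requirement concrete, here is a sketch of what the strict API check buys. It assumes current lxml behaviour, where etree.Element() rejects a colon in the tag name with a ValueError. The namespace URI is only an illustrative choice for the "o" prefix.

```python
from lxml import etree

# A raw colon in the tag name is refused by the element factory...
try:
    etree.Element("o:p")
    rejected = False
except ValueError:
    rejected = True

# ...while a namespace-qualified name yields a prefixed tag that is well-formed.
OFFICE_NS = "urn:schemas-microsoft-com:office:office"  # illustrative URI
para = etree.Element("{%s}p" % OFFICE_NS, nsmap={"o": OFFICE_NS})
para.text = "foo"

# Serialize, then reparse on the "receiving" side with the strict XML parser.
data = etree.tostring(para)
received = etree.fromstring(data)

print(rejected, data, received.tag == para.tag)
```

Because the prefix is declared on the way out, the receiver can parse the data with the plain XML parser and gets the same qualified tag back.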
Given all of this I would prefer it if it were possible to have an HTML-specific mode with much more liberal rules than the XML mode. This could then be adapted to support any namespacing features HTML grows in the future. For example, if one could do something like
import lxml.html
lxml.html.Element("o:p")
where lxml.html.Element would be just like lxml.etree.Element but without XML-specific validity checks. I guess there might be serious practical difficulties with that exact solution, but I think the general idea of being able to flag an element as following HTML rules or XML rules would be more user-friendly than having a set of rules that neither matches the XML nor the HTML model correctly.
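As a rough illustration of the "flag" idea, the two rule sets could sit behind a single switch. Everything in this sketch is hypothetical: check_tag() and its html argument are not part of lxml, the XML pattern is an ASCII-only approximation of a colon-free XML name, and the HTML rule is just one plausible liberal choice.

```python
import re

# ASCII-only approximation of an XML name without a colon (the real rule allows more).
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*\Z")
# Characters that could not survive serialization as part of a tag name.
_HTML_FORBIDDEN = re.compile(r"[\s<>/&\"']")


def check_tag(name, html=False):
    """Return name if it is acceptable under the chosen rules, else raise ValueError."""
    if html:
        ok = bool(name) and _HTML_FORBIDDEN.search(name) is None
    else:
        ok = _XML_NAME.match(name) is not None
    if not ok:
        mode = "HTML" if html else "XML"
        raise ValueError("Invalid %s tag name %r" % (mode, name))
    return name


print(check_tag("o:p", html=True))  # accepted under the liberal HTML rules
```

Under the default XML rules the same call raises ValueError, so code that relies on strict checking is unaffected by the liberal mode.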
Sounds better to me than introducing some mixed set of rules. And I don't even think that it's difficult to implement, though it might mean introducing another public factory or some sort of switch on Element().
Holger
jholg@gmx.de