Hi,
> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.
You'd get errors when parsing such stuff with the XML parser:
>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5
but not with the HTML parser:
>>> etree.HTML
<built-in function HTML>
>>> etree.HTML("""<o:p>foo</o:p>""")
<Element html at 2c8030>
>>>
So the parsers do distinguish between HTML and XML, but the API does not, e.g. when creating elements.
For my use case, I must *rely* on the API producing valid XML, so making things more liberal could break my system. That's because I need to pickle (i.e. serialize) tree content and reparse it somewhere else. If the API allows me to produce invalid XML, some data receiver will choke on my data.
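To illustrate the round-trip concern with a stand-in: the standard library's xml.etree.ElementTree (not lxml) does not check tag names when an element is created, so it can play the role of a "liberal" API here. This is only a sketch; the helper receiver_accepts() is a made-up name for the receiving side, which is assumed to use a namespace-aware XML parser.

```python
import xml.etree.ElementTree as ET

# Stand-in for a liberal API: stdlib ElementTree does not validate tag names.
el = ET.Element("o:p")
el.text = "foo"

# Serialization succeeds and writes the undeclared prefix out verbatim.
data = ET.tostring(el)


def receiver_accepts(payload):
    """Hypothetical receiver: reparse the payload with an XML parser."""
    try:
        ET.fromstring(payload)
    except ET.ParseError:
        # The prefix "o" was never declared, so the parser rejects the document.
        return False
    return True


print(data)
print(receiver_accepts(data))
```

The sender never sees an error; the failure only shows up on the receiving side, which is exactly what a strict check at element creation time prevents.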
> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
>
> import lxml.html
> lxml.html.Element("o:p")
>
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks. I guess there might be serious
> practical difficulties with that exact solution, but I think the general
> idea of being able to flag an element as following HTML rules or XML
> rules would be more user-friendly than having a set of rules that
> neither matches the XML nor the HTML model correctly.
Sounds better to me than introducing some mixed set of rules. And I don't even think it would be difficult to implement, though it might mean introducing another public factory or some sort of switch on Element().
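A rough sketch of what such a switch could look like, written against the standard library's ElementTree rather than lxml. The name make_element(), the html flag, and both regular expressions are assumptions made for illustration: the XML pattern is a simplified ASCII-only approximation of a colon-free name, and the HTML rule just rejects a few obviously unusable characters.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of an XML name without a colon.
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")
# Characters assumed to be unusable in a tag name even under liberal HTML rules.
_HTML_FORBIDDEN = re.compile(r"[\s<>/&\"']")


def make_element(tag, html=False):
    """Hypothetical factory: strict XML check by default, liberal if html=True."""
    if html:
        valid = bool(tag) and not _HTML_FORBIDDEN.search(tag)
    else:
        valid = _XML_NAME.match(tag) is not None
    if not valid:
        raise ValueError("Invalid tag name %r" % tag)
    return ET.Element(tag)


# HTML mode accepts the MS Office style tag, XML mode refuses it.
office_p = make_element("o:p", html=True)
try:
    make_element("o:p")
    xml_mode_rejected = False
except ValueError:
    xml_mode_rejected = True
```

With the flag defaulting to the strict rules, existing callers that depend on valid XML output keep their guarantee, and only code that explicitly opts into HTML mode gets the liberal behaviour.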
Holger