Mailman 3 [lxml-dev] Re: Tag name validation and HTML - lxml - The Python XML Toolkit

5 Oct 2007


Hi,
...
> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.
You'd get errors when parsing such stuff with the XML parser:
>>> from lxml import etree
>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5
but not with the HTML parser:
>>> etree.HTML
<built-in function HTML>
>>> etree.HTML("""<o:p>foo</o:p>""")
<Element html at 2c8030>
So the parsers already distinguish between HTML and XML, but the API does not, e.g. when creating elements.
For my use case, I must *rely* on the API producing valid XML, so making things more liberal could break my system. That's because I need to pickle (i.e. serialize) tree content and reparse it somewhere else. If the API lets me produce invalid XML, some data receiver will choke on my data.
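
To illustrate the failure mode I mean, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in, not lxml, because that module accepts a prefixed tag name without a namespace declaration; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# The standard library's ElementTree does not reject this tag name,
# so a prefixed name with no namespace declaration gets through.
elem = ET.Element("o:p")
elem.text = "foo"

# Serializing succeeds and writes the tag name out verbatim.
payload = ET.tostring(elem)

# The receiver reparses the payload and fails on the undeclared prefix.
try:
    ET.fromstring(payload)
    reparse_ok = True
except ET.ParseError:
    reparse_ok = False
```

The sender never sees an error; only the receiver does, which is exactly what I need the API to prevent.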
...
> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
> import lxml.html
> lxml.html.Element("o:p")
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks. I guess there might be serious
> practical difficulties with that exact solution, but I think the general
> idea of being able to flag an element as following HTML rules or XML
> rules would be more user-friendly than having a set of rules that
> matches neither the XML nor the HTML model correctly.
Sounds better to me than introducing some mixed set of rules. And I don't even think that it's difficult to implement, though it might mean introducing another public factory or some sort of switch on Element().
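
A rough sketch of what such a switch could look like, assuming a hypothetical make_element() factory with an html flag. Neither the name nor the flag is lxml API, both name patterns are deliberately simplified approximations, and the element is built with the standard library just to keep the sketch self-contained.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, ASCII-only approximation of an XML name without a colon
# (the real XML name production allows many more characters).
_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

# Liberal rule for the hypothetical HTML mode: any non-empty name with
# no whitespace and none of the characters < > / " ' &.
_HTML_NAME = re.compile(r"""[^\s<>/"'&]+""")


def make_element(tag, html=False):
    """Hypothetical factory: check `tag` by XML or HTML rules, then build."""
    pattern = _HTML_NAME if html else _XML_NAME
    if not pattern.fullmatch(tag):
        mode = "HTML" if html else "XML"
        raise ValueError(f"Invalid {mode} tag name {tag!r}")
    return ET.Element(tag)
```

With this, make_element("o:p", html=True) builds the element, while make_element("o:p") raises ValueError, so XML users keep the strict check and HTML users opt in to the liberal one.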

Holger


jholg＠gmx.de