lxml - The Python XML Toolkit

lxml@python.org

October 2007

[lxml-dev] Re: Tag name validation and HTML
by jholg＠gmx.de Oct. 5, 2007

Hi,

> The : thing is difficult because HTML UAs are expected to deal with : in
> the tag name and there is content in the wild that depends on this being
> accepted; MS Office produces "HTML" containing tags like <o:p>, for
> example. Since I, and I guess others too, want to use lxml to process
> random content that may have colons in the tag names, hard failure for
> this case is a problem. To make matters worse it is possible that the
> HTML spec will change in the future to introduce some sort of
> namespacing feature which may or may not use colons.

You'd get errors when parsing such stuff with the XML parser:

>>> etree.fromstring("""<o:p>foo</o:p>""")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "etree.pyx", line 2137, in etree.fromstring
  File "parser.pxi", line 1301, in etree._parseMemoryDocument
  File "parser.pxi", line 1207, in etree._parseDoc
  File "parser.pxi", line 782, in etree._BaseParser._parseDoc
  File "parser.pxi", line 444, in etree._ParserContext._handleParseResultDoc
  File "parser.pxi", line 523, in etree._handleParseResult
  File "parser.pxi", line 471, in etree._raiseParseError
etree.XMLSyntaxError: Namespace prefix o on p is not defined, line 1, column 5

but not with the HTML parser:

>>> etree.HTML
<built-in function HTML>
>>> etree.HTML("""<o:p>foo</o:p>""")
<Element html at 2c8030>
>>>

So here's a distinction between HTML and XML, but not API-wise, e.g. when
creating elements.

For my use case, I must *rely* on producing valid XML through the API, so
making things more liberal potentially breaks my system. That's because I
need to pickle (i.e. serialize) tree content and reparse somewhere else.
Now if I allow for producing invalid XML, some data receiver will choke on
my data.

> Given all of this I would prefer it if it were possible to have an
> HTML-specific mode with much more liberal rules than the XML mode. This
> could then be adapted to support any namespacing features HTML grows in
> the future. For example, if one could do something like
>
> import lxml.html
> lxml.html.Element("o:p")
>
> where lxml.html.Element would be just like lxml.etree.Element but
> without XML-specific validity checks. I guess there might be serious
> practical difficulties with that exact solution, but I think the general
> idea of being able to flag an element as following HTML rules or XML
> rules would be more user-friendly than having a set of rules that
> neither matches the XML nor the HTML model correctly.

Sounds better to me than introducing some mixed set of rules. And I don't
even think that it's difficult to implement, though it might mean
introducing another public factory or some sort of switch on Element().

Holger
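
The XML-versus-HTML distinction described in this message can be reproduced without lxml. The following is only a minimal sketch using Python's standard library as a stand-in: `xml.etree.ElementTree` plays the strict, namespace-aware XML parser and `html.parser` the tolerant HTML one. The helper names (`xml_accepts`, `html_tags`, `TagCollector`) and the namespace URI `urn:example:office` are made up for illustration, and lxml's own error types and messages differ from what the standard library reports.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag name reported by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_accepts(markup):
    """Return True if a namespace-aware XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Raised, for example, when a prefix such as "o" is never declared.
        return False
    return True


def html_tags(markup):
    """Return the start tag names an HTML tokenizer sees in the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


snippet = "<o:p>foo</o:p>"

# Undeclared prefix: rejected under XML rules.
xml_ok = xml_accepts(snippet)

# Same element with the prefix declared (hypothetical URI): accepted.
declared_ok = xml_accepts('<o:p xmlns:o="urn:example:office">foo</o:p>')

# HTML rules: the colon is just another character in the tag name.
tags = html_tags(snippet)

print(xml_ok, declared_ok, tags)
```

A receiver that reparses serialized output with an XML parser is in the position of `xml_accepts`, which is why a more liberal element factory is only safe if it is kept clearly separate from the XML one.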