intolerant HTML parser

Mon Feb 8 12:12:46 EST 2010

Stefan Behnel wrote:

> I don't read it that way. There's a huge difference between
>
> - generating HTML manually and validating (some of) it in a unit test
>
> and
>
> - generating HTML using a tool that guarantees correct HTML output
>
> the advantage of the second approach being that others have already done
> all the debugging for you.

Anyone TDDing around HTML or XML should use or fork my assert_xml()
(from django-test-extensions).

The current version trivially detects a leading <html> tag and uses
etree.HTML(xml); else it goes with the stricter etree.XML(xml). The
former will not complain about the provided sample HTML.

Sadly, the industry has such a legacy of HTML written in Notepad that
well-formed (X)HTML will never be well-formed XML. My own action item
here is to apply Stefan's parser_options suggestion to make the
etree.HTML() stricter.

However, a generator is free to produce arbitrarily restricted XML
that avoids the problems with XHTML. It could, for example, push any
Javascript that even dreams of using & instead of & out into .js
files.

So an assert_xml() hot-wired to process only XML - with the true HTML
doctype - is still useful to TDD generated code, because its XPath
reference will detect that you get the nodes you expect.

--
  Phlip
  http://c2.com/cgi/wiki?ZeekLand