
On Aug 26, 2004, at 2:15 PM, Nicola Larosa wrote:
There are a variety of other Python HTML parsers, but from what I can tell, they're even worse than microdom is. It'd be way cool to have a Python HTML parser that actually works.
People say nice things about Beautiful Soup:
Unfortunately, it's trying to solve a completely different problem. It is not hoping to make a tree of the entire document, but rather to do something like "give me all the hrefs on the page". As such, it doesn't even *try* to parse HTML properly; it just knows enough to be able to ignore the parts of the page you aren't asking for. Its intro says:
A well-formed HTML document will yield a well-formed data structure. An ill-formed HTML document will yield a correspondingly ill-formed data structure. If your document is only locally well-formed, you can use this to process the well-formed part of it.
However, that is not entirely accurate, unless "well-formed" doesn't mean "follows the HTML4 standard". It doesn't parse "<table><tr><td>foo<tr><td>bar</table>" correctly -- a perfectly valid bit of HTML4. Microdom's goal is to yield a well-formed data structure from a well-formed HTML document, and from most ill-formed HTML documents too.

James
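
To make the table case concrete, here is a minimal sketch of what parsing it "correctly" involves: HTML4 lets you omit the </td> and </tr> end tags, so a tree builder has to close those elements itself when it sees the next <tr> or <td>. This is not how microdom or Beautiful Soup is implemented; it is an illustration built on the standard library's html.parser, and the names Node, TreeBuilder, parse and IMPLIED_CLOSE are made up for the example. It covers only the table-cell subset of the optional-end-tag rules and does not insert the implied TBODY element.

```python
from html.parser import HTMLParser

# Hypothetical table for this sketch: for each open element, the start tags
# that implicitly close it. Only the row/cell subset of HTML4 is covered.
IMPLIED_CLOSE = {
    "tr": {"tr"},
    "td": {"td", "th", "tr"},
    "th": {"td", "th", "tr"},
}


class Node:
    """A bare-bones element node: a tag name and a list of children."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def to_tuple(self):
        # Convert to nested (tag, children) tuples for easy comparison.
        return (
            self.tag,
            [c.to_tuple() if isinstance(c, Node) else c for c in self.children],
        )


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Close every open element that this start tag implicitly ends.
        while len(self.stack) > 1 and tag in IMPLIED_CLOSE.get(self.stack[-1].tag, ()):
            self.stack.pop()
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, closing anything
        # left open inside it. A stray end tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root.to_tuple()
```

Calling parse("<table><tr><td>foo<tr><td>bar</table>") with this sketch gives a table containing two sibling rows, each with one cell, rather than a second row nested inside the first cell.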