Re: [Twisted-Python] Re: Contributing?

Aug. 26, 2004

      On Aug 26, 2004, at 2:15 PM, Nicola Larosa wrote:
...
There are a variety of other Python HTML parsers, but from what I can
tell, they're even worse than microdom is. It'd be way cool to have a
python HTML parser that actually works.
People say nice things about Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/
Unfortunately, it's trying to solve a completely different problem. It
is not trying to build a tree of the entire document, but rather to do
something like "give me all the hrefs on the page". As such, it doesn't
even *try* to parse HTML properly; it just knows enough to be able to
ignore the parts of the page you aren't asking for.
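
To make the "give me all the hrefs" style of processing concrete, here is
a minimal sketch using only the standard library's html.parser. It is not
Beautiful Soup's actual implementation; the names HrefCollector and
collect_hrefs are made up for illustration. The point is that it never
builds a tree and simply ignores everything that isn't an <a> tag:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href attributes without building a document tree."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags matter here; every other tag is ignored,
        # so the nesting of the surrounding markup is never checked.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_hrefs(html):
    """Return the href values found in html, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

Because nothing tracks which elements are open, unclosed or misnested
tags around the links make no difference to the result.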

Its intro says:
...
A well-formed HTML document will yield a well-formed data
structure. An ill-formed HTML document will yield a correspondingly
ill-formed data structure. If your document is only locally
well-formed, you can use this to process the well-formed part of it.
However, that is not entirely accurate, unless "well-formed" doesn't
mean "follows the HTML4 standard". It doesn't parse
"<table><tr><td>foo<tr><td>bar</table>" correctly -- a perfectly valid
bit of HTML4, since the closing tags for tr and td are optional there.
Microdom's goal is to yield a well-formed data structure from a
well-formed HTML document, and from most ill-formed HTML documents too.
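
For illustration, here is a rough sketch of what handling those optional
end tags involves. It is not microdom's algorithm, just one way to do it
on top of the standard library's html.parser. The IMPLIED_CLOSE table and
the Node, TreeBuilder, parse and serialize names are hypothetical, and
only the table rows and cells from the example above are covered (the
tbody that HTML4 also implies is left out):

```python
from html.parser import HTMLParser

# Hypothetical, partial table: when the key tag opens, any open element
# named in the value is closed first. Only table rows/cells are covered.
IMPLIED_CLOSE = {
    "tr": {"tr", "td", "th"},
    "td": {"td", "th"},
    "th": {"td", "th"},
}
# Elements that stop the search, so a nested table leaves outer rows open.
SCOPE = {"table"}
# A few elements that never have content or an end tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []  # Node instances or plain strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        # Find the outermost open element this start tag implicitly
        # closes, looking no further out than the nearest table.
        closes = IMPLIED_CLOSE.get(tag, ())
        cut = None
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag in SCOPE:
                break
            if self.stack[i].tag in closes:
                cut = i
        if cut is not None:
            del self.stack[cut:]
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost matching element and everything still open
        # inside it; an end tag with no open match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def serialize(node):
    """Write the tree back out with every end tag made explicit."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#document":
        return inner
    return "<%s>%s</%s>" % (node.tag, inner, node.tag)
```

Feeding the example through parse() and then serialize() gives two
sibling rows with one cell each, rather than the second row nested
inside the first cell.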

James

James Y Knight