BeautifulSoup vs. real-world HTML comments
pavlovevidence at gmail.com
Wed Apr 4 22:17:13 CEST 2007
On Apr 4, 2:43 pm, Robert Kern <robert.k... at gmail.com> wrote:
> Carl Banks wrote:
> > On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
> >> BeautifulSoup can't parse this page usefully at all.
> >> It treats the entire page as a text chunk. It's actually
> >> HTMLParser that parses comments, so this is really an HTMLParser
> >> level problem.
> > Google for a program called "tidy". Install it, and run it as a
> > filter on any HTML you download. "tidy" has invested in it quite a
> > bit of work understanding common bad HTML and how browsers deal with
> > it. It would be pointless to duplicate that work in the Python
> > standard library; let HTMLParser be small and tight, and outsource the
> > handling of floozy input to a dedicated program.
> Well, BeautifulSoup is just such a dedicated library.
No, not really.
> However, it defers its
> handling of comments to HTMLParser. That's the problem.
Well, it's up to the writers of Beautiful Soup to decide how much bad
HTML they want to accept. ISTM they're happy to live with the
limitations of HTMLParser, meaning that they do not consider Beautiful
Soup to be a library dedicated to reading every piece of bad HTML out
there.
(Though it's not like I read their mailing list. Maybe they aren't
happy with HTMLParser.)
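
For what it's worth, the "run tidy as a filter" suggestion quoted above
could look roughly like the sketch below. It is only a sketch under
assumptions, not anything tidy or Beautiful Soup ships: tidy_filter is a
made-up helper name, it uses the modern subprocess.run call rather than
anything available in 2007, it assumes a "tidy" executable is on PATH,
and the flags (-q, -utf8, -asxhtml, --force-output, --show-warnings) are
HTML Tidy options as I understand them, so check them against your
installed version. If the program is missing or produces nothing, the
input is returned unchanged.

```python
import shutil
import subprocess

# Assumed default command line for HTML Tidy; adjust for your installation.
TIDY_CMD = ("tidy", "-q", "-utf8", "-asxhtml",
            "--force-output", "yes", "--show-warnings", "no")


def tidy_filter(html, cmd=TIDY_CMD):
    """Pipe an HTML string through an external cleanup program.

    Returns the program's output, or the original string if the program
    is not installed or wrote nothing to stdout.
    """
    if shutil.which(cmd[0]) is None:
        # No such executable: pass the markup through untouched.
        return html

    proc = subprocess.run(
        list(cmd),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy signals warnings and errors with a nonzero exit status, so the
    # status is not treated as fatal here; only empty output is.
    if not proc.stdout:
        return html
    return proc.stdout.decode("utf-8", errors="replace")
```

The cleaned string would then be handed to whatever parser you use
(Beautiful Soup, HTMLParser, ...) instead of the raw download. Passing
the command as a parameter keeps the helper usable with a different
cleanup program.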