BeautifulSoup vs. real-world HTML comments
robert.kern at gmail.com
Thu Apr 5 00:12:49 CEST 2007
Carl Banks wrote:
> On Apr 4, 4:55 pm, Robert Kern <robert.k... at gmail.com> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:43 pm, Robert Kern <robert.k... at gmail.com> wrote:
>>>> Carl Banks wrote:
>>>>> On Apr 4, 2:08 pm, John Nagle <n... at animats.com> wrote:
>>>>>> BeautifulSoup can't parse this page usefully at all.
>>>>>> It treats the entire page as a text chunk. It's actually
>>>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>>>> level problem.
>>>>> Google for a program called "tidy". Install it, and run it as a
>>>>> filter on any HTML you download. "tidy" has invested in it quite a
>>>>> bit of work understanding common bad HTML and how browsers deal with
>>>>> it. It would be pointless to duplicate that work in the Python
>>>>> standard library; let HTMLParser be small and tight, and outsource the
>>>>> handling of floozy input to a dedicated program.
>>>> Well, BeautifulSoup is just such a dedicated library.
>>> No, not really.
>> Yes, it is. Whether it succeeds in all particulars is beside the point. The
>> only mission of BeautifulSoup is to handle bad HTML.
> I think the authors of BeautifulSoup have the right to decide what
> their own mission is.
Yes, and the author has stated it pretty clearly:
"""You didn't write that awful page. You're just trying to get some data out of
it. Right now, you don't really care what HTML is supposed to look like.
Neither does this parser."""
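
For anyone who wants to check the claim quoted above that comment handling lives at the HTMLParser level, here is a minimal sketch. It assumes a current Python 3, where the module is html.parser rather than the 2007-era HTMLParser module; CommentCollector and split_comments are hypothetical names made up for the illustration, not part of BeautifulSoup or the standard library.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records comments and text data separately."""

    def __init__(self):
        super().__init__()
        self.comments = []  # bodies of <!-- ... --> comments
        self.text = []      # runs of character data between tags

    def handle_comment(self, data):
        # Called by HTMLParser once for each comment it recognizes.
        self.comments.append(data)

    def handle_data(self, data):
        # Called by HTMLParser for text that is not markup.
        self.text.append(data)


def split_comments(html):
    """Feed html to the parser and return (comments, text)."""
    parser = CommentCollector()
    parser.feed(html)
    parser.close()
    return parser.comments, parser.text
```

Feeding a problem page to split_comments and looking at what ends up in each list is a quick way to see whether a malformed comment is swallowing the rest of the document as text before BeautifulSoup ever gets involved.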
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco