[Python-Dev] Integrate BeautifulSoup into stdlib?
tonynelson at georgeanelson.com
Wed Mar 4 18:13:04 CET 2009
At 2:56 PM +0000 3/4/09, Chris Withers wrote:
>Vaibhav Mallya wrote:
>> We do have HTMLParser, but that doesn't handle malformed pages well, and
>> just isn't as nice as BeautifulSoup.
>Interesting, given that BeautifulSoup is built on HTMLParser ;-)
In BeautifulSoup >= 3.1, yes. Before that (<= 3.0.7a), it was based on the
more robust sgmllib.SGMLParser. The current BeautifulSoup can't handle
'<foo a="bc"b="cd">' (no space between the attributes), while the earlier
SGMLParser-based versions can. I don't know how common that missing space
is in the wild, but I've personally produced HTML with that problem. Maybe
this is the only problem with using HTMLParser instead of SGMLParser; I
don't know. In the meantime, if I need BeautifulSoup in Python 3.x, I'll
port sgmllib and use the older BeautifulSoup.
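
For anyone who wants to check the missing-space case on their own
interpreter, here is a minimal sketch. It assumes the Python 3 module
html.parser (the renamed HTMLParser); StartTagCollector and
collect_start_tags are hypothetical names used only for illustration. What
the parser reports for the malformed tag depends on the Python version, so
no particular output is claimed for it here.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag and its attributes as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        self.start_tags.append((tag, attrs))


def collect_start_tags(markup):
    """Hypothetical helper: parse markup and return the start tags seen."""
    parser = StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    # Malformed: no whitespace between the two attributes.
    print(collect_start_tags('<foo a="bc"b="cd">'))
    # Well-formed control case for comparison.
    print(collect_start_tags('<foo a="bc" b="cd">'))
```

Comparing the two printed lists shows whether the parser on a given
Python version recovers both attributes from the malformed tag.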
TonyN.:' <mailto:tonynelson at georgeanelson.com>