walter at livinglogic.de
Thu Sep 23 21:27:23 CEST 2004
Richie Hindle wrote:
>>BeautifulSoup is perfect for this job:
> Um, BeautifulSoup may be perfect, but my script isn't. It fails with the
> Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
> And I don't know whether you'd consider it correct to extract only the bold
> text from the entries that have bold text. But it gives you a place to start.
Another option might be the HTML parser from libxml2 (www.xmlsoft.org):
>>> import libxml2
>>> doc = libxml2.htmlParseFile("http://www.python.org", None)
http://www.python.org:3: HTML parser error : htmlParseStartTag: invalid
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...
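If installing libxml2 isn't an option, the empty "<b></b>" case Richie mentions can also be handled with the standard library alone. What follows is only a rough sketch of that idea using html.parser; it is neither Richie's script nor the libxml2 approach above, and the names BoldTextExtractor and bold_texts are made up for illustration. It assumes the goal is one string per top-level <b> element, with an empty string for an empty pair.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Hypothetical helper: collect the text of each top-level <b> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> elements currently open
        self.parts = []     # text fragments of the <b> being read
        self.entries = []   # one string per completed top-level <b>

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.parts = []   # start a fresh entry
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore a stray </b> that has no matching <b>.
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # An empty <b></b> simply yields an empty string.
                self.entries.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def bold_texts(html):
    """Return the text of every top-level <b> element in an HTML string."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.entries
```

Calling bold_texts() on a page's source returns a list in document order, so entries without bold text show up as empty strings instead of breaking the extraction; whether to skip or keep them is left to the caller.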