Parsing HTML

Walter Dörwald walter at
Thu Sep 23 21:27:23 CEST 2004

Richie Hindle wrote:

> [Richie]
>>BeautifulSoup is perfect for this job:
> Um, BeautifulSoup may be perfect, but my script isn't.  It fails with the
> Swedish page because it doesn't cope with "<b></b>" appearing in the HTML.
> And I don't know whether you'd consider it correct to extract only the bold
> text from the entries that have bold text.  But it gives you a place to start.
> 8-)

Another option might be the HTML parser from libxml2:

>>> import libxml2
>>> doc = libxml2.htmlParseFile("", None)
HTML parser error : htmlParseStartTag: invalid element name
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
>>> doc.serialize()
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "ht...
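
If the libxml2 bindings are not at hand, the same kind of tolerant parsing can be sketched with the standard library's html.parser instead. This is a different parser from libxml2, not a drop-in replacement for it. The class name BoldTextExtractor, the helper bold_texts, and the sample string are made up for illustration; the sketch assumes the goal is the bold text Richie mentions, and it skips empty "<b></b>" elements.

```python
from html.parser import HTMLParser


class BoldTextExtractor(HTMLParser):
    """Collect the text of every non-empty <b>...</b> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <b> elements currently open
        self.chunks = []    # text pieces of the <b> being read
        self.results = []   # finished bold strings

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.chunks).strip()
                if text:    # skip empty elements such as <b></b>
                    self.results.append(text)
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def bold_texts(html):
    """Return the stripped text of each non-empty <b> element in html."""
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical input: a stray processing instruction and an empty <b></b>.
sample = ('<?xml-stylesheet href="x.css" type="text/css"?>'
          '<p><b>One</b> <b></b> <b>Two</b></p>')
print(bold_texts(sample))
```

The stray processing instruction is passed over without an error, and the empty element contributes nothing to the result.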

    Walter Dörwald

