How to read information from tables in HTML?
stefan.behnel-n05pAM at web.de
Fri Aug 3 13:23:26 CEST 2007
> I'm confronted with some trouble when dealing with html files.
> And it seems that they're not well-formed, when parsed with minidom, it
> will say "mismatched tag".
minidom is an XML parser and expects well-formed input. What you're trying to
read is (similar to) HTML, which is much less strict, hence "mismatched tag".
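To illustrate the point, here is a minimal sketch of how the error arises. The markup and the helper name try_minidom are made up for this example. The snippet is ordinary HTML, but its <br> is never closed, so an XML parser rejects it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical snippet: acceptable as HTML, but <br> is never closed,
# so it is not well-formed XML.
html = "<html><body><p>first line<br>second line</p></body></html>"


def try_minidom(markup):
    """Return None if minidom parses the markup, else the error message."""
    try:
        minidom.parseString(markup)
    except ExpatError as exc:
        return str(exc)
    return None


error = try_minidom(html)
print(error)  # the message starts with "mismatched tag"
```

Writing the tag as <br/> makes the same snippet parse, but real-world HTML rarely gets fixed that easily, which is why an HTML-aware parser is the better tool.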
> Then how can i get information from those files? Is there any useful
> library for me?
BeautifulSoup or lxml.html (which supports the BeautifulSoup parser, btw).
Both can deal with broken HTML, but lxml.html has better support for cleaning
up the HTML.
The lxml.html package is not currently in an official lxml release, but you
can install it from SVN sources:
A release is expected soon.
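If neither library is at hand, the idea of reading table cells from sloppy HTML can still be sketched with the standard library's html.parser module. Note that this is a stand-in, not one of the two libraries recommended above, and it is far less capable. The names TableReader and read_table_rows and the sample markup are hypothetical. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # tolerate a missing </td> or </th>
            if self._row is None:
                self._row = []  # tolerate a cell without a <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def read_table_rows(markup):
    """Return the table content as a list of rows of cell strings."""
    reader = TableReader()
    reader.feed(markup)
    reader.close()
    reader._close_row()  # flush a row left open at end of input
    return reader.rows


# Hypothetical input with several missing end tags.
broken = """
<table>
  <tr><th>Name<th>Qty
  <tr><td>apple<td>3</tr>
  <tr><td>pear</td><td>5
</table>
"""

rows = read_table_rows(broken)
print(rows)
```

For the sample above this yields one list per row, header row first. For anything beyond simple flat tables, BeautifulSoup or lxml.html remain the better choice.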