[Tutor] Extract information from an HTML table
dyoo at hkn.eecs.berkeley.edu
Wed Aug 3 19:50:09 CEST 2005
On Wed, 3 Aug 2005, David Holland wrote:
> I would like to do the following. I have an HTML document with a table
> with 2 columns. I would like to write a program to join the 2 columns
> together, ideally with same font etc. Ie get all the information between
> <TR> and </TR> and the next <TR> and </TR> and put it together. Does
> anyone have any idea on how to do this ? I suppose I could remove every
> second <TR> and </TR>, any better ideas ?
Yes. Unless your situation is very dire, don't try to code this by using
regular expressions! *grin*
Use an HTML parser for this if you can. There are parsers in the Standard
Library. There's also one called Beautiful Soup that I've heard very good
things about:
>>> import BeautifulSoup
>>> from BeautifulSoup import BeautifulSoup
>>> text = "<table><tr><td>hello</td><td>world</td></tr></table>"
>>> soup = BeautifulSoup(text)
Once we have a soup, we can start walking through it:
>>> for td in soup.table.tr:
...     print td.string
So it handles a lot of the ugliness behind parsing HTML.
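For comparison, here is a rough sketch of the Standard Library route, written
for Python 3 and its html.parser module. The names RowCollector and
join_columns are made up for illustration. It assumes a simple, non-nested
table with explicit closing tags, and it keeps only the text of each cell, so
fonts and other markup are dropped:

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lowercased.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def join_columns(html, sep=" "):
    """Return one string per table row, with that row's cells joined by sep."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [sep.join(cells) for cells in parser.rows]


text = "<table><tr><td>hello</td><td>world</td></tr></table>"
print(join_columns(text))
```

Each row comes back as a single string, which is roughly the "join the 2
columns" step from the question. Writing the result back out as HTML is left
out of this sketch.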
Good luck to you!