[Tutor] Extract information from an HTML table

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Aug 3 19:50:09 CEST 2005

On Wed, 3 Aug 2005, David Holland wrote:

> I would like to do the following.  I have an HTML document with a table
> with 2 columns. I would like to write a program to join the 2 columns
> together, ideally with same font etc. Ie get all the information between
> <TR> and </TR> and the next <TR> and </TR> and put it together.  Does
> anyone have any idea on how to do this ? I suppose I could remove every
> second <TR> and </TR>, any better ideas ?

Hi David,

Yes.  Unless your situation is very dire, don't try to code this by using
regular expressions!  *grin*

Use an HTML parser for this if you can.  There are parsers in the Standard
Library.  There's also one called Beautiful Soup that I've heard very good
things about:


For example:

>>> from BeautifulSoup import BeautifulSoup
>>> text = "<table><tr><td>hello</td><td>world</td></tr></table>"
>>> soup = BeautifulSoup(text)

Once we have a soup, we can start walking through it:

>>> soup
>>> soup.table
>>> soup.table.tr
>>> soup.table.tr.td
>>> soup.table.tr.td.string
>>> soup.table.tr.td.nextSibling
>>> for td in soup.table.tr:
...     print td.string
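
To tie the walk above back to the original question, here is a rough
sketch of folding the cells of each row into one.  It assumes Python 3 and
the current Beautiful Soup package (imported as bs4), not the older module
used in the session above.  The helper name merge_row_cells and the sample
markup are made up for illustration.

```python
# Assumes the third-party bs4 package (current Beautiful Soup) is installed.
from bs4 import BeautifulSoup


def merge_row_cells(html, sep=" "):
    """Fold every <td> of each <tr> into that row's first <td>.

    merge_row_cells is an illustrative helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        # Only look at cells that are direct children of this row.
        cells = row.find_all("td", recursive=False)
        if len(cells) < 2:
            continue  # nothing to merge in this row
        first = cells[0]
        for cell in cells[1:]:
            first.append(sep)
            # Copy the list first: appending a child to another tag
            # removes it from cell.contents while we iterate.
            for child in list(cell.contents):
                first.append(child)
            cell.decompose()  # drop the now-empty cell
    return str(soup)


text = ("<table><tr><td>hello</td><td>world</td></tr>"
        "<tr><td><b>foo</b></td><td>bar</td></tr></table>")
print(merge_row_cells(text))
```

Because the children are moved rather than flattened to text, inline markup
such as the <b> tag travels with them, which is probably as close to "same
font etc." as this approach gets.  Attributes set on the removed <td>
itself are lost.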

So it handles a lot of the ugliness behind parsing HTML.
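
For comparison, here is a minimal sketch of the same job using only the
Standard Library parser (html.parser in Python 3).  The class RowJoiner and
the function join_columns are hypothetical names, and this version keeps
only the text of each cell, so any font markup inside the cells is dropped.

```python
from html.parser import HTMLParser


class RowJoiner(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per row
        self.in_cell = False  # True while inside <td>...</td>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> both match.
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self.in_cell:
            self.rows[-1][-1] += data


def join_columns(html, sep=" "):
    """Return one joined string per table row."""
    parser = RowJoiner()
    parser.feed(html)
    parser.close()
    return [sep.join(cell.strip() for cell in row) for row in parser.rows]


text = ("<table><tr><td>hello</td><td>world</td></tr>"
        "<tr><td><b>foo</b></td><td>bar</td></tr></table>")
print(join_columns(text))
```

The state tracking (which row and cell we are in) is exactly the
bookkeeping that a tree-building parser does for you.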

Good luck to you!
