[Tutor] Extract information from an HTML table

Wed Aug 3 19:50:09 CEST 2005

On Wed, 3 Aug 2005, David Holland wrote:

> I would like to do the following.  I have an HTML document with a table
> with 2 columns. I would like to write a program to join the 2 columns
> together, ideally with same font etc. Ie get all the information between
> <TR> and </TR> and the next <TR> and </TR> and put it together.  Does
> anyone have any idea on how to do this ? I suppose I could remove every
> second <TR> and </TR>, any better ideas ?

Hi David,

Yes.  Unless your situation is very dire, don't try to code this by using
regular expressions!  *grin*

Use an HTML parser for this if you can.  There are parsers in the Standard
Library.  There's also one called Beautiful Soup that I've heard very good
things about:

    http://www.crummy.com/software/BeautifulSoup/

For example:

#####
>>> from BeautifulSoup import BeautifulSoup
>>> text = "<table><tr><td>hello</td><td>world</td></tr></table>"
>>> soup = BeautifulSoup(text)
#####

Once we have a soup, we can start walking through it:

#####
>>> soup
<table><tr><td>hello</td><td>world</td></tr></table>
>>> soup.table
<table><tr><td>hello</td><td>world</td></tr></table>
>>> soup.table.tr
<tr><td>hello</td><td>world</td></tr>
>>> soup.table.tr.td
<td>hello</td>
>>> soup.table.tr.td.string
'hello'
>>> soup.table.tr.td.nextSibling
<td>world</td>
>>>
>>> for td in soup.table.tr:
...     print td.string
...
hello
world
#####

So it handles a lot of the ugliness behind parsing HTML.
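
If you would rather stay inside the Standard Library, here is a rough
sketch of the same idea applied to your problem: gather the cells of each
row and join them into one string.  It assumes a current Python 3, where
the parser lives in the html.parser module; the names RowJoiner and
join_columns are made up for this example.  Note that it keeps only the
text of each cell, so font tags and other markup inside the cells are
dropped.

```python
from html.parser import HTMLParser


class RowJoiner(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    RowJoiner is a hypothetical helper name, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per table row
        self._in_cell = False   # True while we are inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> look the same.
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still lands in the open cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def join_columns(html, sep=" "):
    """Return one string per table row, with its cells joined by sep."""
    parser = RowJoiner()
    parser.feed(html)
    parser.close()
    return [sep.join(cell.strip() for cell in row) for row in parser.rows]


text = "<table><tr><td>hello</td><td>world</td></tr></table>"
joined = join_columns(text)
print(joined)
```

If you need to keep the markup inside the cells, you would have to
re-emit the tags as you go rather than collecting plain text, and at that
point a tree-building parser like Beautiful Soup is probably less work.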

Good luck to you!