[Tutor] Extract information from an HTML table
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Wed Aug 3 19:50:09 CEST 2005
On Wed, 3 Aug 2005, David Holland wrote:
> I would like to do the following. I have an HTML document with a table
> with 2 columns. I would like to write a program to join the 2 columns
> together, ideally with same font etc. Ie get all the information between
> <TR> and </TR> and the next <TR> and </TR> and put it together. Does
> anyone have any idea on how to do this ? I suppose I could remove every
> second <TR> and </TR>, any better ideas ?
Hi David,
Yes. Unless your situation is very dire, don't try to code this by using
regular expressions! *grin*
Use an HTML parser for this if you can. There are parsers in the Standard
Library. There's also one called Beautiful Soup that I've heard very good
things about:
http://www.crummy.com/software/BeautifulSoup/
For example:
#####
>>> from BeautifulSoup import BeautifulSoup
>>> text = "<table><tr><td>hello</td><td>world</td></tr></table>"
>>> soup = BeautifulSoup(text)
#####
Once we have a soup, we can start walking through it:
#####
>>> soup
<table><tr><td>hello</td><td>world</td></tr></table>
>>> soup.table
<table><tr><td>hello</td><td>world</td></tr></table>
>>> soup.table.tr
<tr><td>hello</td><td>world</td></tr>
>>> soup.table.tr.td
<td>hello</td>
>>> soup.table.tr.td.string
'hello'
>>> soup.table.tr.td.nextSibling
<td>world</td>
>>>
>>> for td in soup.table.tr:
...     print td.string
...
hello
world
#####
So Beautiful Soup takes care of a lot of the ugliness of parsing HTML.
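If you'd rather stay within the Standard Library, here is a rough sketch of the column-joining step itself. It is only one possible approach, not a tested solution for your document: it assumes a recent Python 3 (the html.parser module), that every row has explicit closing tags, and that there are no nested tables. The names CellJoiner and join_columns are made up for this example. The idea is to re-emit the HTML as it streams past, dropping the boundary between adjacent cells in a row so that any markup inside the cells (fonts, bold, and so on) passes through untouched:

```python
from html.parser import HTMLParser


class CellJoiner(HTMLParser):
    """Re-emit HTML, merging all <td> cells of each <tr> into one cell.

    Hypothetical helper for illustration. Comments and doctype
    declarations are dropped, and end tags come out lowercased.
    """

    def __init__(self, separator=" "):
        # Keep entities such as &amp; as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.separator = separator
        self.out = []
        self.cells_seen = 0  # number of <td> tags opened in the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells_seen = 0
        if tag == "td":
            self.cells_seen += 1
            if self.cells_seen > 1:
                # Second or later cell: emit a separator instead of <td>.
                self.out.append(self.separator)
                return
        # Pass every other start tag through exactly as written.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> pass through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "td":
            return  # the single merged cell is closed at </tr>
        if tag == "tr" and self.cells_seen:
            self.out.append("</td>")
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def join_columns(html, separator=" "):
    """Return html with the cells of every table row merged into one."""
    parser = CellJoiner(separator)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


text = "<table><tr><td>hello</td><td>world</td></tr></table>"
print(join_columns(text))
# prints: <table><tr><td>hello world</td></tr></table>
```

The separator argument controls what goes between the joined cells; pass "" to butt them together or "<br>" to stack them. Real-world HTML that leaves out closing tags would need more care than this sketch takes.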
Good luck to you!