getting tables out

Mon May 24 15:50:49 EDT 1999

Michael Spalinski <mspal at sangria.harvard.edu> writes:

> I would like to write a Python script that would read an HTML document and
> extract table contents from it. Eg. each table could be a list of tuples
> with data from the rows. I thought htmllib would provide the basic tools
> for this, but I can't find any example that would be of use. 
> 
> So - does anyone have a Python snippet that looks for tables and gets at
> the data?

I know there have been several responses -- but as a compulsive
minimalist, I just couldn't resist trying to make a small solution...

------ start table parser ------

from re import compile, findall, I, S

# I: ignore case; S: let "." match newlines as well
flags = I+S
# [^>]* skips any attributes inside the opening tag
tpat = compile("<table[^>]*>.*?</table>",flags)
rpat = compile("<tr[^>]*>.*?</tr>",flags)
dpat = compile("<td[^>]*>(.*?)</td>",flags)

data = open("data.html").read()
result = []    # one list per table; each row becomes a tuple of cell strings

for table in findall(tpat,data):
    result.append([])
    for row in findall(rpat,table):
        result[-1].append([])
        for cell in findall(dpat,row):
            result[-1][-1].append(cell)
        result[-1][-1] = tuple(result[-1][-1])

------- stop table parser -------
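
For anyone who wants to try the idea without a data.html on disk, here is a
rough sketch of the same regex approach wrapped in a function and run on an
inline string. The function name extract_tables and the sample markup are
made up for illustration; the patterns follow the ones above, with a "*"
after each [^>] so that tags with and without attributes both match.

```python
import re

# Patterns in the style of the parser above. FLAGS combines case-insensitive
# matching with "." matching newlines.
FLAGS = re.I | re.S
TABLE = re.compile(r"<table[^>]*>.*?</table>", FLAGS)
ROW = re.compile(r"<tr[^>]*>.*?</tr>", FLAGS)
CELL = re.compile(r"<td[^>]*>(.*?)</td>", FLAGS)


def extract_tables(html):
    """Return one list per table; each row is a tuple of raw cell strings.

    extract_tables is a hypothetical helper name, not part of any library.
    """
    return [
        [tuple(CELL.findall(row)) for row in ROW.findall(table)]
        for table in TABLE.findall(html)
    ]


# Made-up sample input: mixed-case tags, some with attributes.
sample = """
<table border="1">
  <tr><td>spam</td><td>1</td></tr>
  <TR><TD align="right">eggs</TD><TD>2</TD></TR>
</table>
"""

tables = extract_tables(sample)
print(tables)  # [[('spam', '1'), ('eggs', '2')]]
```

Being this minimal has a price: nested tables, <th> header cells, and cells
whose closing </td> is omitted are not handled, and cell contents come back
as raw strings with any inner markup left in place.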

> 
> M.

--

  Magnus
  Lie
  Hetland        http://arcadia.laiv.org <arcadia at laiv.org>