getting tables out
Magnus L. Hetland
mlh at idt.ntnu.no
Mon May 24 15:50:49 EDT 1999
Michael Spalinski <mspal at sangria.harvard.edu> writes:
> I would like to write a Python script that would read an HTML document and
> extract table contents from it. Eg. each table could be a list of tuples
> with data from the rows. I thought htmllib would provide the basic tools
> for this, but I can't find any example that would be of use.
>
> So - does anyone have a Python snippet that looks for tables and gets at
> the data?
I know there have been several responses -- but as a compulsive
minimalist, I just couldn't resist trying to make a small solution...
------ start table parser ------
from re import compile, findall, I, S

flags = I | S  # ignore case; let "." match across newlines
# [^>]* (note the *) lets each opening tag carry any number of attributes
tpat = compile(r"<table[^>]*>.*?</table>", flags)
rpat = compile(r"<tr[^>]*>.*?</tr>", flags)
dpat = compile(r"<td[^>]*>(.*?)</td>", flags)

data = open("data.html").read()
result = []
for table in findall(tpat, data):
    result.append([])
    for row in findall(rpat, table):
        result[-1].append([])
        for cell in findall(dpat, row):
            result[-1][-1].append(cell)
        # the row is complete, so freeze it into a tuple
        result[-1][-1] = tuple(result[-1][-1])
------- stop table parser -------
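
For anyone who wants to try this without a data.html at hand, here is a minimal sketch of the same regex approach wrapped in a function and run on an inline string. The names extract_tables and sample are made up for illustration, and the patterns use [^>]* so that opening tags with attributes still match. It assumes tables are not nested and that every row and cell has an explicit closing tag; <th> cells are ignored.

```python
import re

FLAGS = re.I | re.S  # ignore case; let "." match across newlines
TABLE = re.compile(r"<table[^>]*>.*?</table>", FLAGS)
ROW = re.compile(r"<tr[^>]*>.*?</tr>", FLAGS)
CELL = re.compile(r"<td[^>]*>(.*?)</td>", FLAGS)


def extract_tables(html):
    """Return one list per table, holding one tuple of cell strings per row.

    Hypothetical helper: assumes no nested tables and explicit closing tags.
    """
    return [
        [tuple(CELL.findall(row)) for row in ROW.findall(table)]
        for table in TABLE.findall(html)
    ]


# Made-up sample input: two tables, mixed-case tags, one row with an attribute.
sample = """
<TABLE border="1">
  <tr><td>a</td><td>b</td></tr>
  <tr class="odd"><td>1</td><td>2</td></tr>
</TABLE>
<table><tr><td>x</td></tr></table>
"""

print(extract_tables(sample))
```

On the sample above this prints [[('a', 'b'), ('1', '2')], [('x',)]]. Cell contents come back as raw strings, so any markup inside a cell is left in place.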
>
> M.
--
Magnus
Lie
Hetland http://arcadia.laiv.org <arcadia at laiv.org>