converting an html table to a tree

Thu Aug 24 11:29:53 EDT 2000

"Alex Martelli" <alex at magenta.com> wrote in message
news:8o2lok0160q at news2.newsguy.com...
> [posted AND mailed]
>
> "Ian Lipsky" <NOSPAM at pacificnet.net> wrote in message
> news:to2p5.444$3Q6.18123 at newsread2.prod.itd.earthlink.net...
> > hi all,
> >
> > I'm completely new to python...just started reading Learning Python. I've
> > got 5 days to figure out how to write a script to take an html table and
> > convert it to a tree....basically a nested array (that's my guess on how it
> > would be done anyhow). Oh yeah...and I also have to drive 3000 miles in
> > those 5 days ;)p
>
> Despite Python's ease, coding Python while actually driving is
> a practice to be discouraged.  Your Python code will probably
> come out all right, but your car might crash in the meantime.
>
>
> > I was hoping someone could give me a push in the right direction. What
> > functions or whatever should I look at to get this done? I saw that in the
> > book it mentions Python's ability to grab html and parse it, as one of the
> > pluses of the language ('internet utility modules' is what the book called
> > it/them).
>
> Standard modules htmllib and sgmllib will indeed help you.
>
>
> > And I just found I don't have to actually go out and grab the page off a
> > webserver. It'll be a file residing on the machine where the script will be
> > run. I assume that'll make it a little easier for me :)
>
> Not by much, since getting data off an arbitrary URL is so easy with
> Python, but, yes, you can reduce your 'main program' to two lines:
>
>     parser.feed(open('myfile.html').read())
>     parser.close()
>
> once you have properly instantiated the 'parser' instance you need.
>
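
A minimal sketch of that feed-then-close pattern on a local file. It assumes Python 3, where htmllib and sgmllib no longer exist and html.parser.HTMLParser is the standard-library stand-in. TagCounter and count_tags are made-up names for illustration, and the file is a throwaway one written to a temporary directory.

```python
import os
import tempfile
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical parser: counts how often each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already lowercased.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(filename):
    parser = TagCounter()
    # The 'main program': read the whole file, feed it, then close.
    with open(filename, encoding="utf-8") as handle:
        parser.feed(handle.read())
    parser.close()
    return parser.counts


# Write a small sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "myfile.html")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<table><tr><td>1</td><td>2</td></tr></table>")
    counts = count_tags(path)

print(counts)
```
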
> Basically, you want to derive your class from htmllib.HTMLParser, and
> add methods to handle the tags you're specifically interested in -- for
> the problem you stated, table-related tags.
>
> For any tag-name FOO, you need to define in your class, either one method:
>     def do_foo(self, attributes):
>         # do whatever
> if the tag does not require a corresponding close-tag (e.g., <br>); or,
> more commonly, two methods:
>     def start_foo(self, attributes):
>         # opening stuff
>     def end_foo(self):
>         # closing stuff
> if both an opening and a closing tag will be there (<table> ... </table>,
> and similar cases).
>
> The 'attributes' argument is a (possibly empty) list of (name,value) pairs.
>
> Further, you'll want to define a method in your class:
>     def handle_data(self, data):
>         # whatever
> that will receive all textual data.  Of course, you'll have flags
> you maintain on start/end methods telling you whether the data must
> simply be discarded, or how it is to be processed if relevant.
>
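
The start_foo/end_foo plus handle_data-with-a-flag idea can be sketched roughly as follows. This assumes Python 3's html.parser rather than htmllib: html.parser does not look up start_foo/end_foo methods by itself, so the TagDispatcher class below is a hypothetical shim that imitates that lookup. CellGrabber is likewise just an illustrative name.

```python
from html.parser import HTMLParser


class TagDispatcher(HTMLParser):
    """Hypothetical shim: forwards tags to start_<tag>/do_<tag>/end_<tag>."""

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags with no close-tag.
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)  # attrs is a list of (name, value) pairs

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()


class CellGrabber(TagDispatcher):
    def __init__(self):
        super().__init__()
        self.in_td = False  # flag maintained by the start/end methods
        self.cells = []

    def start_td(self, attributes):
        self.in_td = True
        self.cells.append("")

    def end_td(self):
        self.in_td = False

    def handle_data(self, data):
        # Text outside a TD (including TH contents) is simply discarded.
        if self.in_td:
            self.cells[-1] += data


grabber = CellGrabber()
grabber.feed("<table><tr><th>skip me</th><td>keep me</td></tr></table>")
grabber.close()
cells = grabber.cells
print(cells)  # ['keep me']
```
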
>
> Now to your specific case.  The tags you may want to handle are:
>
> TABLE, CAPTION, COL, COLGROUP, TBODY, TD, TFOOT, TH, THEAD, TR.
>
> (have I missed any...?).  COL, I believe, is the only one that
> does not require a closing-tag (although I think COLGROUP has an
> _optional_ closing-tag if COL elements are not nested in it, but
> I'm not sure).  Which of these tags carry significant information
> for your purposes...?
>
> The general structure might be:
>     TABLE
>         CAPTION
>         THEAD
>         TBODY
>         TFOOT
> CAPTION is optional.  So is each of THEAD, TBODY, TFOOT: if none
> is explicitly specified, TBODY is implied.  Each of THEAD, TBODY,
> TFOOT has contents:
>     THEAD|TBODY|TFOOT:
>         TR
>             TH
>             TD
> Zero or more TH and TD within each TR, zero or more TR's.
>
> I'm skipping COL, COLGROUP, and the attributes, as I think they
> are basically presentational only, and you seem interested in
> content-structuring instead.
>
>
> Now, we need more precise specs: what kinds of tables do you
> need to parse, and how do you want to structure (and output?)
> the data they contain, depending on caption/thead/tbody/tfoot
> and tr/th/td issues...?
>
> Let's take a very simple case to make things more definite.
>
> We process TABLE elements where only TBODY is interesting --
> THEAD and TFOOT, we skip silently.  Similarly, we skip TH
> and its contents too: we're only interested in:
>     TABLE
>         TBODY (may be implied)
>             TR (zero or more)
>                 TD (zero or more)
>                     data contents of TD tags only
> As a result, we return a list (Python's normal data structure
> for sequences; 'array' is very specialized in Python) where
> each element corresponds to one row (TR); each element in
> the list is another, nested, list, where each element
> corresponds to the data in a TD, in sequence.
>
> Our class will expect to be 'fed' a document fragment
> containing exactly one TABLE (the TABLE tag will have
> to be explicit), and will ignore anything outside of
> that tag as well as any redundant or nested TABLE tags
> that may also be present.  This is basically for simplicity;
> you will have to think deep about what you want to do
> in each of these cases!  And add good error diagnosis...
>
> We'll also basically assume decent nesting rather than
> go out of our way to accept peculiarly structured tables;
> this, too, will need in-depth review!
>
>
> import htmllib
> import formatter
> import string
> import pprint
>
> class TableParser(htmllib.HTMLParser):
>     def __init__(self):
>         self.active=0
>         self.finished=0
>         self.skipping=0
>         self.result=[]
>         self.current_row=[]
>         self.current_data=[]
>         htmllib.HTMLParser.__init__(
>             self, formatter.NullFormatter())
>     def start_table(self,attributes):
>         if not self.finished:
>             self.active=1
>     def end_table(self):
>         self.active=0
>         self.finished=1
>     def start_tbody(self,attributes):
>         self.skipping=0
>     def end_tbody(self):
>         self.skipping=1
>     def start_thead(self,attributes):
>         self.skipping=1
>     def end_thead(self):
>         self.skipping=0
>     def start_tfoot(self,attributes):
>         self.skipping=1
>     def end_tfoot(self):
>         self.skipping=0
>     def start_caption(self,attributes):
>         self.skipping=1
>     def end_caption(self):
>         self.skipping=0
>     def start_th(self,attributes):
>         self.skipping=self.skipping+1
>     def end_th(self):
>         self.skipping=self.skipping-1
>     def start_tr(self,attributes):
>         if self.active and not self.skipping:
>             self.current_row = []
>     def end_tr(self):
>         if self.active and not self.skipping:
>             self.result.append(self.current_row)
>     def start_td(self,attributes):
>         if self.active and not self.skipping:
>             self.current_data = []
>     def end_td(self):
>         if self.active and not self.skipping:
>             self.current_row.append(
>                 string.join(self.current_data))
>     def handle_data(self, data):
>         if self.active and not self.skipping:
>             self.current_data.append(data)
>
> def process(filename):
>     parser=TableParser()
>     parser.feed(open(filename).read())
>     parser.close()
>     return parser.result
>
> def showparse(filename):
>     pprint.pprint(process(filename))
>
> def _test():
>     return showparse('c:/atable.htm')
>
> if __name__=='__main__':
>     _test()
>
>
> With c:/atable.htm contents being, for example:
>
>
> <TABLE BORDER=1 WIDTH=80%>
> <THEAD>
> <TR>
> <TH>Heading 1</TH>
> <TH>Heading 2</TH>
> </TR>
> </THEAD>
> <TBODY>
> <TR>
> <TD>Row 1, Column 1 text.</TD>
> <TD>Row 1, Column 2 text.</TD>
> </TR>
> <TR>
> <TD>Row 2, Column 1 text.</TD>
> <TD>Row 2, Column 2 text.</TD>
> </TR>
> </TBODY>
> </TABLE>
>
>
> running the _test function will emit:
>
> >>> tableparse._test()
> [['Row 1, Column 1 text.', 'Row 1, Column 2 text.'],
>  ['Row 2, Column 1 text.', 'Row 2, Column 2 text.']]
> >>>
>
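
For anyone trying this on Python 3, where htmllib, sgmllib and string.join are gone, here is a rough port of the same idea. It is a sketch under assumptions, not the code above: it uses html.parser.HTMLParser, the class name TableParser3 is invented, and it joins cell text with "" and strips it instead of using string.join.

```python
from html.parser import HTMLParser

SAMPLE = """<TABLE BORDER=1 WIDTH=80%>
<THEAD>
<TR><TH>Heading 1</TH><TH>Heading 2</TH></TR>
</THEAD>
<TBODY>
<TR><TD>Row 1, Column 1 text.</TD><TD>Row 1, Column 2 text.</TD></TR>
<TR><TD>Row 2, Column 1 text.</TD><TD>Row 2, Column 2 text.</TD></TR>
</TBODY>
</TABLE>"""

SKIPPED = ("thead", "tfoot", "caption", "th")


class TableParser3(HTMLParser):
    def __init__(self):
        super().__init__()
        self.active = False    # inside the first TABLE
        self.finished = False  # first TABLE already closed
        self.skipping = 0      # depth of skipped elements
        self.result = []
        self.current_row = []
        self.current_data = []

    def collecting(self):
        return self.active and not self.skipping

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if not self.finished:
                self.active = True
        elif tag in SKIPPED:
            self.skipping += 1
        elif tag == "tr" and self.collecting():
            self.current_row = []
        elif tag == "td" and self.collecting():
            self.current_data = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.active = False
            self.finished = True
        elif tag in SKIPPED:
            self.skipping = max(0, self.skipping - 1)
        elif tag == "tr" and self.collecting():
            self.result.append(self.current_row)
        elif tag == "td" and self.collecting():
            self.current_row.append("".join(self.current_data).strip())

    def handle_data(self, data):
        if self.collecting():
            self.current_data.append(data)


parser = TableParser3()
parser.feed(SAMPLE)
parser.close()
result = parser.result
```

On the sample table this should give the same two-row nested list as shown above.
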
>
> I hope this gives you a somewhat usable start on
> your problem.
>
>
> Alex

WOW....where were you when I wanted someone to do my math homework for me? :)

Thanks!! I found a post a few days back that also at least pointed me at
some of the things I'd want to look at (urllib and htmllib). You actually
gave me more info than I wanted :)
I don't need to worry about the contents of each table cell. So unless I am
overlooking something, I'll really only need to worry about the TABLE, TR
and TD tags. I think I have to do this as though there could be an
unspecified number of tables, which shouldn't be much more complicated than
doing it if it were a specified number.
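
A rough guess at what the "any number of tables, only TABLE, TR and TD" version might look like. This is only a sketch under assumptions: it is written against Python 3's html.parser (not htmllib), the name AllTablesParser is made up, nested tables are not handled specially, and cell text is kept just to make the structure visible.

```python
from html.parser import HTMLParser


class AllTablesParser(HTMLParser):
    """Hypothetical parser: builds a list of tables, each a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []  # one entry per TABLE seen
        self.row = None   # row being filled, or None outside a TR
        self.cell = None  # text pieces of the current TD, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.row = []
            self.tables[-1].append(self.row)
        elif tag == "td" and self.row is not None:
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr":
            # Also drop any cell left open by a missing </td>.
            self.row = None
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


page = (
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<p>text between tables</p>"
    "<table><tr><td>c</td></tr></table>"
)
parser = AllTablesParser()
parser.feed(page)
parser.close()
tables = parser.tables
print(tables)  # [[['a', 'b']], [['c']]]
```
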

Anyhow, I'm going to give this a try later tonight hopefully. But it looks
like time is going to kill me on this one.

Oh...and I wasn't planning on driving AND coding at the same time....mainly
cuz I don't have a laptop and my desktop machine just won't fit on the front
seat with the monitor too :)p

Thanks for the help!!!