[Tutor] Parsing HTML ... where to start?

Drew Perttula drewp@bigasterisk.com
Thu Mar 13 01:49:01 2003


> Eventually, many miles/km down the road, I'd like to
> be able to download my personal financial information
> from a web site, storing various balance numbers in
> variables that I can go on to manipulate, display,
> etc. The web sites will require passwords to access
> and, presumably, are https://
> 

I just did that very project Sunday night! I now have a cron job that
fetches the balances of two bank accounts every day and puts them in a
log for my enjoyment and analysis.

The hurdles for my bank's site were SSL (of course) and a cookie that
had to be presented on each page access. My program submits the login
page with my user/passwd, "presses a button" on the next page that
appears, then fetches the contents of another frame. I quickly tear
up the HTML of that result to get the balances I want, and then
"press" the logout button. All accesses are HTTP POST operations.

I used httpsession from http://webunit.sourceforge.net/ because it
handles SSL with cookies; I couldn't find an automatic cookie system
in the stdlib modules. The core of my program is this class:

import httpsession

class Pagefetcher:
    def __init__(self, hostname):
        # one session per bank; it carries the cookies across requests
        self.sess = httpsession.HTTPSession(debug_level=0, use_cookies=1)
        self.sess.add_header('user-agent',
                             'auto balance fetcher by drewp@bigasterisk.com')
        self.sess.add_header('Host', hostname)

    def fetchpage(self, formdict, url):
        """url should include the hostname given above. formdict is a
        python dict of form names and values to be POSTed to the
        site. the result page is returned."""
        req = self.sess.post(url)
        for k, v in formdict.items():
            req.add_param(k, v)
        pagedata = req.getfile()
        return pagedata.read()
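
To give the flavor, here's what a driver for a made-up bank might
look like. Every URL and form field name below is invented; you'd
read the real ones out of each page's HTML:

f = Pagefetcher('www.mybank.example')

# log in (field names invented for illustration)
page = f.fetchpage({'userid': 'drewp', 'password': 'xyzzy'},
                   'https://www.mybank.example/login')

# "press" the button on the next page that appears
page = f.fetchpage({'action': 'accounts'},
                   'https://www.mybank.example/main')

# fetch the frame that actually carries the balances
balances = f.fetchpage({}, 'https://www.mybank.example/summary')
# ...tear up the HTML of balances here...

# press the logout button
f.fetchpage({'action': 'logout'}, 'https://www.mybank.example/logout')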

My real main code is specific to my bank, of course: just such a
sequence of fetchpage() calls with the right form variables. I send
the last page through my friend's table-extracting module
TableParse.py (http://bebop.bigasterisk.com/python/). That module
works for me, but htmllib or HTMLParser from the stdlib might be
better choices.
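
For instance, here's a minimal sketch of the HTMLParser route
(untested against any real bank page; the '$' test is just a guess
at what marks a balance). It collects the text of every table cell:

from HTMLParser import HTMLParser

class CellGrabber(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = 0
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td':            # a new table cell starts
            self.in_cell = 1
            self.cells.append('')
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = 0
    def handle_data(self, data):
        if self.in_cell:           # text inside the current cell
            self.cells[-1] = self.cells[-1] + data

p = CellGrabber()
p.feed(page)                       # page is a fetchpage() result
p.close()
print [c.strip() for c in p.cells if '$' in c]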

Finally, you should be aware of recording proxies. These are HTTP
proxy programs that you point your browser at, and they record all
the requests and returned page data for later analysis. The advantage
is that you can use an ordinary browser, surf for the data you want,
and quickly get an automatic trace of which URLs need to be fetched.
If there's dynamic data in the requests, you'd replace it with a
dynamically inserted value, etc. The disadvantage is that these
proxies can't be used with SSL -- that's the point of SSL: an
intermediate program can't see the unencrypted data. So I, and
probably you, will have to work out the requests one at a time by
hand. (A toy sketch of such a proxy follows, for the plain-http
case.)
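
A bare-bones recording proxy is only a screenful of Python, if you
want to see the idea in code. This sketch is invented here, not
something I ran against my bank: it handles plain-http GET and POST,
prints each request, and does no error handling:

import BaseHTTPServer, urllib2

class RecordingProxy(BaseHTTPServer.BaseHTTPRequestHandler):
    # when the browser talks to a proxy, self.path is the complete URL
    def do_GET(self):
        print 'GET', self.path
        self.relay(urllib2.urlopen(self.path))
    def do_POST(self):
        body = self.rfile.read(
            int(self.headers.getheader('content-length', 0)))
        print 'POST', self.path, body
        self.relay(urllib2.urlopen(self.path, body))
    def relay(self, remote):
        # pass the origin server's page back to the browser
        self.send_response(200)
        self.end_headers()
        self.wfile.write(remote.read())

# point the browser's http proxy setting at localhost:8000
BaseHTTPServer.HTTPServer(('', 8000), RecordingProxy).serve_forever()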


-Drew