retrieving https pages

Mike Meyer mwm at mired.org
Wed Jul 20 05:13:33 CEST 2005


Eric <BorgMotherShip at AliensR_US.org> writes:

> I'm using Linux - Mandriva LE2005, python 2.3 (or I can also use python 2.4
> on my other system just as well).
> Anyways...
> I want to get a web page containing my stock grants.
> The initial page is an https and there is a form on it to
> fill in your username and password and then click "login"
> I played with python's urlopen and basically it complains "your browser
> doesn't support frames" meaning the urlopen call makes it unhappy somehow.
> Is it reasonable to think I can build a script to login to this secure
> website, move to a different page (on that site) and download it to disk?
> Or am I just looking at a long, complicated task.

It's not that bad. It took me about half a day to do this for a site I
wanted scraped regularly, and what I had to do was much more
complicated than what you describe. I had to deal with an optional
second login page (a "security feature" of the site), http-equiv
redirects (which urlopen doesn't handle), and then extracting, from
the resulting page, the URL of the page I actually wanted data from.
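The login step itself is just a form POST plus cookie handling. Here's a
minimal sketch using Python 3's urllib (in the 2.x of the day the same
pieces live in urllib2 and cookielib); the URL and field names are
invented for illustration - take the real ones from the login page's
form action and input names:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical login URL and form field names -- read the real ones
# out of the login page's <form action=...> and <input name=...> tags.
LOGIN_URL = "https://example.com/login"
form_data = urllib.parse.urlencode({
    "username": "eric",
    "password": "secret",
}).encode("ascii")          # POST bodies must be bytes

# Passing data= turns the request into a POST.
request = urllib.request.Request(LOGIN_URL, data=form_data)

# Most login forms hand back a session cookie, so open the URL through
# an opener that remembers cookies between requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# response = opener.open(request)   # the actual network call
```

After a successful login, fetch the later pages through the same opener
so the session cookie goes along for the ride.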

The complaint about your browser may be their inadequate attempt to
deal with browser portability by putting that on the resulting framed
page in the NOFRAMES element. In which case, you just need to find the
URL for the frame that's got the information you want, and get that
page. On the other hand, as Wes said, they may be browser-sniffing. In
which case you'll have to set the User-Agent to something they won't
complain about. Personally, I always try "Your Web Site Developer
Sucks" to see if they have a list of disallowed browsers. If that
fails, try the User-Agent string of a well-known browser.
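Setting the User-Agent is just a request header. A sketch with Python
3's urllib.request (urllib2 in the Python 2.3/2.4 the poster has); the
string shown is only an example, substitute whatever the site accepts:

```python
import urllib.request

# The User-Agent value here is just an example string.
request = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox"},
)

# urllib stores header names in Capitalized-with-lowercase form,
# so query it back as "User-agent":
print(request.get_header("User-agent"))
```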

For page scraping, install BeautifulSoup.
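As a sketch of both suggestions at once - finding the URL of the frame
that holds the data, using BeautifulSoup to do the digging (the import
shown is the modern bs4 spelling; the 2005-era package was imported as
`from BeautifulSoup import BeautifulSoup`). The markup and frame names
are invented for illustration:

```python
from bs4 import BeautifulSoup

# A cut-down frameset page of the kind the poster describes.
html = """
<frameset>
  <frame name="nav" src="nav.html">
  <frame name="content" src="grants.html">
  <noframes>Your browser doesn't support frames</noframes>
</frameset>
"""

soup = BeautifulSoup(html, "html.parser")
# Map each frame's name to its src; the interesting one is the
# frame whose src points at the page with the actual data.
frame_urls = {f.get("name"): f.get("src") for f in soup.find_all("frame")}
print(frame_urls["content"])
```

Fetch that src URL (resolved against the frameset page's URL) with the
same logged-in opener and you have the page you actually want.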

     <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


