[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

Kent Johnson kent37 at tds.net
Tue Jul 7 13:26:10 CEST 2009


On Mon, Jul 6, 2009 at 5:54 PM, David Kim <davidkim05 at gmail.com> wrote:
> Hello all,
>
> I have two questions I'm hoping someone will have the patience to
> answer as an act of mercy.
>
> I. How to get past a Terms of Service page?
>
> I've just started learning python (have never done any programming
> prior) and am trying to figure out how to open or download a website
> to scrape data. The only problem is, whenever I try to open the link
> (via urllib2, for example) I'm after, I end up getting the HTML to a
> Terms of Service Page (where one has to click an "I Agree" button)
> rather than the actual target page.
>
> I've seen examples on the web on providing data for forms (typically
> by finding the name of the form and providing some sort of dictionary
> to fill in the form fields), but this simple act of getting past "I
> Agree" is stumping me. Can anyone save my sanity? As a workaround,
> I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
> in a txt file for later processing. I have no idea why curl works and
> urllib2, for example, doesn't (I use OS X).

curl works because by default it does not follow the redirect to the
ToS page, and the site is (astoundingly) dumb enough to serve the
content in the body of the redirect response. You could make urllib2
behave the same way by defining a 302 handler that does nothing.
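
Here is a minimal sketch of such a handler. The names NoRedirectHandler
and fetch_ignoring_302 are made up for illustration, and the import
fallback is only there so the same code also runs on Python 3, where
urllib2 lives in urllib.request. It only covers 302, and it relies on
the site continuing to send the content along with the redirect.

```python
try:
    import urllib2                      # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2    # Python 3 name for the same module


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Hand back the 302 response itself instead of following it."""

    def http_error_302(self, req, fp, code, msg, headers):
        # fp is the response object for the redirect. Returning it
        # stops urllib2 from going on to request the ToS page.
        return fp


def fetch_ignoring_302(url):
    """Open url and return the body of the first response received."""
    # A subclass of a default handler replaces that default in the opener.
    opener = urllib2.build_opener(NoRedirectHandler)
    return opener.open(url).read()
```

Calling fetch_ignoring_302(url) with the url from your code should then
give you roughly what curl saves, as long as the site keeps behaving
this way.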

> I even tried to use Yahoo
> Pipes to try and sidestep coding anything altogether, but ended up
> looking at the same Terms of Service page anyway.
>
> Here's the code (tho it's probably not that illuminating since it's
> basically just opening a url):
>
> import urllib2
> url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
> #the first of 23 tables
> html = urllib2.urlopen(url).read()

Generally you have to post to the URL given in the form's action
attribute, sending the same fields the form does. You can inspect the
source of the form to figure this out. In this case the form is
<form method="post" action="/products/consent.php">
<input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
name="urltarget"/>
<input type="hidden" value="1" name="check_one"/>
<input type="hidden" value="tiwdata" name="tag"/>
<input type="submit" value="I Agree" name="acknowledgement"/>
<input type="submit" value="Decline" name="acknowledgement"/>
</form>

You generally need to enable cookie support in urllib2 as well,
because the site will use a cookie to flag that you saw the consent
form. This tutorial shows how to enable cookies and submit form data:
http://personalpages.tds.net/~kent37/kk/00010.html
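
Putting the two pieces together, here is a rough sketch of posting the
consent form with cookies enabled and then asking for the table. The
field names and values are copied from the form above; the helper names
(make_cookie_opener, accept_and_fetch) are invented for this example,
and the import fallback is only there so it also runs on Python 3. I
have not checked it against the live site, which may look at other
things too (a Referer header, for example).

```python
try:                                    # Python 2, as used in this thread
    import urllib2
    import cookielib
    from urllib import urlencode
    from urlparse import urljoin
except ImportError:                     # Python 3 names for the same modules
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode, urljoin

PAGE_URL = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'

# The form's action is a relative path, so resolve it against the page URL.
CONSENT_URL = urljoin(PAGE_URL, '/products/consent.php')

# One (name, value) pair per <input>. A browser sends only the submit
# button that was clicked, so the "Decline" button is left out.
FORM_FIELDS = [
    ('urltarget', 'tiwd/products/derivserv/data_table_i.php'),
    ('check_one', '1'),
    ('tag', 'tiwdata'),
    ('acknowledgement', 'I Agree'),
]


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar


def accept_and_fetch(opener, page_url=PAGE_URL):
    """Submit the consent form, then fetch page_url with the same opener."""
    body = urlencode(FORM_FIELDS).encode('ascii')  # Python 3 wants bytes
    # Passing data makes this a POST; any cookie set here lands in the jar.
    opener.open(CONSENT_URL, body).read()
    # The same opener sends the cookie back when asking for the table.
    return opener.open(page_url).read()
```

To try it: opener, jar = make_cookie_opener(), then
html = accept_and_fetch(opener). Keep using the same opener for the
other tables so the cookie is sent each time.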

Kent

