[Madison] Python to accept terms and condition form a website
Hey,

I am writing Python code to access public data online (using BeautifulSoup). The task is relatively easy, but the code does not reach the page I want because I first need to accept the website's terms and conditions (by a standard 'Click the Accept'). I need to tell Python how to accept the terms and conditions automatically and then proceed to the web address specified.

I am new to Python. My guess is that I have to use mechanize because cookielib is not good for this job. Am I right? What other resources can I use? Any link with an example similar to my problem would be great...

Thanks a lot!

--
Nicola Branzoli
Ph.D. Candidate - University of Wisconsin Madison
William H. Sewell Social Science Building
1180 Observatory Drive
Madison, WI 53706-1393
You probably have this figured out by now, but mechanize's Browser object has a method, select_form, which sets the browser's focus on a particular form so that you can submit it with the "click" method. Use select_form's predicate argument to pass a function you define that finds the right form based on its content. It's easier than it sounds.

Example code:

    from mechanize import Browser, ControlNotFoundError, FormNotFoundError
    from urllib2 import URLError

    def find_form(form):
        """
        The browser calls this function with each form on the page.
        You need to find something unique about the form you're
        interested in and return true if the passed-in form has it.
        So, in this example, your TOS form has a
        <input type="submit" name="TOS_BTN".../> field.
        Let's search for that.
        """
        try:
            # find_control raises ControlNotFoundError when the form
            # has no control with this name
            form.find_control(name="TOS_BTN")
        except ControlNotFoundError:
            return False
        return True

    # initialize browser, set user agent
    browser = Browser()
    browser.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11)')]

    # open the URL containing your TOS form
    try:
        browser.open('http://www.example.com/TOS.html')
    except URLError:
        print "couldn't open the page"
    else:
        # if your bot got a valid response
        if browser.viewing_html():
            try:
                # select_form returns nothing; it raises FormNotFoundError
                # when no form satisfies the predicate
                browser.select_form(predicate=find_form)
            except FormNotFoundError:
                print "couldn't find the TOS form"
            else:
                # optionally set other form fields
                browser.form["YOUR_NAME"] = "Mr. Spider"
                browser.form["NUM_RECORDS"] = "35"
                # browser.click generates a Request object which you can
                # pass to browser.open to submit the form.
                browser.open(browser.click())
                print "mission complete"

Here's the reference for mechanize forms:
http://wwwsearch.sourceforge.net/mechanize/forms.html
(it's a mess)

Eric
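On the cookielib part of the question: a cookie jar only stores and resends cookies, so the "click" on Accept still has to be sent as a form submission. Below is a minimal standard-library sketch of that split, assuming Python 3 (where cookielib is named http.cookiejar). The URL and the TOS_BTN field are the placeholders from the example above, the value "Accept" is made up, and the helper names are hypothetical. It is not the mechanize approach, just a rough illustration of what a form submit amounts to.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_accept_request(tos_url, fields):
    """Build (but do not send) the POST that submits the terms form."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(tos_url, data=data)


if __name__ == "__main__":
    opener, jar = make_opener()
    # Placeholder URL and field name; a real form may need more fields.
    request = build_accept_request(
        "http://www.example.com/TOS.html", {"TOS_BTN": "Accept"}
    )
    print(request.get_method(), request.data)
    # opener.open(request) would send it; any cookies the server sets are
    # kept in `jar` and reused by later opener.open(...) calls.
```

Whether a site accepts this depends on which fields its form really sends, so it is worth checking the actual form before relying on it.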
_______________________________________________ Madison mailing list Madison@python.org http://mail.python.org/mailman/listinfo/madison
Hopefully that example code displayed correctly. It showed up as quoted text for me in gmail (as though it was part of an ancestor post in the thread), so if you don't see code, expand the quoted text. Sorry about that.
Many thanks to Eric for his suggestion.

I had found a way to solve this problem by looking at how to parse the inputs to the JavaScript function __doPostBack(). The solution I found is a little naive, but it works and uses urllib2; this link was useful:
http://stackoverflow.com/questions/1418000/how-to-click-a-link-that-has-javascript-dopostback-in-href
The documentation on mechanize is still a little obscure to me...

Here is a new problem I am facing:

Page: http://emma.msrb.org/MarketActivity/RecentARS.aspx?showPopup=true

On this page the user can enter various search criteria. Suppose I want Auction Date, From: 05/02/2011 To: 05/06/2011.

Here is the way I did it (again using urllib2, because it was the promising approach given the previous success):

    import urllib
    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = 'http://emma.msrb.org/MarketActivity/RecentARS.aspx?showPopup=true'
    # send the disclaimer cookie with the request
    headers = {'Cookie': 'DisclaimerCookie=yes;path=/'}
    # form fields to post: the two auction date boxes and an empty event target
    values = {
        '__EVENTTARGET': '',
        'ctl00$mainContentArea$searchPopup$auctionDateEndTextBox': '05/06/2011',
        'ctl00$mainContentArea$searchPopup$auctionDateBeginTextBox': '05/02/2011',
    }
    dates_data = urllib.urlencode(values)
    req_cusips1 = urllib2.Request(url, dates_data, headers)
    response_cusips = urllib2.urlopen(req_cusips1)
    the_cusips_page = response_cusips.read()
    cusips_page = BeautifulSoup(the_cusips_page)

What I get back is the same page, with the values substituted in the right place. The relevant part of the page I get is:

    <td class="arsvrdoSearchTablelabelStyle"><span id="ctl00_mainContentArea_searchPopup_auctionDateLabel">Auction Date</span></td>
    <td><input name="ctl00$mainContentArea$searchPopup$auctionDateBeginTextBox" type="text" value="05/02/2011" id="ctl00_mainContentArea_searchPopup_auctionDateBeginTextBox" tabindex="8" class="arsvrdoDateWidth" /></td>
    <td class="percentageColWidth"><a href="javascript:" onclick="w_displayDatePicker('ctl00_mainContentArea_searchPopup_auctionDateBeginTextBox', false);return false;"> <img src="../images/calenderIcon.gif" alt="" /></a></td>
    <td><input name="ctl00$mainContentArea$searchPopup$auctionDateEndTextBox" type="text" value="05/06/2011" id="ctl00_mainContentArea_searchPopup_auctionDateEndTextBox" tabindex="9" class="arsvrdoDateWidth" /></td>
    <td class="percentageColWidth"><a href="javascript:" onclick="w_displayDatePicker('ctl00_mainContentArea_searchPopup_auctionDateEndTextBox', false);return false;"><img src="../images/calenderIcon.gif" alt="" /></a></td>

I have tried different values for __EVENTTARGET, such as

    #ctl00$mainContentArea$marketActivitySearchLinks$serachARSLink
    #ctl00$mainContentArea$gridViewPagingUserControl$page1LinkButton

but I always get back only the page http://emma.msrb.org/MarketActivity/RecentARS.aspx?showPopup=true and no results.

A tentative way using mechanize is:

    import mechanize
    from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup

    url = 'http://emma.msrb.org/MarketActivity/RecentARS.aspx?showPopup=true'
    headers = {'Cookie': 'DisclaimerCookie=yes;path=/'}
    request = mechanize.Request(url, headers=headers)
    response = mechanize.urlopen(request)
    the_cusips_page = response.read()
    cusips_page = BeautifulSoup(the_cusips_page)
    forms = mechanize.ParseResponse(response, backwards_compat=False)

but forms is empty...

Any comment would help; I find it a little hard to follow the examples in mechanize.

Thanks
n

---
Nicola Branzoli
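The __doPostBack() parsing mentioned above can be sketched roughly as follows. In ASP.NET pages, __doPostBack(eventTarget, eventArgument) copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and then submits the form, so a script can read those two values out of a link's href and post them itself. This is only an illustration under assumptions: Python 3, a hypothetical helper name, and a made-up href that reuses one of the control names quoted in the message. It is not the code the poster actually used.

```python
import re

# Matches the call inside href="javascript:__doPostBack('target','argument')"
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_fields(href):
    """Return the hidden-field values a __doPostBack link would set, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


if __name__ == "__main__":
    # Made-up href for illustration; the control name is taken from the thread.
    href = ("javascript:__doPostBack("
            "'ctl00$mainContentArea$gridViewPagingUserControl$page1LinkButton','')")
    print(postback_fields(href))
```

The returned dictionary would be merged into the rest of the form data before posting; on its own it is usually not a complete request.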
It looks like they're keeping the state of your session in a "__VIEWSTATE" hidden form field. The server might be putting you back on your current page because you aren't supplying that with your request.

My suggestions:

1. Download the Live HTTP Headers Firefox Add-on, or a similar HTTP headers viewer. Then, submit the forms manually in your browser and inspect the requests being sent. It will show you what is being sent to the server when you submit a form. The page you reference actually performs some validations in JavaScript before submitting; other pages might even alter the data in JavaScript before submitting, so you want to be sure you're mimicking what the page sends to the server.

2. Use the mechanize.Browser object. By default it posts all fields in your selected form. So, fields like __VIEWSTATE that you neglect to fill will get submitted automatically. This is useful if the web application uses a lot of hidden values to keep state.

Mechanize is much more thoroughly documented within its own codebase. Use the interactive Python interpreter to view the module and function documentation. Example:

    $ python
    Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56)
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from mechanize import Browser
    >>> help(Browser)
    >>> help(Browser.select_form)

Eric
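The hidden-field point can be illustrated without mechanize. The sketch below, assuming Python 3 and only the standard library, collects every <input type="hidden"> from a page and merges the values you want to set on top, so that __VIEWSTATE and similar fields travel with the request. The class and function names are hypothetical, and the sample HTML and its __VIEWSTATE value are made up; the real page will carry different (and much longer) values.

```python
from html.parser import HTMLParser
import urllib.parse


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback_data(page_html, overrides):
    """Return URL-encoded POST data: the page's hidden fields plus overrides."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields.update(overrides)  # values you set win over the page defaults
    return urllib.parse.urlencode(fields).encode("ascii")


if __name__ == "__main__":
    # Made-up page fragment for illustration only.
    sample = (
        '<form>'
        '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
        '<input type="hidden" name="__EVENTTARGET" value="" />'
        '<input type="text" name="q" value="x" />'
        '</form>'
    )
    print(build_postback_data(sample, {"__EVENTTARGET": "go"}))
```

The idea is to fetch the page once, build the data from that response, and post it back to the same URL. mechanize.Browser does the equivalent bookkeeping itself when you select and submit a form.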
participants (3)
- Eric Gierach
- Eric Gierach
- Nicola Branzoli