[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

David Kim davidkim05 at gmail.com
Tue Jul 7 20:03:07 CEST 2009


Thanks Kent, perhaps I'll cool the Python jets and move on to HTTP and
HTML. I was hoping it would be something I could just pick up along
the way, but it looks like I was wrong.

dk

On Tue, Jul 7, 2009 at 1:56 PM, Kent Johnson<kent37 at tds.net> wrote:
> On Tue, Jul 7, 2009 at 1:20 PM, David Kim<davidkim05 at gmail.com> wrote:
>> On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson<kent37 at tds.net> wrote:
>>>
>>> curl works because it ignores the redirect to the ToS page, and the
>>> site is (astoundingly) dumb enough to serve the content with the
>>> redirect. You could make urllib2 behave the same way by defining a 302
>>> handler that does nothing.
>>
>> Many thanks for the redirect pointer! I also found
>> http://diveintopython.org/http_web_services/redirects.html. Is the
>> handler class on this page what you mean by a handler that does
>> nothing? (It looks like it exposes the error code but still follows
>> the redirect).
>
> No, all of those examples are handling the redirect. The
> SmartRedirectHandler just captures additional status. I think you need
> something like this:
>
> class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
>     def http_error_301(self, req, fp, code, msg, headers):
>         return None
>
>     def http_error_302(self, req, fp, code, msg, headers):
>         return None
>
>> I guess I'm still a little confused since, if the
>> handler does nothing, won't I still go to the ToS page?
>
> No, it is the action of the handler, responding to the redirect
> request, that causes the ToS page to be fetched.
>
>> For example, I ran the following code (found at
>> http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)
>
> That is pretty similar to the DiP code...
>
>> I suspect I am not understanding something basic about how urllib2
>> deals with this redirect issue since it seems everything I try gives
>> me the same ToS page.
>
> Maybe you don't understand how redirect works in general...
>
>>> Generally you have to post to the same url as the form, giving the
>>> same data the form does. You can inspect the source of the form to
>>> figure this out. In this case the form is
>>>
>>> <form method="post" action="/products/consent.php">
>>> <input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
>>> name="urltarget"/>
>>> <input type="hidden" value="1" name="check_one"/>
>>> <input type="hidden" value="tiwdata" name="tag"/>
>>> <input type="submit" value="I Agree" name="acknowledgement"/>
>>> <input type="submit" value="Decline" name="acknowledgement"/>
>>> </form>
>>>
>>> You generally need to enable cookie support in urllib2 as well,
>>> because the site will use a cookie to flag that you saw the consent
>>> form. This tutorial shows how to enable cookies and submit form data:
>>> http://personalpages.tds.net/~kent37/kk/00010.html
>>
>> I have seen the login examples where one provides values for the
>> fields username and password (thanks Kent). Given the form above,
>> however, it's unclear to me how one POSTs the form data when you
>> aren't actually passing any parameters. Perhaps this is less of a
>> Python question and more an HTTP question (which unfortunately I know
>> nothing about either).
>
> Yes, the parameters are listed in the form.
>
> If you don't have at least a basic understanding of HTTP and HTML you
> are going to have trouble with this project...
>
> Kent
>
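For reference, here is a minimal sketch of the "handler that does nothing" idea from the quoted reply. It is written against Python 3's urllib.request (the successor to urllib2), so it is an adaptation and not Kent's exact code. One difference is deliberate: as far as I can tell, a handler that returns None lets urllib fall through to its default error handler, which raises HTTPError, so this version returns the 3xx response object itself and the caller can read whatever body came with it. The helper name build_no_redirect_opener and the URL in the usage comment are made up for illustration.

```python
import urllib.request


class IgnoreRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Return the 3xx response as-is instead of following its Location."""

    def http_error_302(self, req, fp, code, msg, headers):
        # fp is the response object for the redirect; handing it back
        # stops urllib from issuing a second request.
        return fp

    # Treat the other common redirect codes the same way.
    http_error_301 = http_error_303 = http_error_307 = http_error_302


def build_no_redirect_opener():
    """Hypothetical helper: an opener whose redirect handling is disabled."""
    # build_opener() drops the stock HTTPRedirectHandler when it is given
    # an instance of a subclass of it.
    return urllib.request.build_opener(IgnoreRedirectHandler())


# Usage sketch (hypothetical URL, nothing is fetched here):
#   opener = build_no_redirect_opener()
#   response = opener.open("http://example.invalid/some/page")
#   print(response.status, response.headers.get("Location"))
#   body = response.read()
```

Whether the body of the redirect response contains anything useful depends entirely on the server; most sites send an empty or stub body with a 302.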

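The other route discussed in the thread, posting the consent form with cookies enabled, could look roughly like the sketch below. The field names and values are copied from the form quoted above; the "parameters" are simply the hidden inputs plus the submit button that was clicked. This again uses Python 3's urllib.request and http.cookiejar in place of urllib2 and cookielib. The site's base URL does not appear in this message, so BASE_URL is a placeholder, and the function names are made up for illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Name/value pairs taken from the <input> elements of the quoted form.
# A browser sends only the submit button that was pressed, so "I Agree"
# is included and "Decline" is not.
CONSENT_FIELDS = {
    "urltarget": "tiwd/products/derivserv/data_table_i.php",
    "check_one": "1",
    "tag": "tiwdata",
    "acknowledgement": "I Agree",
}


def build_consent_request(base_url):
    """Build (but do not send) the POST a browser would make for the form."""
    # The form's action is a site-relative path, so resolve it against
    # the base URL of the page that served the form.
    url = urllib.parse.urljoin(base_url, "/products/consent.php")
    body = urllib.parse.urlencode(CONSENT_FIELDS).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


if __name__ == "__main__":
    BASE_URL = "http://example.invalid/"  # placeholder, not the real site
    opener, jar = build_cookie_opener()
    request = build_consent_request(BASE_URL)
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
    # opener.open(request) would submit the form; later opener.open(...)
    # calls would then carry any consent cookie the server set.
```

Using one opener for both the POST and the later data request matters, since the cookie jar is what carries the "I agreed" flag between them.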


-- 
morenotestoself.wordpress.com

