[ python-Bugs-1424148 ] urllib.FancyURLopener.redirect_internal loses data on POST!

Mon Feb 6 21:52:25 CET 2006

Bugs item #1424148, was opened at 2006-02-04 12:35
Message generated for change (Comment added) made by jimjjewett
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1424148&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 6
Submitted By: Robert Kiendl (kxroberto)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.FancyURLopener.redirect_internal loses data on POST!

Initial Comment:
    def redirect_internal(self, url, fp, errcode, errmsg, headers, data):
        if 'location' in headers:
            newurl = headers['location']
        elif 'uri' in headers:
            newurl = headers['uri']
        else:
            return
        void = fp.read()
        fp.close()
        # In case the server sent a relative URL, join with original:
        newurl = basejoin(self.type + ":" + url, newurl)
        return self.open(newurl)

... has to become ...

    def redirect_internal(self, url, fp, errcode, errmsg, headers, data):
        if 'location' in headers:
            newurl = headers['location']
        elif 'uri' in headers:
            newurl = headers['uri']
        else:
            return
        void = fp.read()
        fp.close()
        # In case the server sent a relative URL, join with original:
        newurl = basejoin(self.type + ":" + url, newurl)
        return self.open(newurl, data)

... I guess? (", data" added)

Robert
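
The join step in the method above can be tried in isolation. This
is only a sketch: in the old urllib, basejoin is the same operation
as urljoin, and in Python 3 it is available as
urllib.parse.urljoin. The function name resolve_redirect and the
example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Resolve a (possibly relative) Location value against the request URL.

    resolve_redirect is a hypothetical helper name; urljoin is the
    standard-library call doing the work.
    """
    return urljoin(request_url, location)


# A server-relative path replaces the whole path of the original URL.
print(resolve_redirect("http://example.com/forms/submit", "/thanks"))
# A bare file name is resolved against the directory of the original URL.
print(resolve_redirect("http://example.com/forms/submit", "done.html"))
# An absolute Location is used unchanged.
print(resolve_redirect("http://example.com/forms/submit", "http://other.example/x"))
```

Note that the join only decides where the follow-up request goes;
whether the POST data travels with it is the separate question
raised in this report.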

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2006-02-06 15:52

Message:
Logged In: YES 
user_id=764593

Sorry, I was trying to provide a quick explanation of why we 
couldn't just "do the obvious thing" and repost with data.

Yes, I realize that in practice, GET is used for
non-idempotent actions, and POST is (though less often) done
automatically.

But since that is the official policy, I wouldn't want to
bet too heavily against it in a courtroom -- so Python
defaults should be at least as conservative as both the spec
and the common practice.

----------------------------------------------------------------------

Comment By: John J Lee (jjlee)
Date: 2006-02-06 15:24

Message:
Logged In: YES 
user_id=261020

First, anyone replying to this, *please* read this page (and
the whole of this tracker note!) first:

http://ppewww.ph.gla.ac.uk/~flavell/www/post-redirect.html

kxroberto: you say that with standard urllibX error handling
you cannot get an exception on redirected 301/302/307 POST.
 That's not true of urllib2, since you may override
HTTPRedirectHandler.redirect_request(), which method was
designed and documented for precisely that purpose.  It
seems sensible to have a default that does what virtually
all browsers do (speaking as a long-time lynx user!).  I
don't know about the urllib case.
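
A minimal sketch of such an override, written against Python 3's
urllib.request (the descendant of urllib2), assuming the caller
wants an exception instead of a silent POST-to-GET redirect. The
class name NoPostRedirectHandler is invented for this example, and
the policy it encodes is just one possible choice, not a
recommended default.

```python
import urllib.error
import urllib.request


class NoPostRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: refuse to follow a redirect for a request with a body."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if req.data is not None:
            # The original request carried data (a POST). Surface the 30x
            # response as an error so the application can decide what to do.
            raise urllib.error.HTTPError(req.full_url, code, msg, headers, fp)
        # Requests without a body keep the library's default behaviour.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# Building the opener performs no network access.
opener = urllib.request.build_opener(NoPostRedirectHandler())
```

Calling opener.open(url, data) would then raise HTTPError on a
redirected POST; the exception carries the status code and the
response headers, including Location.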

It's perfectly reasonable to extend urllib (if necessary) to
allow the option of raising an exception.  Note that (IIRC!)
 urllib's exceptions do not contain the response body data,
however (urllib2's HTTPErrors do contain the response body
data).

It would of course break backwards compatibility to start
raising exceptions by default here.  I don't think it's
reasonable to break old code on the basis of a notional
security issue when the de-facto standard web client
behaviour is to do the redirect.  In reality, the only
"security" value of the original prescriptive rule was as a
convention to be followed by white-hat web programmers and
web client implementors to help users avoid unintentionally
re-submitting non-idempotent requests.  Since that
convention is NOT followed in the real world (lynx doesn't
count as the real world ;-), I see no value in sticking
rigidly to the original RFC spec -- especially when 2616
even provides 307 precisely in response to this problem. 
Other web client libraries, for example libwww-perl and Java
HTTPClient, do the same as Python here IIRC.  RFC 2616
section 10.3.4 even suggests web programmers use 302 to get
the behaviour you complain about!

The only doubtful case here is 301.  A decision was made on
the default behaviour in that case back when the tracker
item I pointed you to was resolved.  I think it's a mistake
to change our minds again on that default behaviour.

kxroberto.seek(nrBytes)
assert kxroberto.readline() == """\
To redirect POST as GET _while_ simply losing (!) the data
(and not appending it to the GET-URL) is most bad for a lib."""

No.  There is no value in supporting behaviour which is
simply contrary to both de-facto and prescriptive standards
(see final paragraph of RFC 2616 section 10.3.3: if we
accept the "GET on POST redirect" rule, we must accept that
the Location header is exactly the URL that should be
followed).  FYI, many servers return a redirect URL
containing the urlencoded POST data from the original request.
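
To illustrate that last point, here is a sketch of how a redirect
URL can carry the form fields, and how a client that follows
Location as-is still delivers them. Both function names and the
example.com URL are hypothetical; only urlencode, urlsplit and
parse_qs are real standard-library calls.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_redirect_location(base_url, form_fields):
    """Hypothetical server side: put the posted fields into the redirect URL."""
    return base_url + "?" + urlencode(form_fields)


def fields_from_location(location):
    """Client side: the fields arrive as the query of the followed URL."""
    return parse_qs(urlsplit(location).query)


location = build_redirect_location(
    "http://example.com/results", {"q": "spam eggs", "page": "2"}
)
print(location)
print(fields_from_location(location))
```

In this arrangement the client never has to re-send the POST
body; it only has to follow the URL the server handed back.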

kxroberto: """Don't know if the MS & netscape's also
transpose to GET with long data? ..."""

urllib2's behaviour (and urllib's, I believe) on these
issues is identical to that of IE and Firefox.

jimjewett: """In theory, a GET may be automatic, but a POST
requires user interaction, so the user can be held
accountable for the results of a POST, but not of a GET."""

That theory has been experimentally falsified ;-)

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2006-02-06 12:57

Message:
Logged In: YES 
user_id=764593

In theory, a GET may be automatic, but a POST requires user 
interaction, so the user can be held accountable for the 
results of a POST, but not of a GET.

Often, the page will respond to either; not sending the 
queries protects privacy in case of problems, and works more 
often than not.  (That said, I too would prefer a raised 
error or a transparent repost, at least as options.)

----------------------------------------------------------------------

Comment By: Robert Kiendl (kxroberto)
Date: 2006-02-06 05:29

Message:
Logged In: YES 
user_id=972995

> http://python.org/sf/549151

The analysis of the browsers is right. lynx does best, by
asking the user.
But urllibX is not a browser (application) but a lib: as of
now, with standard urllibX error handling, you cannot code a
lynx.

gvr's initial suggestion to raise a clear error (with the
redirection link as an attribute of the exception value) is
best. Another option would be to simply yield the
unredirected stub HTML and leave the 30X code (and the
redirection Location in the header).

To redirect POST as GET _while_ simply losing (!) the data
(and not appending it to the GET-URL) is most bad for a lib.
Smartly transcribing a short form-like POST to a GET with a
query would be so-so.
Don't know if the MS & netscape's also transpose to GET with
long data? ...

The current behaviour is the worst of all 4. All other
methods would at least have raised an early hint/error in
my case.

----------------------------------------------------------------------

Comment By: John J Lee (jjlee)
Date: 2006-02-05 19:54

Message:
Logged In: YES 
user_id=261020

This is not a bug.
See the long discussion here:
http://python.org/sf/549151

----------------------------------------------------------------------

Comment By: Robert Kiendl (kxroberto)
Date: 2006-02-04 15:10

Message:
Logged In: YES 
user_id=972995

Found http://www.faqs.org/rfcs/rfc2616.html (below).
But the behaviour is still strange, and the bug even more
serious: a silent redirection of a POST as GET without data
is obscure for a language like Python. It leads to
unpredictable results. The half-completed execution cannot
be stopped, and everything is left to a good reaction from
the server and complex reinterpretation by the client.
Python urllibX should by default yield the 30X code for a
POST redirection and provide the first HTML: usually a
redirection HTML stub with <a href=...
That would be consistent with the RFC: the User
(=Application! not Python!) can redirect under full control
without generating a wrong call! In my application, a bug
went unseen for a long time because of this wrong behaviour.
With a 30X stub it would have been easy to discover and
understand ...

urllib2 has the same bug with POST redirection.

=======
10.3.2 301 Moved Permanently

   The requested resource has been assigned a new permanent URI and any
   future references to this resource SHOULD use one of the returned
   URIs.  Clients with link editing capabilities ought to automatically
   re-link references to the Request-URI to one or more of the new
   references returned by the server, where possible. This response is
   cacheable unless indicated otherwise.

   The new permanent URI SHOULD be given by the Location field in the
   response. Unless the request method was HEAD, the entity of the
   response SHOULD contain a short hypertext note with a hyperlink to
   the new URI(s).

   If the 301 status code is received in response to a request other
   than GET or HEAD, the user agent MUST NOT automatically redirect the
   request unless it can be confirmed by the user, since this might
   change the conditions under which the request was issued.

      Note: When automatically redirecting a POST request after
      receiving a 301 status code, some existing HTTP/1.0 user agents
      will erroneously change it into a GET request.
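
Read literally, the rule quoted above can be written as a small
predicate. This is only a sketch of the prescriptive rule, not of
what urllib or any browser actually does. The function name is
invented, and the inclusion of 302 and 307 is an assumption based
on RFC 2616 sections 10.3.3 and 10.3.8, which carry the same
clause but are not quoted here.

```python
# Methods the quoted rule allows to be redirected automatically.
AUTO_REDIRECT_METHODS = {"GET", "HEAD"}

# 301 is taken from the section quoted above; 302 and 307 are assumed to
# follow the same clause (RFC 2616 sections 10.3.3 and 10.3.8).
CONFIRMATION_CODES = {301, 302, 307}


def needs_user_confirmation(status, method):
    """Return True if the RFC text forbids redirecting without asking the user.

    needs_user_confirmation is a hypothetical name used for illustration.
    """
    return status in CONFIRMATION_CODES and method.upper() not in AUTO_REDIRECT_METHODS


print(needs_user_confirmation(301, "POST"))
print(needs_user_confirmation(301, "GET"))
```

A library has no user to ask, which is why the thread discusses
raising an error or handing back the 30X response as the nearest
equivalents.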

----------------------------------------------------------------------
