[ python-Bugs-626543 ] urllib2 doesn't do HTTP-EQUIV & Refresh

Wed Feb 1 21:31:13 CET 2006

Bugs item #626543, was opened at 2002-10-21 21:57
Message generated for change (Settings changed) made by jjlee
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=626543&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: John J Lee (jjlee)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib2 doesn't do HTTP-EQUIV & Refresh

Initial Comment:
I just added support for HTML's META HTTP-EQUIV and
zero-time Refresh HTTP headers to my 'ClientCookie'
package (which exports essentially a clone of the
urllib2 interface that knows about cookies, making use
of urllib2 in the implementation).  I didn't make a
patch for urllib2 itself but it would be easy to do so.
I don't plan to do this immediately, but will
eventually (assuming Jeremy thinks it's advisable) -- I
just wanted to register this fact to prevent
duplication of effort.

[BTW, this version of ClientCookie isn't on my web page
yet -- my motherboard just died.]

I'm sure you know this already, but: HTTP-EQUIV is just
a way of putting headers in the HEAD section of an HTML
document; Refresh is a Netscape 1.1 header that
indicates that a browser should redirect after a
specified time.  Refresh headers with zero time act
like redirections.
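
As a rough illustration of the HTTP-EQUIV idea above, here is a minimal
sketch that pulls the header pairs out of a document's HEAD section. It
uses Python 3's html.parser.HTMLParser rather than the htmllib code that
ClientCookie used at the time; the names HTTPEquivParser and
http_equiv_headers are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class HTTPEquivParser(HTMLParser):
    """Collect (name, value) pairs from META HTTP-EQUIV tags in the HEAD."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._head_done = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if self._head_done or tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("http-equiv")
        value = attrs.get("content")
        if name and value is not None:
            self.headers.append((name, value))

    def handle_endtag(self, tag):
        # Ignore any META tags that appear after the HEAD has closed.
        if tag == "head":
            self._head_done = True


def http_equiv_headers(html):
    """Return the HTTP-EQUIV headers found in an HTML string."""
    parser = HTTPEquivParser()
    parser.feed(html)
    parser.close()
    return parser.headers
```

The returned pairs could then be merged into the response headers or kept
separate, which is the choice raised in issue 2 below.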

The net result of the code I just wrote is that if you
urlopen a URL that points to an HTML document like
this:

<HTML><HEAD>
<META HTTP-EQUIV="Refresh" CONTENT="0; URL=http://acme.com/new_url.htm">
</HEAD></HTML>

you're automatically redirected to
"http://acme.com/new_url.htm".  Same thing happens if
the Refresh is in the HTTP headers, because all the
HTTP-EQUIV headers are treated like real HTTP headers.
Refresh with non-zero delay time is ignored (the
urlopen returns the document body unchanged and does
not redirect, but does still add the Refresh header to
the HTTP headers).
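
The zero versus non-zero rule described above could be sketched roughly as
follows. This is not the ClientCookie code; parse_refresh and
redirect_target are hypothetical helpers, and the parsing is deliberately
simple (it assumes the "delay; URL=target" shape shown in the example and
does not cope with every variant seen in the wild).

```python
def parse_refresh(value):
    """Split a Refresh value into (delay_seconds, url_or_None).

    Returns None when the delay is not a number.
    """
    delay_text, _, rest = value.partition(";")
    try:
        delay = float(delay_text.strip())
    except ValueError:
        return None
    rest = rest.strip()
    # The "URL=" prefix is matched case-insensitively.
    if rest[:4].lower() == "url=":
        rest = rest[4:].strip()
    rest = rest.strip("\"'")
    return delay, (rest or None)


def redirect_target(value):
    """Return the URL to follow for a zero-delay Refresh, else None."""
    parsed = parse_refresh(value)
    if parsed is None:
        return None
    delay, url = parsed
    # Only a zero delay is treated like a redirection; anything else is ignored.
    return url if delay == 0 else None
```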

A few issues:

0) AFAIK, the Refresh header is not specified in any
RFC, but only here:

http://wp.netscape.com/assist/net_sites/pushpull.html

(HTTP-EQUIV seems to be in the HTML 4.0 standard, maybe
earlier ones too)

1) Infinite loops should be detected, as for HTTP 30x?
   Presumably yes.

2) Should add HTTP-EQUIV headers to response object, or
   just treat them like headers internally?  Perhaps it
   should be possible to get both behaviours?

3) Bug in my implementation: it is greedy about reading
   body data from httplib's file object.

John

----------------------------------------------------------------------

>Comment By: John J Lee (jjlee)
Date: 2006-02-01 20:31

Message:
Logged In: YES 
user_id=261020

Closing since I no longer intend to contribute this.

(I don't want to get involved with HTML parsing in the stdlib!)

----------------------------------------------------------------------

Comment By: John J Lee (jjlee)
Date: 2003-10-29 23:27

Message:
Logged In: YES 
user_id=261020

Just an update: 

- this could now be implemented as a handler (and already is, 
in my ClientCookie package) using RFE 759792, rather than 
having to be mixed in with HTTPHandler 

- the issues I listed in my initial comment, and the 
backwards-compatibility issue raised by MvL are now 
resolved 

- it needs reimplementing using HTMLParser (currently uses 
htmllib) if it's to go in the standard library; I plan to do this in 
time for 2.4 

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-10-26 14:30

Message:
Logged In: YES 
user_id=21627

I would try to subclass HTTPHandler, and then provide a
build_opener wrapper that installs this handler instead of
the normal http handler (the latter is optional, since the
user could just do build_opener(HTTPRefreshHandler)).
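
A minimal sketch of that suggestion, written against urllib.request (the
Python 3 successor of urllib2), might look like the following. Everything
here is an assumption made for illustration: the follow_refresh method and
the build_refresh_opener wrapper are hypothetical names, only Refresh in
the real HTTP headers is handled (not META HTTP-EQUIV), only a literal "0"
delay counts as zero, and there is no loop detection.

```python
import urllib.request
from urllib.parse import urljoin


class HTTPRefreshHandler(urllib.request.HTTPHandler):
    """Hypothetical opt-in handler that follows zero-delay Refresh headers."""

    def http_open(self, req):
        return self.follow_refresh(req, super().http_open(req))

    def follow_refresh(self, req, response):
        refresh = response.headers.get("Refresh") or ""
        delay, _, target = refresh.partition(";")
        target = target.strip()
        if target[:4].lower() == "url=":
            target = target[4:].strip()
        if delay.strip() != "0" or not target:
            return response  # no Refresh, or a non-zero delay: hand back as-is
        # Assumption: no loop detection; a real handler needs a redirect limit.
        return self.parent.open(urljoin(req.full_url, target))


def build_refresh_opener(*handlers):
    """build_opener wrapper that installs HTTPRefreshHandler.

    Because the class subclasses HTTPHandler, build_opener installs it
    instead of the default HTTP handler.
    """
    return urllib.request.build_opener(HTTPRefreshHandler, *handlers)
```

Callers who want the new behaviour would use build_refresh_opener().open(url);
code that keeps calling the plain urlopen would still receive the
meta-refresh document unchanged.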

----------------------------------------------------------------------

Comment By: John J Lee (jjlee)
Date: 2002-10-24 00:20

Message:
Logged In: YES 
user_id=261020

What do you think the solution to the backwards-
compatibility problem is?  Leave urllib2 as-is?  Add a
switch to turn it on?  Something else?

At the moment, I just deal with it in AbstractHTTPHandler.
It would be nice to treat it like the other redirections, by
writing a RefreshHandler -- this would solve the backwards-
compatibility issue.  However, OpenerDirector.error always
calls http_error_xxx ATM (where xxx is the HTTP error code),
so without changing that, I don't think a RefreshHandler is
really possible.  I suppose the sensible solution is just to
make a new HTTPHandler and HTTPSHandler?

Can you think of any way in which supporting HTTP-EQUIV
would mess up backwards compatibility, assuming the body is
unchanged but the headers do have the HTTP-EQUIV headers
added?

John

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-10-23 14:54

Message:
Logged In: YES 
user_id=21627

In addition to the issues you have mentioned, there is also 
the backwards compatibility issue: Some applications might 
expect to get a meta-refresh document from urllib, then parse 
it and retry themselves. Those applications would break with 
such a change.

----------------------------------------------------------------------

