[issue21228] Missing enumeration of HTTPResponse Objects methods of urllib.request.urlopen's http.client.HTTPResponse?

New submission from Evens Fortuné: In the Python Library documentation, in section "21.6. urllib.request — Extensible library for opening URLs", in the description of the urllib.request.urlopen() function it is writen: --------- […] For http and https urls, this function returns a http.client.HTTPResponse object which has the following HTTPResponse Objects methods. For ftp, file, and data urls and requests explicity handled by legacy […] --------- The first sentence seemed to imply that something is supposed to be specified if I understand correctly. Is it me missing something ? ---------- assignee: docs@python components: Documentation messages: 216254 nosy: EvensF, docs@python priority: normal severity: normal status: open title: Missing enumeration of HTTPResponse Objects methods of urllib.request.urlopen's http.client.HTTPResponse? type: enhancement versions: Python 3.2, Python 3.3, Python 3.4, Python 3.5 _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: I forgot to tell that it is in section "20.5. urllib.request — Extensible library for opening URLs" for Python 3.2.5 ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Senthil Kumaran added the comment: Hello Evens, If you can, then please attach a doc to this and we and fix this soon. ---------- nosy: +orsenthil _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: I don't quite understand what you are asking me. You need a copy of the document ? You can find an example at this link: https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Changes by Evens Fortuné <evens.fortune@gmail.com>: Added file: http://bugs.python.org/file34905/urllib.request.rst _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

R. David Murray added the comment: That section of the docs is indeed rather confusing. Probably it just needs to be changed to say "for the methods supported by this object, see HTTPResponse Objects. I'd like to see the docs reorganized so that the '.. class' declaration in the docs is immediately followed by the class's methods, but that's a much bigger project. ---------- nosy: +r.david.murray _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: I interpreted it more along the lines of “. . . returns a http.client.HTTPResponse object [with] the following [additional] methods.” Indeed, a HTTP urlopen() response I just tried does have info(), geturl() and getcode() methods, and I know the info() method is used in the real world. Also, it would be good to document that the HTTP response’s “msg” attribute does not actually hold the header, despite the HTTPResponse documentation. Further, the return value of BaseHandler.default_open() is defined to be the same as urlopen(), but when a HTTP error occurs I have found the “msg” attribute is meant to be the HTTP status text phrase (e.g. “Not Found”). Perhaps it would be good to add something like these two points wherever they belong: * The “msg” attribute returned by urlopen() does not hold the HTTP header, despite the “HTTPResponse” documentation * The “msg” attribute should be set to the HTTP status text phrase (HTTPResponse.reason) ---------- nosy: +vadmium _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: Hi,
From my limited experience reporting documentation issues, I see that it's better to submit a patch than to only report an issue. So, I've tried to look into the source code to figure out what was going on. I have attached a patch that I'm submitting to you for review hoping I doing everything right.
What was reported as ambiguous in this issue is the description of the return value of the function urllib.request.urlopen() for http and https URLs. It was mentionned that it should be an http.client.HTTPResponse object but it implied that something may have been different about this object. To understand why I'm may now be able to assert what's being said in that patch, follow me in the source code. It's based on revision c499cc2c4a06. If you don't care about all the walkthrough you can skip to step 9. 1. We want to describe the return value of the urllib.request.urlopen() for http and https URLs 2. The urlopen() function is defined in Lib/urllib/request.py starting on line 138. Its return value is the return value function of the opener.open() method (line 153) * This opener object is defined in one of these locations: * On line 150 as the return value of the module function build_opener() (this return value is cached in the _opener module variable) * On line 152 as the value cached in the _opener module variable * On line 148 still as the return value of the module function build_opener() in the case if you want to access an HTTPS URL and you provide a cafile, capath or cadefault argument * So either way, the opener object come from the build_opener directly or indirectly. 3. The build_opener() function is defined starting on line 505. Its return value (line 539) is an instance of the OpenerDirector class (line 514). The OpenerDirector class is defined starting on line 363. a. Before returning its return value, after some checks (lines 522-530, 535-536), build_opener() calls the OpenerDirector().add_handler() with an instance of some of the classes defined in the default_classes list (line 515-518). What matters to us is the HTTPHandler class and the HTTPSHandler class (line 520). b. The OpenerDirector().add_handler() method (line 375) takes the HTTPHandler class (line 1196) and: * Insert the HTTPHandler.http_open() method in the list stored as the value of OpenerDirector().handle_open['http']. * Insert the HTTPHandler.http_request() method in the list stored as the value of OpenerDirector().process_request['http']. c. For HTTPSHandler (line 1203) is the same thing but : * HTTPSHandler.https_open() for OpenerDirector().handle_open['https'] * HTTPSHandler.https_request() for OpenerDirector().process_request['https'] 4. I remind you that we are looking for the return value of the method open() of an instance of the OpenerDirector class (see point number 2). This method is defined starting on line 437. 5. The OpenDirector.open() method's return value is the response variable (line 463) 6. This variable is defined on lines 461 and 455. a. The loop on lines 458-461 tries to find in his handlers (the OpenerDirector().process_response dictionary) a response processor (a XXX.http_response() method) which isn't defined in HTTPHandler or HTTPSHandler. (a http_response() method is defined in HTTPErrorProcessor [line 564] and in HTTPCookieProcessor [line 1231] but in each of these cases, these classes don't modify the response value) b. So response variable's value is the return value of OpenerDirector()._open(req, data) on line 455. * The req argument is a Request instance (line 440) or something that has the same interface, I guess (line 442). The Request class is defined on line 253. * The data argument is included in the constructor of the Request instance (line 440 and then on line 262) or added to the object provided (line 444). Afterwards, it won't be used directly (OpenerDirector()._open() receives it as an argument but won't use it in its body) 7. OpenerDirector()._open() is defined on line 465. It will call OpenerDirector()._call_chain() up to three times depending on whether a result has been found (lines 468-469, 474-475). * OpenerDirector()._call_chain() is defined on line 426. All it does is calling the handlers registered in the dictionnary provided (the chain argument) until one returns something else than None and returns it. * In our case (retrieving http and http resources): a. The first call (line 466) will return None since HTTPHandler or HTTPSHandler don't have a default_open() method (in fact, no handler defined in this file has a default_open() method) b. The second call will work since HTTPHandler.http_open() (line 1198) and HTTPSHandler.https_open() (line 1212) exists. Their return values will be enventually what we are looking for. 8. HTTPHandler.http_open() and HTTPSHandler.https_open() returns the return value of do_open() method defined (on line 1134) in their mutual superclass AbstractHTTPHandler (line 1086). They will call it with http.client.HTTPConnection and req in the case of HTTPHandler and http.client.HTTPSConnection and req in the case of HTTPSHandler with a few other arguments. 9. OpenerDirector().do_open() creates a http.client.HTTPSConnection object (line 1144) and calls its request() method (line 1173) and if it works, calls its getreponse() method (line 1178). This return value is the HTTPResponse object we are looking for. 10. Finally we get our answer: * On line 1186, an url attribute is added to this HTTPResponse object * On line 1192, the msg attribute is replaced by the reason attribute I hope this is what was needed to close this issue. Otherwise, just tell me what is missing. Oh and there seems that there are be many things that could be refactored. Can I do it and open issues about them ? ---------- keywords: +patch Added file: http://bugs.python.org/file36539/issue21228.patch _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: With this patch, there is no longer any implication that the returned object implements the “addinfourl” interface. Perhaps that should be added back. Or maybe add it to the HTTPResponse class documentation itself? There is a comment that says the methods are there “for compatibility with old-style urllib responses”, although it seems to me they also make the class compatible with Python 3’s new “urllib”: http://hg.python.org/cpython/file/c499cc2c4a06/Lib/http/client.py#l772 It is good to document the “msg” attribute and its inconsistency, since I have found this is required to implement your own BaseHandler.default_open(). However I’m not so sure if it is necessary to document the “url” attribute. Why not encourage using geturl() instead, since it is already documented, at least for non-HTTP responses? I also saw a comment against the “msg” attribute which mentions deprecating something, but it is not clear what: http://hg.python.org/cpython/file/c499cc2c4a06/Lib/urllib/request.py#l1187 ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: Well, there wasn't any indication before that the returned object was implementing the "addinfourl" interface. So I don't think we have lost anything. In what situation this interface is useful ? The following comment (that you had highlighted in your comment) gives the impression that theses methods are there only to provide compatibility with clients using old-styles responses. http://hg.python.org/cpython/file/c499cc2c4a06/Lib/http/client.py#l772 That would imply that newer clients would usually not use these methods. If you want to document this, I think the "addinfourl" interface should then be better described somewhere else where it would include the fact that the HTTPResponse class implements it. Anyway, I don't see the advantage of using a getter method (like geturl()) instead of accessing directly the attribute. For me, this is less pythonic. If you ever have to attach a behaviour to the access of this attribute, a property could then be defined. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

R. David Murray added the comment: Indeed, geturl is deprecated. I'm not sure where you see it documented, I don't see it. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: To be honest, it may be inspired by what's written a few lines lower, for ftp, files dans data urls even though the return object is not the same as the http(s) urls http://hg.python.org/cpython/file/c499cc2c4a06/Doc/library/urllib.request.rs... ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

R. David Murray added the comment: Ah, I was looking at the http docs. I wonder if we just missed the urllib docs when we made the changes. Either that, or I'm misremembering things. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: By removing the “addinfourl” methods for HTTP responses, you are making it unnecessarily hard to handle header fields and metadata from the response. I do not know of any other documented way of getting the eventual redirect target, other than geturl(). And code to parse the Content-Type, which I understand is also returned for file: URLs, would start to get ugly (untested): info = getattr(response, "info", None) if info: type = info().get_content_type() # Easy! else: import cgi type = cgi.parse_header(response.getheader("Content-Type))[0] Since a HTTP status code only seems to be returned for HTTP responses, deprecating or removing getcode() in favour of “HTTPResponse.status” might be okay, but I think info() and geturl() should stay. Maybe a “url” attribute is more Pythonic, but as far as I am aware that has never been documented for Python 2 or 3 for any URL type. I would not expect much existing code to be using it, and deprecating geturl() seems like a needless annoyance. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

R. David Murray added the comment: I think I'm thinking of the Request API, and not the Response API. So ignore my comments about deprecation, I'm not sure what the status is. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: Look, the subject of this issue is to clarify the methods of the urllib.request.urlopen()'s return value for http(s) URLs. Nobody seemed to work on this for 4 months. That's why I tried to submit a patch after looking into the code to try to do my part to help. If you think that my patch is not clear enough and would need to be more precise or maybe even that it is plain wrong, please submit a new patch with your ideas so that we can close eventually this issue. I really didn't think that it would get this complicated… ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: Fair enough, challenge accepted! Here is my attempt. I have explicitly made the info(), geturl() and getcode() methods available for all cases, and used Evens’s wording for the modified “msg” attribute, but dropped mentioning the “url” attribute. ---------- Added file: http://bugs.python.org/file36597/addinfourl-first.patch _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Evens Fortuné added the comment: I'm satisfied with this new patch. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: Is there anything else that needs doing to get this committed and resolved? My patch is just a trivial reordering of Evens’s, with about one sentence deleted. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Changes by Berker Peksag <berker.peksag@gmail.com>: ---------- nosy: +berker.peksag stage: -> patch review versions: -Python 3.2, Python 3.3 _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Martin Panter added the comment: Related: Issue 12707, about deprecating some methods in favour of attributes ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Roundup Robot added the comment: New changeset fa3c9faabfb0 by Martin Panter in branch '3.4': Issues #22989, #21228: Document HTTP response object for urlopen() https://hg.python.org/cpython/rev/fa3c9faabfb0 New changeset b55c006b79bc by Martin Panter in branch '3.5': Issue #22989, #21228: Merge urlopen() doc from 3.4 into 3.5 https://hg.python.org/cpython/rev/b55c006b79bc New changeset c6930661599b by Martin Panter in branch 'default': Issue #22989, #21228: Merge urlopen() doc from 3.5 https://hg.python.org/cpython/rev/c6930661599b ---------- nosy: +python-dev _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Changes by Martin Panter <vadmium+py@gmail.com>: ---------- resolution: -> fixed stage: patch review -> resolved status: open -> closed _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________

Changes by Martin Panter <vadmium+py@gmail.com>: ---------- versions: +Python 3.6 _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue21228> _______________________________________
participants (6)
-
Berker Peksag
-
Evens Fortuné
-
Martin Panter
-
R. David Murray
-
Roundup Robot
-
Senthil Kumaran