[Python-Dev] teaching the new urllib

Wed Feb 4 02:41:34 CET 2009

Brett Cannon <brett at python.org> wrote:

> On Tue, Feb 3, 2009 at 15:50, Tres Seaver <tseaver at palladion.com> wrote:
> >
> > Brett Cannon wrote:
> >> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millbr02 at luther.edu> wrote:
> >>> I'm just getting ready to start the semester using my new book (Python
> >>> Programming in Context) and noticed that I somehow missed all the changes to
> >>> urllib in Python 3.0.  ARGH to say the least.  I like using urllib in the
> >>> intro class because we can get data from places that are more
> >>> interesting/motivating/relevant to the students.
> >>> Here are some of my observations on trying to do very basic stuff with
> >>> urllib:
> >>> 1.  urllib.urlopen  is now urllib.request.urlopen
> >>
> >> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
> >> 3108 for the details of the reorganization.
> >>
> >>> 2.  The object returned by urlopen is no longer iterable!  no more for line
> >>> in url.
> >>
> >> That is probably a difference between urllib2 and urllib.
> >>
> >>> 3.  read, readline, readlines now return bytes objects or lists of bytes
> >>> instead of a str and list of str
> >>
> >> Correct.
> >>
> >>> 4.  Taking the naive approach to converting a bytes object to a str does not
> >>> work as you would expect.
> >>>
> >>> >>> import urllib.request
> >>> >>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
> >>> >>> page
> >>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
> >>> >>> line = page.readline()
> >>> >>> line
> >>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
> >>> >>> str(line)
> >>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
> >>> As you can see from the example the 'b' becomes part of the string!  It
> >>> seems like this should be a bug, is it?
> >>>
> >>
> >> No, because you are getting back the repr for the bytes object. str()
> >> does not know what the encoding is for the bytes, so it has no way of
> >> performing the decoding.
> >
> > The encoding information *is* available in the response headers, e.g.:
> >
> > ---------------------- %< ---------------------------------
> > $ wget -S --spider http://knuth.luther.edu/test.html
> > --18:46:24--  http://knuth.luther.edu/test.html
> >           => `test.html'
> > Resolving knuth.luther.edu... 192.203.196.71
> > Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> > HTTP request sent, awaiting response...
> >  HTTP/1.1 200 OK
> >  Date: Tue, 03 Feb 2009 23:46:28 GMT
> >  Server: Apache/2.0.50 (Linux/SUSE)
> >  Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
> >  ETag: "2fcd8-1d8-43b2bf40"
> >  Accept-Ranges: bytes
> >  Content-Length: 472
> >  Keep-Alive: timeout=15, max=100
> >  Connection: Keep-Alive
> >  Content-Type: text/html; charset=ISO-8859-1
> > Length: 472 [text/html]
> > 200 OK
> > ---------------------- %< ---------------------------------
> >
> 
> Right, but he was asking about why passing bytes to str() led to it
> returning the repr.
> 
> > So, the OP's use case *could* be satisfied, assuming that the Py3K
> > version of urllib sprouted a means of leveraging that header.  In this
> > sense, fetching the resource over HTTP is *better* than loading it from
> > a file:  information about the character set is explicit, and highly
> > likely to be correct, at least for any resource people expect to render
> > cleanly in a browser.
> 
> Right. And even if the header lacks the info (Content-Type is not
> guaranteed to contain the charset), there is also the chance for the
> HTML or DOCTYPE declaration to say.
> 
> But as Bill pointed out, urllib just fetches data via HTTP, so a
> character encoding will not always be valuable. The best solution would
> be to provide something in the html package that can take what
> urllib.request.urlopen returns and handle the decoding.

Yes, that sounds like the right solution to me, too.
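
For what it's worth, here is a rough sketch of the kind of helper being discussed. It only shows the difference between str(line) and an actual decode driven by the Content-Type header. The name decode_body and the ISO-8859-1 fallback are my own assumptions for illustration; nothing like this exists in urllib or html as far as this thread establishes.

```python
from email.message import Message


def decode_body(raw, content_type, fallback="iso-8859-1"):
    """Decode raw bytes using the charset named in a Content-Type value.

    decode_body is a hypothetical helper, not an existing urllib/html API.
    The fallback encoding is an assumption, used when no charset is given.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header is missing or carries no charset.
    charset = msg.get_content_charset() or fallback
    return raw.decode(charset)


line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

# str() without an encoding returns the repr, which is what the OP saw.
as_repr = str(line)

# Using the header value from the wget output above gives real text.
as_text = decode_body(line, "text/html; charset=ISO-8859-1")
```

With a real response the header value would come from page.headers. On recent Python 3 releases that object is itself an email.message.Message, so page.headers.get_content_charset() should give the same answer directly. A charset name Python does not recognize would still raise LookupError in this sketch.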

Bill