[Python-Dev] teaching the new urllib

Tue Feb 3 20:56:33 CET 2009

On Tue, Feb 3, 2009 at 11:08, Brad Miller <millbr02 at luther.edu> wrote:
> I'm just getting ready to start the semester using my new book (Python
> Programming in Context) and noticed that I somehow missed all the changes to
> urllib in Python 3.0.  ARGH to say the least.  I like using urllib in the
> intro class because we can get data from places that are more
> interesting/motivating/relevant to the students.
> Here are some of my observations on trying to do very basic stuff with
> urllib:
> 1.  urllib.urlopen  is now urllib.request.urlopen

Technically urllib2.urlopen became urllib.request.urlopen. See PEP
3108 for the details of the reorganization.

> 2.  The object returned by urlopen is no longer iterable!  no more for line
> in url.

That is probably a difference between urllib2 and urllib.

> 3.  read, readline, readlines now return bytes objects or arrays of bytes
> instead of a str and array of str

Correct.

> 4.  Taking the naive approach to converting a bytes object to a str does not
> work as you would expect.
>
>>>> import urllib.request
>>>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>>>> page
> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>>>> line = page.readline()
>>>> line
> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>>>> str(line)
> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>>>>
> As you can see from the example the 'b' becomes part of the string!  It
> seems like this should be a bug, is it?
>

No, because you are getting back the repr of the bytes object. str()
does not know what encoding the bytes are in, so it has no way of
performing the decoding.
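
A minimal sketch of the difference, using the line from the session
above. The 'ascii' encoding is an assumption that happens to fit this
DOCTYPE line; a real page needs whatever encoding it declares.

```python
# The bytes object returned by readline() in the session above.
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

# With no encoding given, str() falls back to the repr, b'' prefix included.
as_repr = str(line)

# With an explicit encoding the bytes are actually decoded to text.
# 'ascii' is an assumption for this example, not a general answer.
text = line.decode('ascii')
same_text = str(line, 'ascii')  # equivalent spelling of decode()

print(as_repr)
print(text)
```

Either decode() or the two-argument form of str() gives a real string
that students can treat the way they did in Python 2.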

> Here's the iteration problem:
>>>> for line in page:
>     print(line)
> Traceback (most recent call last):
>   File "<pyshell#10>", line 1, in <module>
>     for line in page:
> TypeError: 'addinfourl' object is not iterable
> Why is this not iterable anymore?  Is this too a bug?  What the heck is an
> addinfourl object?
>
> 5.  Finally, I see that a bytes object has some of the same methods as
> strings.  But the error messages are confusing.
>>>> line
> b'   "http://www.w3.org/TR/html4/loose.dtd">\n'
>>>> line.find('www')
> Traceback (most recent call last):
>   File "<pyshell#18>", line 1, in <module>
>     line.find('www')
> TypeError: expected an object with the buffer interface
>>>> line.find(b'www')
> 11
> Why couldn't find take string as a parameter?

Once again, encoding. The bytes object doesn't know which encoding to
use for the string in order to do an apples-to-apples search of bytes.

> If folks have advice on which, if any, of these are bugs please let me know
> and I'll file them, and if possible work on fixes for them too.

While not a bug, adding iterator support wouldn't hurt. And for the
better TypeError messages, you could try submitting a patch that tacks
on something like "(e.g. bytes)", although I am not sure anyone else
would agree with that decision.
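
Until the object is iterable, one possible workaround is the
two-argument form of iter(), which only needs readline(). This is a
sketch; iter_lines is a hypothetical helper name, and io.BytesIO stands
in for the object returned by urlopen so that no network is needed.

```python
import io

def iter_lines(fileobj):
    """Return an iterator over lines of any object that has readline()."""
    # iter(callable, sentinel) keeps calling readline() until it returns
    # b'', which is what a binary file-like object gives at end of data.
    return iter(fileobj.readline, b'')

# Stand-in for the page object; a real one would come from urlopen().
page = io.BytesIO(b'first line\nsecond line\n')

# 'ascii' is an assumption for this sample data.
lines = [raw.decode('ascii') for raw in iter_lines(page)]
print(lines)
```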

> If you have advice on how I should better be teaching this new urllib that
> would be great to hear as well.

Probably the biggest issue will be having to explain string encoding.
Obviously you can gloss over it or provide students with a simple
library that just automatically converts the bytes to strings. Or even
better, provide some code for the standard library that can take the
HTML, figure out the encoding, and then return the decoded strings
(there might already be something for that which I am not aware of).
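
A rough sketch of such a helper, assuming a current Python 3 release
where the response works in a with statement and its headers offer
get_content_charset(). The names decode_page, fetch_text and
fetch_lines are hypothetical, the UTF-8 fallback is an assumption, and
the sketch only trusts the Content-Type header; it does not look for an
encoding declared inside the HTML itself.

```python
import urllib.request

def decode_page(raw, charset=None, default='utf-8'):
    """Decode raw bytes with the declared charset, else a default."""
    # errors='replace' keeps a class demo running on slightly bad data.
    return raw.decode(charset or default, errors='replace')

def fetch_text(url, default='utf-8'):
    """Fetch url and return the body as a str."""
    with urllib.request.urlopen(url) as response:
        # Reads the charset parameter of the Content-Type header;
        # returns None when the server does not declare one.
        charset = response.headers.get_content_charset()
        return decode_page(response.read(), charset, default)

def fetch_lines(url, default='utf-8'):
    """Fetch url and return its lines as a list of str."""
    return fetch_text(url, default).splitlines()
```

Students could then write `for line in fetch_lines(url):` and work with
plain strings, leaving the encoding discussion for later in the course.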

-Brett