[Python-Dev] teaching the new urllib

Wed Feb 4 00:50:44 CET 2009

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Brett Cannon wrote:
> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millbr02 at luther.edu> wrote:
>> I'm just getting ready to start the semester using my new book (Python
>> Programming in Context) and noticed that I somehow missed all the changes to
>> urllib in Python 3.0.  ARGH to say the least.  I like using urllib in the
>> intro class because we can get data from places that are more
>> interesting/motivating/relevant to the students.
>> Here are some of my observations on trying to do very basic stuff with
>> urllib:
>> 1.  urllib.urlopen  is now urllib.request.urlopen
> 
> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
> 3108 for the details of the reorganization.
> 
>> 2.  The object returned by urlopen is no longer iterable!  no more for line
>> in url.
> 
> That is probably a difference between urllib2 and urllib.
> 
>> 3.  read, readline, readlines now return bytes objects or lists of bytes
>> objects instead of a str or a list of str
> 
> Correct.
> 
>> 4.  Taking the naive approach to converting a bytes object to a str does not
>> work as you would expect.
>>
>> >>> import urllib.request
>> >>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>> >>> page
>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>> >>> line = page.readline()
>> >>> line
>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>> >>> str(line)
>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>> As you can see from the example the 'b' becomes part of the string!  It
>> seems like this should be a bug, is it?
>>
> 
> No, because you are getting back the repr of the bytes object. str()
> does not know what the encoding of the bytes is, so it has no way of
> performing the decoding.

The encoding information *is* available in the response headers, e.g.:

- ---------------------- %< ---------------------------------
$ wget -S --spider http://knuth.luther.edu/test.html
- --18:46:24--  http://knuth.luther.edu/test.html
           => `test.html'
Resolving knuth.luther.edu... 192.203.196.71
Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 03 Feb 2009 23:46:28 GMT
  Server: Apache/2.0.50 (Linux/SUSE)
  Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
  ETag: "2fcd8-1d8-43b2bf40"
  Accept-Ranges: bytes
  Content-Length: 472
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: text/html; charset=ISO-8859-1
Length: 472 [text/html]
200 OK
- ---------------------- %< ---------------------------------

So, the OP's use case *could* be satisfied, assuming that the Py3K
version of urllib sprouted a means of leveraging that header.  In this
sense, fetching the resource over HTTP is *better* than loading it from
a file:  information about the character set is explicit, and highly
likely to be correct, at least for any resource people expect to render
cleanly in a browser.
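
As a rough sketch of what leveraging that header could look like, the
snippet below parses a Content-Type value with the stdlib email package
and uses the charset to decode the body.  The helper names
(charset_from_content_type, decode_body) and the utf-8 fallback are my
own assumptions for illustration, not an existing urllib API:

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value.

    Hypothetical helper: falls back to `default` (an assumed choice)
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(raw, content_type, default="utf-8"):
    """Decode raw response bytes using the charset named in the header."""
    return raw.decode(charset_from_content_type(content_type, default))


# Header value and first line taken from the test page discussed above.
header = "text/html; charset=ISO-8859-1"
raw = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

text = decode_body(raw, header)
print(charset_from_content_type(header))
print(text, end="")
```

In current Python 3 releases the headers attribute of the object
returned by urlopen is itself an email.message.Message subclass, so
page.headers.get_content_charset() should give the same answer for a
live response without the hand-built Message.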

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJiNhU+gerLs4ltQ4RAjalAKC6BcbTIFjUIBg51IbVtSd8dZsoDACggw1O
+1Zlt7RlzdieQjoAw8AeScE=
=lvtX
-----END PGP SIGNATURE-----