[Python-Dev] teaching the new urllib
Tres Seaver
tseaver at palladion.com
Wed Feb 4 00:50:44 CET 2009
Brett Cannon wrote:
> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millbr02 at luther.edu> wrote:
>> I'm just getting ready to start the semester using my new book (Python
>> Programming in Context) and noticed that I somehow missed all the changes to
>> urllib in python 3.0. ARGH to say the least. I like using urllib in the
>> intro class because we can get data from places that are more
>> interesting/motivating/relevant to the students.
>> Here are some of my observations on trying to do very basic stuff with
>> urllib:
>> 1. urllib.urlopen is now urllib.request.urlopen
>
> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
> 3108 for the details of the reorganization.
>
>> 2. The object returned by urlopen is no longer iterable! no more for line
>> in url.
>
> That is probably a difference between urllib2 and urllib.
>
>> 3. read, readline, readlines now return bytes objects or arrays of bytes
>> instead of a str and array of str
>
> Correct.
>
>> 4. Taking the naive approach to converting a bytes object to a str does not
>> work as you would expect.
>>
>> >>> import urllib.request
>> >>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>> >>> page
>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>> >>> line = page.readline()
>> >>> line
>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>> >>> str(line)
>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>> As you can see from the example the 'b' becomes part of the string! It
>> seems like this should be a bug, is it?
>>
>
> No, because you are getting back the repr of the bytes object. str()
> does not know what the encoding of the bytes is, so it has no way of
> performing the decoding.
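To make the distinction concrete, here is a small sketch using the line from the session quoted above. It assumes the bytes are ISO-8859-1; the variable names are only for illustration.

```python
# The first line of the page, as returned by readline() in the quoted session.
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

# With no encoding given, str() falls back to the repr of the bytes object,
# which is why the leading b and the quotes end up inside the string.
as_repr = str(line)

# Naming an encoding (assumed here to be ISO-8859-1) performs a real decode.
as_text = line.decode("iso-8859-1")

# str() accepts the encoding as a second argument and does the same thing.
also_text = str(line, "iso-8859-1")
```

Either of the last two forms gives the text one would expect, but both require the caller to know the encoding.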
The encoding information *is* available in the response headers, e.g.:
---------------------- %< ---------------------------------
$ wget -S --spider http://knuth.luther.edu/test.html
--18:46:24-- http://knuth.luther.edu/test.html
           => `test.html'
Resolving knuth.luther.edu... 192.203.196.71
Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Tue, 03 Feb 2009 23:46:28 GMT
Server: Apache/2.0.50 (Linux/SUSE)
Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
ETag: "2fcd8-1d8-43b2bf40"
Accept-Ranges: bytes
Content-Length: 472
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=ISO-8859-1
Length: 472 [text/html]
200 OK
---------------------- %< ---------------------------------
So, the OP's use case *could* be satisfied, assuming that the Py3K
version of urllib sprouted a means of leveraging that header. In this
sense, fetching the resource over HTTP is *better* than loading it from
a file: information about the character set is explicit, and highly
likely to be correct, at least for any resource people expect to render
cleanly in a browser.
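
As a rough sketch of what leveraging that header could look like, the following uses only the standard library. The helper names (charset_from_content_type, decode_body, fetch_text) are hypothetical, not part of urllib, and the UTF-8 fallback for a missing charset is an assumption rather than anything the server promises.

```python
import urllib.request
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header does not carry one.
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode a bytes body using the charset from its Content-Type value."""
    return raw.decode(charset_from_content_type(content_type, default))


def fetch_text(url):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        return decode_body(response.read(), content_type)
```

With the header shown above, decode_body(body, "text/html; charset=ISO-8859-1") would decode the body as ISO-8859-1. A header that lies about the charset will still produce wrong text or a UnicodeDecodeError, so this is only as good as the server's configuration.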
Tres.
--
===================================================================
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com