urllib2 (py2.6) vs urllib.request (py3)

mattia gervaz at gmail.com
Tue Mar 17 11:08:45 EDT 2009


On Tue, 17 Mar 2009 10:55:21 +0000, R. David Murray wrote:

> mattia <gervaz at gmail.com> wrote:
>> Hi all, can you tell me why the module urllib.request (py3) adds extra
>> characters (b'fef\r\n and \r\n0\r\n\r\n') in a simple example like the
>> following, while urllib2 (py2.6) does not?
>> 
>> py2.6
>> >>> import urllib2
>> >>> f = urllib2.urlopen("http://www.google.com").read()
>> >>> fd = open("google26.html", "w")
>> >>> fd.write(f)
>> >>> fd.close()
>> 
>> py3
>> >>> import urllib.request
>> >>> f = urllib.request.urlopen("http://www.google.com").read()
>> >>> with open("google30.html", "w") as fd:
>> ...     print(f, file=fd)
>> ...
>> 
>> Opening the two HTML pages in Firefox, I get different results (the
>> extra characters mentioned earlier). Why?
> 
> The problem isn't a difference between urllib2 and urllib.request, it is
> between fd.write and print.  This produces the same result as your first
> example:
> 
> 
> >>> import urllib.request
> >>> f = urllib.request.urlopen("http://www.google.com").read()
> >>> with open("temp3.html", "wb") as fd:
> ...     fd.write(f)
> 
> 
> The "b'....'" is the stringified representation of a bytes object, which
> is what read() on a urllib.request response returns in python3.  Note the
> 'wb', which is a critical difference from the python2.6 case.  If you omit
> the 'b' in python3, it will complain that you can't write bytes to the
> file object.
> 
> The thing to keep in mind is that print converts its argument to string
> before writing it anywhere (that's the point of using it), and that
> bytes (or buffer) and string are very different types in python3.
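
The difference between print and write can be reproduced without touching
the network. The following is only a minimal sketch using in-memory files;
the sample bytes are a made-up stand-in for a real response body:

```python
import io

# Hypothetical stand-in for the bytes returned by urlopen(...).read()
data = b"<html>\r\n</html>"

# print() calls str() on its argument, so the repr of the bytes
# object (b'...' with escaped \r\n) ends up in the text file
text_file = io.StringIO()
print(data, file=text_file)
printed = text_file.getvalue()

# write() on a binary file stores the bytes unchanged
binary_file = io.BytesIO()
binary_file.write(data)
written = binary_file.getvalue()

# Writing bytes to a text-mode file is rejected with a TypeError
try:
    io.StringIO().write(data)
    text_write_error = None
except TypeError as exc:
    text_write_error = exc
```

Here printed holds the "b'...'" text followed by a newline, while written is
identical to the original bytes.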

To decode the response with the correct encoding, I've come up with this:
>>> response = urllib.request.urlopen("http://www.google.com")
>>> print(response.read().decode(response.headers.get_charsets()[0]))
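
One caveat: get_charsets()[0] is None when the server does not declare a
charset, and decode(None) then fails. A hedged sketch of a more defensive
variant follows; decode_body and the utf-8 fallback are my own assumptions,
not something from urllib, and an unknown charset name would still raise
LookupError:

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode raw bytes, using the (assumed) fallback if no charset is given."""
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(charset or fallback, errors="replace")

# Intended usage against a live response (not executed here):
#   response = urllib.request.urlopen("http://www.google.com")
#   charset = response.headers.get_content_charset()
#   print(decode_body(response.read(), charset))
```

get_content_charset() is inherited from email.message.Message and returns
None when the Content-Type header carries no charset parameter.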


