python3 urlopen(...).read() returns bytes

Carl Banks pavlovevidence at gmail.com
Mon Dec 22 23:18:44 CET 2008


On Dec 22, 3:41 pm, "Glenn G. Chappell" <glenn.chapp... at gmail.com>
wrote:
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
>
>     f = urllib.request.urlopen(url)
>     text = f.read()
>
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
>
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.

Well, I can't agree that it's an obvious bug (in Python 3).  It might
be something worth raising a warning over in 2to3.  It would also be a
reasonable wishlist item for automatic encoding detection and
conversion to a string (see below).  But it's not a bug.


> But apparently others disagree? This was mentioned in issue 3930
> (http://bugs.python.org/issue3930) back in September '08, but that
> issue is now closed, apparently because consistent behavior was
> achieved. But I figure consistently bad behavior is still bad.
>
> This change breaks pretty much every Python program that opens a
> webpage, doesn't it?

No.  What if someone is using urllib to retrieve (say) a JPEG image?
A bytes object is what they'd want in Python 3.  Also, many people
were already explicitly dealing with encodings in Python 2.5; the
change wouldn't affect them.


> 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?

It returns bytes because it doesn't know what encoding to use.  This
is the appropriate behavior.
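
To illustrate the two ways of dealing with the bytes, here is a
minimal sketch.  The "raw" value is a made-up stand-in for what
f.read() returns, and UTF-8 is an assumption about the page, not
something urllib tells you.  A regexp does work on bytes as long as
the pattern is bytes too; otherwise decode first with an encoding you
actually know.

```python
import re

# Hypothetical stand-in for the bytes that f.read() would return.
raw = b"<title>Caf\xc3\xa9</title>"

# Option 1: match directly on the bytes, using a bytes pattern.
m_bytes = re.search(rb"<title>(.*?)</title>", raw)
title_bytes = m_bytes.group(1)      # still a bytes object

# Option 2: decode with a known encoding (assumed UTF-8 here),
# then use an ordinary str pattern.
text = raw.decode("utf-8")
m_text = re.search(r"<title>(.*?)</title>", text)
title_text = m_text.group(1)        # a str
```

Note that a bare .decode() with no argument assumes UTF-8 in Python 3,
which is only right if the page really is UTF-8.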


HOWEVER... a web page request often does know what encoding to use,
since it ostensibly has to parse the headers.  IF a response's
"Content-Type" is text, and/or it carries a charset parameter, it
would be reasonable for urllib to have an option to automatically
decode the body and return a string instead of bytes.  (For all I
know, it already can do that.)
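
A rough sketch of what that could look like in user code, assuming a
reasonably recent Python 3: the headers object on the response is an
email-style message, so its get_content_charset() method can report
the charset from the Content-Type header.  The names decode_body and
fetch_text, and the UTF-8 fallback, are made up for this example; they
are not part of urllib.

```python
import urllib.request
from email.message import Message


def decode_body(body: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode body using the charset named in the Content-Type header.

    Falls back to `fallback` (an assumption, not a guarantee) when the
    header gives no charset.  Undecodable bytes are replaced, not raised.
    """
    charset = headers.get_content_charset() or fallback
    return body.decode(charset, errors="replace")


def fetch_text(url: str) -> str:
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        # response.headers is an http.client.HTTPMessage, a Message subclass.
        return decode_body(response.read(), response.headers)
```

This only makes sense for text responses; for something like a JPEG
you would still want the plain bytes from read().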


Carl Banks


