python3 urlopen(...).read() returns bytes
pavlovevidence at gmail.com
Mon Dec 22 23:18:44 CET 2008
On Dec 22, 3:41 pm, "Glenn G. Chappell" <glenn.chapp... at gmail.com> wrote:
> I just ran 2to3 on a py2.5 script that does pattern matching on the
> text of a web page. The resulting script crashed, because when I did
> f = urllib.request.urlopen(url)
> text = f.read()
> then "text" is a bytes object, not a string, and so I can't do a
> regexp on it.
> Of course, this is easy to patch: just do "f.read().decode()".
> However, it strikes me as an obvious bug, which ought to be fixed.
> That is, read() should return a string, as it did in py2.5.
Well, I can't agree that it's an obvious bug (in Python 3). It might
be worth having 2to3 raise a warning about it. Automatic encoding
detection and conversion to a string would also be a reasonable
wishlist item (see below). But it's not a bug.
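For the regexp problem in the quoted message, there are two workarounds in Python 3: decode the bytes and use an ordinary string pattern, or keep the bytes and use a bytes pattern. A minimal sketch follows; the page content is a made-up stand-in for what f.read() returns, and UTF-8 is an assumed encoding, not something the thread specifies.

```python
import re

# Hypothetical stand-in for the bytes that f.read() returns in Python 3.
page = b"<html><title>Example</title></html>"

# Option 1: decode first, then match with a str pattern.
# UTF-8 is an assumption here; the real page may use another encoding.
text = page.decode("utf-8")
title_str = re.search(r"<title>(.*?)</title>", text).group(1)

# Option 2: leave the data as bytes and match with a bytes pattern.
title_bytes = re.search(rb"<title>(.*?)</title>", page).group(1)
```

Mixing the two (a str pattern against a bytes object, or the reverse) raises TypeError, which is the crash described in the quoted message.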
> But apparently others disagree? This was mentioned in issue 3930
> (http://bugs.python.org/issue3930) back in September '08, but that
> issue is now closed, apparently because consistent behavior was
> achieved. But I figure consistently bad behavior is still bad.
> This change breaks pretty much every Python program that opens a
> webpage, doesn't it?
No. What if someone is using urllib to retrieve (say) a JPEG image? A
bytes object is what they'd want in Python 3. Also, many people were
already explicitly dealing with encodings in Python 2.5; the change
wouldn't affect them.
> 2to3 doesn't catch it, and, in any case, why
> should read() return bytes, not string? Am I missing something?
It returns bytes because it doesn't know what encoding to use. This
is the appropriate behavior.
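To illustrate why guessing is risky, here is a small sketch showing that the same bytes decode to different text under different encodings, and that a wrong guess does not always fail loudly. The sample string and the helper name try_decode are made up for illustration.

```python
# Bytes as they might arrive off the wire: "caf\u00e9" encoded as UTF-8.
raw = "caf\u00e9".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # the intended text
as_latin1 = raw.decode("latin-1")  # also succeeds, but gives the wrong text

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_ascii = try_decode(raw, "ascii")  # None: the bytes are not valid ASCII
```

Latin-1 accepts every byte value, so a wrong guess there silently produces garbage instead of an error. That is the kind of mistake read() avoids by handing back bytes.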
HOWEVER... a web page request often does know what encoding to use,
since urllib ostensibly has to parse the headers anyway. IF a
response's "Content-type" is text, and/or the "Content-encoding" is
given, it would be reasonable for urllib to have an option to automatically decode and
return a string instead of bytes. (For all I know, it already can do
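A sketch of what header-driven decoding can look like in user code, assuming a current Python 3. In practice the character set usually travels as a charset parameter of the Content-Type header, and get_content_charset() from the standard email.message.Message class (which urllib's response headers derive from) extracts it. The names decode_body and fetch_text are hypothetical helpers, and the UTF-8 fallback is an assumption, not something urllib does on its own.

```python
import urllib.request
from email.message import Message

def decode_body(body, content_type, fallback="utf-8"):
    """Decode bytes using the charset named in a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or fallback
    # errors="replace" keeps going on bad bytes instead of raising.
    return body.decode(charset, errors="replace")

def fetch_text(url, fallback="utf-8"):
    """Fetch a URL and return its body as str (hypothetical convenience wrapper)."""
    with urllib.request.urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        return decode_body(response.read(), content_type, fallback)

# Usage without touching the network:
text = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
```

Binary downloads such as the JPEG case mentioned earlier would skip fetch_text and keep calling read() directly.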