[Tutor] urllib confusion
Steven D'Aprano
steve at pearwood.info
Sat Nov 22 13:15:35 CET 2014
On Fri, Nov 21, 2014 at 01:37:45PM -0800, Clayton Kirkwood wrote:
> Got a general problem with url work. I've struggled through a lot of code
> which uses urllib.[parse,request]* and urllib2. First q: I read someplace in
> urllib documentation which makes it sound like either urllib or urllib2
> modules are being deprecated in 3.5. Don't know if it's only part or whole.
Can you point us to this place? I would be shocked and rather dismayed
to hear that urllib(2) was being deprecated, but it is possible that one
small component is being renamed/moved/deprecated.
> I've read through a lot that says that urllib..urlopen needs urlencode,
> and/or encode('utf-8') for byte conversion, but I've seen plenty of examples
> where nothing is being encoded either way. I also have a sneeking suspicious
> that urllib2 code does all of the encoding. I've read that if things aren't
> encoded that I will get TypeError, yet I've seen plenty of examples where
> there is no error and no encoding.
It's hard to comment on things you've read when we don't know what they
are or precisely what they say. "I read that..." is the equivalent of "a
man down the pub told me...".
If the examples are all ASCII, then no charset encoding is
needed, although urlencode will still perform percent-encoding:
py> from urllib.parse import urlencode
py> urlencode({"key": "<value>"})
'key=%3Cvalue%3E'
The characters '<' and '>' are not legal inside URLs, so they have to be
percent-encoded as '%3C' and '%3E'. Because all the characters are ASCII,
no charset encoding is needed; percent-encoding is the only change made.
Non-ASCII characters, on the other hand, are encoded into UTF-8 by
default, although you can pick another encoding and/or error handler:
py> urlencode({"key": "© 2014"})
'key=%C2%A9+2014'
The copyright symbol © encoded into UTF-8 is the two bytes
\xC2\xA9 which are then percent encoded into %C2%A9.
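To illustrate the "another encoding and/or error handler" part, here is a small sketch using the encoding and errors parameters of urlencode. The choice of Latin-1 and of the "replace" handler is only for demonstration, not a recommendation:

```python
from urllib.parse import urlencode

params = {"key": "© 2014"}

# Default: the text is encoded to UTF-8 before percent-encoding.
utf8_query = urlencode(params)

# Latin-1 represents the copyright sign as the single byte 0xA9.
latin1_query = urlencode(params, encoding="latin-1")

# With errors="replace", a character the codec cannot represent
# becomes "?", which is then percent-encoded as %3F.
ascii_query = urlencode(params, encoding="ascii", errors="replace")

print(utf8_query)    # key=%C2%A9+2014
print(latin1_query)  # key=%A9+2014
print(ascii_query)   # key=%3F+2014
```

Whichever encoding you pick, the server has to decode with the same one, which is a good reason to stay with the UTF-8 default unless you know otherwise.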
> Why do so many examples seem to not encode? And not get TypeError? And yes,
> for those of you who are about to suggest it, I have tried a lot of things
> and read for many hours.
One actual example is worth about a thousand vague descriptions.
But in general, I would expect that the urllib functions default to
using UTF-8 as the encoding, so you don't have to manually specify an
encoding; it just works.
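One possible source of the TypeError you read about, and this is a guess since we haven't seen your code: in Python 3, urlencode returns a str, but the data argument for a POST request has to be bytes. A minimal sketch, where the URL is a made-up placeholder and no request is actually sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"key": "© 2014"}

# urlencode returns a str made up of ASCII characters only.
query = urlencode(params)

# POST data must be bytes, so convert it. ASCII is enough here
# because percent-encoding has already removed any non-ASCII text.
body = query.encode("ascii")

# Hypothetical URL for illustration; building a Request does not
# touch the network.
req = Request("http://example.com/form", data=body)

print(type(query).__name__, type(body).__name__)  # str bytes
print(req.get_method())                           # POST
```

Examples that only fetch a page with a plain GET never pass data at all, which may be why many of them appear to do no encoding and still run without error.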
--
Steven