[Python-Dev] urllib.quote and unicode bug resuscitation attempt
mike at skew.org
Thu Jul 13 10:26:43 CEST 2006
Stefan Rank wrote:
> on 12.07.2006 07:53 Martin v. Löwis said the following:
> > Anthony Baxter wrote:
> >>> The right thing to do is IRIs.
> >> For 2.5, should we at least detect that it's unicode and raise a
> >> useful error?
> > That can certainly be done, sure.
> > Martin
> That would be great.
> And I agree that updating urllib.quote for unicode should be part of a
> grand plan that updates all of urllib and introduces an irilib /
> urischemes / uriparse module in 2.6 as Martin and John J Lee suggested.
Put me down as +1 on raising a useful error instead of a KeyError or whatever,
and +1 on having an irilib, but -1 on working toward accepting unicode in the
URI-oriented urllib.quote(), because (a.) user expectations for strings that
contain non-ASCII-range characters will vary, and (b.) percent-encoding is
supposed to only operate on a byte-encoded view of non-URI information, not
the information itself.
I, too, initially thought that quote() was outdated since it choked on unicode
input, but eventually I came to realize that it's wise to reject such input,
because to attempt to percent-encode characters, rather than bytes, reflects a
fundamental misunderstanding of the level at which percent-encoding is
intended to operate.
This is one of the hardest aspects of URI processing to grok, and I'm not
very good at explaining it, even though I've tried my best in the Wikipedia
articles. It's basically these 3 points:
1. A URI can only consist of characters from a limited set ('unreserved'
characters, 'reserved' characters, and '%' for percent-encoding), as I'm sure
you know. It's a specific set that has varied slightly over the years, and is
a subset of printable ASCII.
2. A URI scheme is essentially a mapping of non-URI information to a sequence
of URI characters. That is, it is a method of producing a URI from non-URI
information within a particular information domain ...and vice-versa.
3. A URI scheme should (though may not do so very clearly, especially the
older it is!) tell you that the way to represent a particular bit of non-URI
information, 'info', in a URI is to convert_to_bytes(info), and then, as per
STD 66, make the bytes that correspond, in ASCII, to unreserved characters
manifest as those characters, and all others manifest as their percent-encoded
equivalents. In urllib parlance, this step is 'quoting' the bytes.
3.1. [This isn't crucial to my argument, but has to be mentioned to complete
the explanation of percent-encoding.] In addition, those bytes corresponding,
in ASCII, to some 'reserved' characters are exempt from needing to be
percent-encoded, so long as they're not being used for their reserved purpose
(if any) in whatever URI component they're going into. Semantically, there's
no difference between such bytes when expressed in the URI as a literal
reserved character or as a percent-encoded byte. URI scheme specs vary greatly
in how they deal with this nuance. In any case, urllib.quote() has the 'safe'
argument which can be used to specify the exempt reserved characters.
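The 'quoting' step in points 3 and 3.1 can be sketched roughly as follows. This is a minimal illustration in Python 3 (where bytes plays the role of the Python 2 str discussed here), not urllib's actual implementation. The names UNRESERVED and percent_encode are hypothetical, and the unreserved set is assumed to be the one in STD 66 (letters, digits, '-', '.', '_', '~').

```python
# Byte values of the STD 66 'unreserved' characters (assumed set).
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(data, safe=b""):
    """Hypothetical helper: percent-encode a byte sequence.

    Bytes that correspond, in ASCII, to unreserved characters (or to
    characters listed in 'safe') appear literally; every other byte
    appears as %XX.
    """
    if not isinstance(data, bytes):
        # The input must already be a byte-encoded view of the information.
        raise TypeError("percent_encode() expects bytes, got %s"
                        % type(data).__name__)
    keep = UNRESERVED | frozenset(safe)
    # Iterating over bytes yields integers in Python 3.
    return "".join(chr(b) if b in keep else "%%%02X" % b for b in data)
```

The 'safe' parameter mirrors the exemption in 3.1: percent_encode(b"a/b") escapes the slash, while percent_encode(b"a/b", safe=b"/") leaves it literal.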
In the days when the specs that urllib was based on were relevant, 99% of the
time, the bytes being 'quoted' were ASCII-encoded strings representing ASCII
character-based non-URI information, so quite a few of us, including many URI
scheme authors, were tempted to think that what was being
'quoted'/percent-encoded *was* the original non-URI information, rather than a
bytewise view of it mandated by a URI scheme. That's what I was doing when I
thought that quote(some_unicode_path) should 'work', especially in light of
Python's "treat all strings alike" guideline. But if you accept all of the
above, which is what I believe the standard requires, then unicode input is a
very different situation from str input; it's unclear whether and how the
caller wants the input to be converted to bytes, if they even understand what
they're doing at all.
See, right now, quote('abc 123%') returns 'abc%20123%25', as you would expect.
Similarly, everyone would probably expect quote(u'abc 123%') to return
u'abc%20123%25', and if we were to implement that, there'd probably be no harm.
But look at quote('\xb7'), which, assuming you accept that everything I've said
above is correct, rightfully returns '%B7'. What would someone expect
quote(u'\xb7') to return? Some might want u'%B7' because they want the same
result type as the input they gave, with no other changes from how it would
normally be handled. Some might want u'%C2%B7' because they're conflating the
levels of abstraction and expect, say, UTF-8 conversion to be done on their
input. Some (like me) might want a TypeError or ValueError because we
shouldn't be handing such ambiguous data to quote() in the first place. And
then there's the u'\u0100'-and-up input to worry about; what does a user
expect to be done with that?
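To make the ambiguity concrete, here is a small sketch in Python 3, where str is the text type (the unicode of this thread) and urllib.parse.quote_from_bytes is the byte-oriented quoting function. The helper name quote_with is hypothetical. The same character yields different URI text depending on which byte encoding the caller had in mind, and some encodings cannot represent the character at all.

```python
from urllib.parse import quote_from_bytes


def quote_with(text, encoding):
    """Hypothetical helper: encode text to bytes explicitly, then quote."""
    return quote_from_bytes(text.encode(encoding))


middle_dot = "\xb7"   # written u'\xb7' in Python 2
a_macron = "\u0100"   # first character beyond the Latin-1 range

latin1_result = quote_with(middle_dot, "latin-1")  # one byte, 0xB7
utf8_result = quote_with(middle_dot, "utf-8")      # two bytes, 0xC2 0xB7

# For u'\u0100' and up, Latin-1 has no byte representation at all.
try:
    quote_with(a_macron, "latin-1")
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
```

Here latin1_result is '%B7' and utf8_result is '%C2%B7', which are exactly the two competing expectations described above. Neither is wrong in itself; the choice belongs to the URI scheme, not to the quoting function.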
I would prefer to see quote() always reject unicode input with a TypeError.
Alternatively, if it accepts unicode, it should produce unicode, and since it
can only reasonably assume what the user wants done with ASCII-range
characters, it should only accept input < u'\x80'.
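The two behaviours proposed above might look something like this. It is a sketch only, written for Python 3 (bytes standing in for Python 2's str, and str for unicode); the function names quote_bytes_only and quote_ascii_text are made up for illustration and are not part of urllib.

```python
from urllib.parse import quote_from_bytes


def quote_bytes_only(data, safe="/"):
    """Preferred option: reject text input outright with a TypeError."""
    if not isinstance(data, bytes):
        raise TypeError(
            "expected bytes; encode the text as the URI scheme requires "
            "before quoting (got %s)" % type(data).__name__
        )
    return quote_from_bytes(data, safe=safe)


def quote_ascii_text(text, safe="/"):
    """Alternative: accept text, but only characters below U+0080."""
    if not isinstance(text, str):
        raise TypeError("expected str, got %s" % type(text).__name__)
    try:
        raw = text.encode("ascii")
    except UnicodeEncodeError:
        # Beyond ASCII there is no way to guess the intended bytes.
        raise ValueError("non-ASCII character in input; "
                         "convert to bytes explicitly first") from None
    # Text in, text out: the result is a str of URI characters.
    return quote_from_bytes(raw, safe=safe)
```

Either way, the caller who has non-ASCII text is forced to pick an encoding explicitly, which is the point of the argument.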
In any case, quote() should be better documented to explain what it accepts
(a byte sequence), when it should be used (it is intended for the stage of URI
production where non-URI info, such as a unicode filesystem path, has already
been converted to bytes according to the requirements of a URI scheme, and now
needs to be represented as a URI-safe character sequence), and exactly what it
produces (a str representing URI characters).