[Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

Nick Coghlan ncoghlan at gmail.com
Wed Mar 16 13:02:21 CET 2011


On Tue, Mar 15, 2011 at 11:34 PM, Guido van Rossum <guido at python.org> wrote:
>
> Can you be specific? What is different between those RFCs?

I finally got around to trying to backport some of the additional
urljoin tests from http://bugs.python.org/issue1500504 (specifically,
the additional ones Mike Brown provided), but got tripped up by the
behavioural changes between the earlier RFCs and RFC 3986 regarding
the way ".." is handled.
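
To make the ".." difference concrete, here is a minimal sketch of the
"remove_dot_segments" step from RFC 3986, section 5.2.4. The function
name is taken from the RFC's pseudocode; this is an illustration only,
not the code in urllib.parse. For the base "http://a/b/c/d;p?q" and the
reference "../../../g", RFC 3986 gives "http://a/g", whereas RFC 1808
and RFC 2396 list "http://a/../g" for the same case.

```python
def remove_dot_segments(path):
    """Illustrative sketch of RFC 3986, section 5.2.4 (not stdlib code)."""
    output = []
    while path:
        # Step A: drop a leading "../" or "./" prefix.
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: collapse "/./" (or a trailing "/.") to "/".
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: collapse "/../" (or a trailing "/..") to "/" and
        # discard the last segment already written to the output.
        # If nothing has been written yet, the ".." is simply dropped.
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a lone "." or ".." is removed.
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (with its leading "/") to output.
        else:
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


# "/b/c/" merged with "../../../g": the excess ".." is discarded.
print(remove_dot_segments("/b/c/../../../g"))    # /g
print(remove_dot_segments("/a/b/c/./../../g"))   # /a/g
```

The "if output" checks in step C are where the excess ".." segments are
discarded rather than preserved in the result.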

Even in test_urlparse, a bunch of the normative tests from RFC 3986
are commented out because they fail (by design) when run through
urllib.parse.urljoin. Some of the additional tests also fail because
our urljoin implementation has a whitelist of schemes that support
relative references, whereas RFC 3986 expects relative references to
work for unknown schemes as well.
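
As a rough illustration of the whitelist, the sketch below looks at the
module-level list urllib.parse.uses_relative. That list is an
undocumented implementation detail and may change; the helper
"supports_relative" and the "x-unknown" scheme are hypothetical names
used only for this example.

```python
import urllib.parse


def supports_relative(scheme):
    # Hypothetical helper: checks the whitelist that urljoin consults.
    # uses_relative is an undocumented implementation detail.
    return scheme in urllib.parse.uses_relative


for scheme in ("http", "x-unknown"):
    print(scheme, supports_relative(scheme))

# Under RFC 3986 both joins below should resolve the same way; print
# the results to see how the installed version treats each scheme.
print(urllib.parse.urljoin("http://a/b/c/d", "g"))
print(urllib.parse.urljoin("x-unknown://a/b/c/d", "g"))
```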

There are actually quite a few more terminology changes as well (as
Senthil pointed out in his email), but it was specifically the failing
test cases for urljoin semantics that bit me again yesterday.

The problem is that it is quite a lot of work to get fully general URI
parsing to work correctly, but the overlap with legacy URL parsing is
large enough that many (most?) use cases in practice work just fine
with the older RFC semantics.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

