[Python-Dev] urlparse.urlunsplit should be smarter about +

Stephen J. Turnbull stephen at xemacs.org
Sun May 9 14:15:38 CEST 2010


John Arbash Meinel writes:
 > Stephen J. Turnbull wrote:
 > > David Abrahams writes:
 > >  > 
 > >  > This is a bug report.  bugs.python.org seems to be down.
 > >  > 
 > >  >   >>> from urlparse import *
 > >  >   >>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
 > >  >   git+file:/foo/bar/baz
 > >  > 
 > >  > Note the dropped slashes after the colon.
 > > 
 > > That's clearly wrong, but what does "+" have to do with it?  AFAIK,
 > > the only thing special about + in scheme names is that it's not
 > > allowed as the first character.
 > 
 > Don't you need to register the "git+file:///" url for urlparse to
 > properly split it?
 > 
 >     if protocol not in urlparse.uses_netloc:
 >         urlparse.uses_netloc.append(protocol)

I don't know about the urlparse implementation, but from the point of
view of the RFC I think not.  Either BCP 35 or RFC 3986 (or maybe both)
makes it plain that if the scheme name is followed by "://", the
scheme is a hierarchical one.  So that URL should parse with an empty
authority, and be recomposed the same.  I would do this by parsing
'git+file:///foo/bar/baz' to ('git+file', '', '/foo/bar/baz') or
something like that, and 'git+file:/foo/bar/baz' to ('git+file', None,
'/foo/bar/baz').
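
A minimal sketch of that parse, assuming Python 3 and the generic
regular expression given in RFC 3986, Appendix B.  The name split_uri
is hypothetical and is not part of urlparse; the query and fragment
components are ignored here for brevity.

```python
import re

# Generic URI regular expression from RFC 3986, Appendix B.
# Group 2 = scheme, group 4 = authority, group 5 = path.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri(uri):
    """Hypothetical helper: return (scheme, authority, path).

    authority is '' when "//" is present but nothing follows it,
    and None when there is no "//" at all.
    """
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)

print(split_uri('git+file:///foo/bar/baz'))
print(split_uri('git+file:/foo/bar/baz'))
```

The optional "//" group is what keeps the two cases apart: when it does
not participate in the match, the authority comes back as None rather
than as an empty string.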

I don't see any reason why implementations should abbreviate the empty
authority by removing the double slashes, unless specified in the
scheme definition.  Although my reading of RFC 3986 is that a missing
authority (no "//") *should* be dereferenced in the same way as an
empty one:

    If the URI scheme defines a default for host, then that default
    applies when the host subcomponent is undefined or when the
    registered name is empty (zero length).  (Sec. 3.2.2)

I don't see why urlparse should try to enforce that by converting from
one to the other.
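
Recomposition without that conversion could look roughly like the
following, which follows the rule in RFC 3986, Sec. 5.3 of emitting
"//" whenever the authority is defined, even if it is empty.  This is
only a sketch in Python 3; unsplit_uri is a hypothetical name, not the
urlparse API, and query and fragment are left out.

```python
def unsplit_uri(scheme, authority, path):
    """Hypothetical helper: recompose a URI from three components.

    None means "component absent"; '' means "present but empty".
    """
    parts = []
    if scheme is not None:
        parts.append(scheme + ':')
    # Emit "//" for any defined authority, including the empty one,
    # so an empty authority is never collapsed into a missing one.
    if authority is not None:
        parts.append('//' + authority)
    parts.append(path)
    return ''.join(parts)

print(unsplit_uri('git+file', '', '/foo/bar/baz'))
print(unsplit_uri('git+file', None, '/foo/bar/baz'))
```

With this rule both spellings of the URL survive a split/unsplit round
trip unchanged, and nothing needs to be registered per scheme.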

