[Python-Dev] urlparse.urlunsplit should be smarter about +
Stephen J. Turnbull
stephen at xemacs.org
Sun May 9 14:15:38 CEST 2010
John Arbash Meinel writes:
> Stephen J. Turnbull wrote:
> > David Abrahams writes:
> > >
> > > This is a bug report. bugs.python.org seems to be down.
> > >
> > > >>> from urlparse import *
> > > >>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
> > > git+file:/foo/bar/baz
> > >
> > > Note the dropped slashes after the colon.
> >
> > That's clearly wrong, but what does "+" have to do with it? AFAIK,
> > the only thing special about + in scheme names is that it's not
> > allowed as the first character.
>
> Don't you need to register the "git+file:///" url for urlparse to
> properly split it?
>
> if protocol not in urlparse.uses_netloc:
>     urlparse.uses_netloc.append(protocol)
I don't know about the urlparse implementation, but from the point of
view of the RFC I think not. Either BCP 35 or RFC 3986 (or maybe both)
makes it plain that if the scheme name is followed by "://", the
scheme is a hierarchical one. So that URL should parse with an empty
authority, and be recomposed the same. I would do this by parsing
'git+file:///foo/bar/baz' to ('git+file', '', '/foo/bar/baz') or
something like that, and 'git+file:/foo/bar/baz' to ('git+file', None,
'/foo/bar/baz').
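
To make the distinction concrete, here is a minimal sketch of such a
parse. It is not what urlparse does; it uses the regular expression
given in RFC 3986, Appendix B, and the helper names split_uri and
unsplit_uri are hypothetical. The sketch keeps only scheme, authority
and path, so a query or fragment would be dropped on recomposition.

```python
import re

# Regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri(uri):
    """Hypothetical helper: return (scheme, authority, path).

    authority is '' when "//" is present but nothing follows it,
    and None when there is no "//" at all.
    """
    m = _URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)


def unsplit_uri(parts):
    """Hypothetical helper: recompose what split_uri returned."""
    scheme, authority, path = parts
    out = ''
    if scheme is not None:
        out += scheme + ':'
    # Emit "//" only if it was there originally, even if empty.
    if authority is not None:
        out += '//' + authority
    return out + path


for uri in ('git+file:///foo/bar/baz', 'git+file:/foo/bar/baz'):
    parts = split_uri(uri)
    print(parts, unsplit_uri(parts) == uri)
```

Because the empty and the missing authority are kept distinct, both
forms are recomposed unchanged, and no per-scheme registration is
involved.
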
I don't see any reason why implementations should abbreviate the empty
authority by removing the double slashes, unless specified in the
scheme definition. Although my reading of RFC 3986 is that a missing
authority (no "//") *should* be dereferenced in the same way as an
empty one:

    If the URI scheme defines a default for host, then that default
    applies when the host subcomponent is undefined or when the
    registered name is empty (zero length). (Sec. 3.2.2)

I don't see why urlparse should try to enforce that by converting from
one to the other.