urlparse.urlunsplit should be smarter about +

This is a bug report. bugs.python.org seems to be down.
>>> from urlparse import *
>>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
'git+file:/foo/bar/baz'
Note the dropped slashes after the colon.

On Sat, May 8, 2010 at 8:19 AM, David Abrahams dave@boostpro.com wrote:
This is a bug report. bugs.python.org seems to be down.
Tracked here: http://bugs.python.org/issue8656
>>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
Is 'git+file' a valid protocol? Or was it just your example? I don't see any reason for it to be invalid but I don't find authoritative references either.

On Fri, May 7, 2010 at 21:04, Senthil Kumaran orsenthil@gmail.com wrote:
Is 'git+file' a valid protocol? Or was it just your example? I don't see any reason for it to be invalid but I don't find authoritative references either.
RFC 3986 is pretty clear on allowing '+' in scheme names, in principle if not necessarily in practice: http://tools.ietf.org/html/rfc3986#section-3.1

David Abrahams writes:
This is a bug report. bugs.python.org seems to be down.
>>> from urlparse import *
>>> urlunsplit(urlsplit('git+file:///foo/bar/baz'))
'git+file:/foo/bar/baz'
Note the dropped slashes after the colon.
That's clearly wrong, but what does "+" have to do with it? AFAIK, the only thing special about + in scheme names is that it's not allowed as the first character.
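For reference, the scheme grammar in RFC 3986, section 3.1 is one letter followed by any mix of letters, digits, "+", "-" and ".". A minimal sketch of that check follows; the helper name is_valid_scheme is made up for illustration and is not part of urlparse.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*')

def is_valid_scheme(scheme):
    """Hypothetical helper: True if `scheme` matches the RFC 3986 grammar."""
    return SCHEME_RE.fullmatch(scheme) is not None

print(is_valid_scheme('git+file'))  # True: '+' is allowed after the first character
print(is_valid_scheme('+file'))     # False: the first character must be a letter
```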

Stephen J. Turnbull wrote:
That's clearly wrong, but what does "+" have to do with it? AFAIK, the only thing special about + in scheme names is that it's not allowed as the first character.
Don't you need to register the "git+file:///" url for urlparse to properly split it?
    if protocol not in urlparse.uses_netloc:
        urlparse.uses_netloc.append(protocol)
John =:->
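A minimal sketch of the registration workaround described in this message, written for Python 3, where the module lives at urllib.parse (the thread itself uses the Python 2 urlparse module). The scheme name and URL are the ones from the bug report; nothing here is a proposed fix.

```python
# Python 3 spelling; in the Python 2 of this thread the module is `urlparse`.
from urllib.parse import urlsplit, urlunsplit, uses_netloc

scheme = 'git+file'
if scheme not in uses_netloc:
    # uses_netloc is a module-level list, so this affects the whole process.
    uses_netloc.append(scheme)

url = 'git+file:///foo/bar/baz'
round_tripped = urlunsplit(urlsplit(url))
print(round_tripped)  # git+file:///foo/bar/baz
```

Because the list is global state, every caller in the process sees the registration, which is part of why the thread questions this design.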

John Arbash Meinel writes:
Don't you need to register the "git+file:///" url for urlparse to properly split it?
    if protocol not in urlparse.uses_netloc:
        urlparse.uses_netloc.append(protocol)
I don't know about the urlparse implementation, but from the point of view of the RFC I think not. Either BCP 35 or RFC 3986 (or maybe both) makes it plain that if the scheme name is followed by "://", the scheme is a hierarchical one. So that URL should parse with an empty authority, and be recomposed the same. I would do this by parsing 'git+file:///foo/bar/baz' to ('git+file', '', '/foo/bar/baz') or something like that, and 'git+file:/foo/bar/baz' to ('git+file', None, '/foo/bar/baz').
I don't see any reason why implementations should abbreviate the empty authority by removing the double slashes, unless specified in the scheme definition. Although my reading of RFC 3986 is that a missing authority (no "//") *should* be dereferenced in the same way as an empty one:
"If the URI scheme defines a default for host, then that default applies when the host subcomponent is undefined or when the registered name is empty (zero length)." (Sec. 3.2.2)
I don't see why urlparse should try to enforce that by converting from one to the other.
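The parse suggested above can be sketched with the regular expression given in RFC 3986, Appendix B. This is only an illustration of the empty-versus-missing distinction; split_uri and unsplit_uri are hypothetical names, not urlparse functions, and query and fragment are ignored for brevity.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Return (scheme, authority, path); authority is None when '//' is absent."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)

def unsplit_uri(scheme, authority, path):
    """Recompose the parts, emitting '//' only if an authority was parsed."""
    out = ''
    if scheme is not None:
        out += scheme + ':'
    if authority is not None:
        out += '//' + authority
    return out + path

print(split_uri('git+file:///foo/bar/baz'))  # ('git+file', '', '/foo/bar/baz')
print(split_uri('git+file:/foo/bar/baz'))    # ('git+file', None, '/foo/bar/baz')
```

With this representation both URLs round-trip unchanged, and no scheme registry is consulted.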

At Sat, 08 May 2010 11:04:47 -0500, John Arbash Meinel wrote:
Don't you need to register the "git+file:///" url for urlparse to properly split it?
Yes. But the question is whether urlparse should really be so fragile that every hierarchical scheme needs to be explicitly registered. Surely ending with “+file” should be sufficient to have it recognized as a file-based scheme.

On Sun, May 09, 2010 at 03:19:40PM -0600, David Abrahams wrote:
Yes. But the question is whether urlparse should really be so fragile that every hierarchical scheme needs to be explicitly registered. Surely ending with “+file” should be sufficient to have it recognized as a file-based scheme
How do you figure?

On Sun, May 09, 2010 at 03:19:40PM -0600, David Abrahams wrote:
John Arbash Meinel wrote:
Don't you need to register the "git+file:///" url for urlparse to properly split it?
Yes. But the question is whether urlparse should really be so fragile that every hierarchical scheme needs to be explicitly registered. Surely ending with “+file” should be sufficient to have it recognized as a file-based scheme
Not all URLs have an 'authority' component after the scheme (SIP URLs, for example). urlparse distinguishes them by maintaining a list of scheme names whose URLs have a netloc (authority component), and it applies the netloc pattern of parsing and joining only to those schemes. This is in general according to RFC 3986 itself.
Yes, '+' is a valid character in URL schemes, and svn and svn+ssh will behave as you expect. But git and git+ssh were missing from that list, and I attached a patch to issue8657 to include them. That is rightly a bug in the module. But assuming that, for any general scheme, a '+file' suffix implies a valid authority component is not something I am sure should be part of urlparse's expected behavior.

Senthil Kumaran writes:
Not all urls have the 'authority' component after the scheme. (sip based urls for e.g) urlparse differentiates those by maintaining a list of scheme names which will follow the pattern of parsing, and joining for the urls which have a netloc (or authority component). This is in general according to RFC 3986 itself.
This is actually quite at variance with the RFC. The grammar in section 3 doesn't make any reference to schemes as being significant in parsing. Whether an authority component is to be parsed or not is entirely dependent on the presence or absence of the "//" delimiter following the scheme and its colon delimiter. AFAICS, if the "//" delimiter is present, an authority component (possibly empty) *must* be present in the parse. Presumably an unparse should then include that empty component in the generated URI (i.e., a "scheme:///..." URI).
Thus, it seems that by the RFC, regardless of any registration,
urlparse.urlunsplit(urlparse.urlsplit('git+file:///foo/bar'))
should produce 'git+file:///foo/bar' (or perhaps raise an error in "validation" mode). The only question is whether registration of 'git+file' as a uses_netloc scheme should force
urlparse.urlunsplit(urlparse.urlsplit('git+file:/foo/bar'))
to return 'git+file:///foo/bar', or whether 'git+file:/foo/bar' would be acceptable (or better).
None of what I wrote here or elsewhere takes account of backward compatibility, it is true. I'm only talking about the letter of the RFC.

On Mon, May 10, 2010 at 05:11:12PM +0900, Stephen J. Turnbull wrote:
This is actually quite at variance with the RFC. The grammar in section
I should have said that the 'treatment of URLs with authority' and the 'treatment of URLs without authority', in terms of parsing and joining, is as per the RFC. In practice, urlparse does this by maintaining a list of known scheme names that use a netloc (uses_netloc).

Senthil Kumaran writes:
I should have said that the 'treatment of URLs with authority' and the 'treatment of URLs without authority', in terms of parsing and joining, is as per the RFC. In practice, urlparse does this by maintaining a list of known scheme names that use a netloc (uses_netloc).
Why do that if you can get better behavior based purely on syntactic analysis?

On Mon, May 10, 2010 at 05:56:29PM +0900, Stephen J. Turnbull wrote:
Why do that if you can get better behavior based purely on syntactic analysis?
For plain parsing and splitting, purely syntactic behaviour is fine. I agree with your comments and your restatement of the RFC rules in the previous emails.
The problem, as we know, comes when unparsing and joining. (I have also not yet looked at the relative URL joining behaviour, where redundant /'s can be ignored.)
As you may already know, when the data is
ParseResult(scheme='file', netloc='', path='/tmp/junk.txt', params='', query='', fragment='')
you might expect the output to be file:///tmp/junk.txt, and the original may well have been the same.
But for: ParseResult(scheme='x', netloc='', path='/y', params='', query='', fragment='')
One can expect a valid output to be: x:/y
Your suggestion of differentiating netloc/authority by '' and None seems a good one to analyze.
Also, by keeping a registry of valid schemes, are you not proposing something very similar to uses_netloc? But with a different API to handle parsing based on registry values. Is my understanding of your proposal correct?
FWIW, I looked at the history of the uses_netloc list, and it seems that it has been there from the first version, when the urlparse module followed different RFC specs for different protocols (telnet, sip etc.), so any changes should be incorporated carefully so as not to break existing solutions.
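The two cases above can be reproduced directly. The sketch below uses the Python 3 module urllib.parse (urlparse in the Python 2 of this thread) and 5-tuples for urlunsplit rather than ParseResult. Both inputs carry netloc='', so the only thing that decides whether "//" is emitted is whether the scheme appears in uses_netloc.

```python
from urllib.parse import urlunsplit

# Same empty netloc in both tuples; only the scheme differs.
file_url = urlunsplit(('file', '', '/tmp/junk.txt', '', ''))
x_url = urlunsplit(('x', '', '/y', '', ''))

print(file_url)  # file:///tmp/junk.txt  ('file' is listed in uses_netloc)
print(x_url)     # x:/y                  ('x' is not)
```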

David Abrahams writes:
At Sat, 08 May 2010 11:04:47 -0500, John Arbash Meinel wrote:
Don't you need to register the "git+file:///" url for urlparse to properly split it?
Yes. But the question is whether urlparse should really be so fragile that every hierarchical scheme needs to be explicitly registered.
Exactly. And the answer is "no". The RFCs are quite clear that hierarchical schemes are expected to be extremely common, and provide several requirements for how they should be parsed, even by nonvalidating parsers.
It's pretty clear to me that
urlunsplit(urlsplit('git+file:///foo/bar/baz'))
should be the identity. The remaining question is, "Should
urlunsplit(urlsplit('git+file:/foo/bar/baz'))
be the identity?" I would argue that if git+file is *not* registered, it should be the identity, while there should be an optional registry of schemes which may (or perhaps should) be canonicalized (ie, a *missing* authority would be unsplit as an *empty* authority).
Surely ending with "+file" should be sufficient to have it recognized as a file-based scheme
What's a "file-based scheme"? If you mean an RFC 3986 hierarchical scheme, that is recognized by the presence of the authority portion, which is syntactically defined by the presence of "//" immediately after the ":" that terminates the scheme. No need for any implicit magic.
In general, EIBTI applies here. If a registry as described above is implemented, I would argue that canonicalization should not happen implicitly. Nor should validation (eg, error or warning on a URI with a scheme registered as hierarchical but lacking authority, or vice versa). The API should require an explicit statement from the user to invoke those functionalities. It might be useful for the OOWTDI API to canonicalize/validate by default (especially given XSS attacks and the like), but it should be simple for consenting adults to turn that feature off.
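One way to read this proposal is sketched below, purely as an illustration. The registry name, the unsplit function and the canonicalize flag are all hypothetical and do not exist in urlparse. A missing authority is represented as None, and it is rewritten to an empty one only when the caller asks for it and the scheme is registered.

```python
# Hypothetical registry of schemes whose missing authority may be
# canonicalized to an empty one; the contents are illustrative only.
CANONICALIZE_AUTHORITY = {'file', 'git+file'}

def unsplit(scheme, authority, path, canonicalize=False):
    """Recompose a URI; authority=None means no '//' was present."""
    if canonicalize and authority is None and scheme in CANONICALIZE_AUTHORITY:
        authority = ''  # missing authority becomes an empty one
    prefix = scheme + ':'
    if authority is not None:
        prefix += '//' + authority
    return prefix + path

print(unsplit('git+file', None, '/foo/bar/baz'))                     # git+file:/foo/bar/baz
print(unsplit('git+file', None, '/foo/bar/baz', canonicalize=True))  # git+file:///foo/bar/baz
print(unsplit('x', None, '/y', canonicalize=True))                   # x:/y
```

The flag defaults to off here to mirror the "explicit statement from the user" requirement; flipping the default would match the alternative mentioned for a canonicalize-by-default API.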

Participants (6): David Abrahams, David Borowitz, John Arbash Meinel, Jon Ribbens, Senthil Kumaran, Stephen J. Turnbull