
"Martin v. Löwis" wrote:
> > If RFC 1808 applies (the current implementation is based largely on this spec, which did not clearly distinguish between a reference and a URI), it is what is defined in the grammar as a URL, and if it is relative (relativeURL in the grammar), it is considered to be relative to a default base URL.
This is troublesome. What is a meaningful base URL? This should be mentioned prominently.
In effect, this is what happens in the current implementation, but I don't think anyone ever intended it to be thought of as standards-based resolution to absolute form against a base URL, and in any event, it's not as well documented as it should be.
User expectation in most contexts, even where it doesn't apply (as in the most prominent use of relative references, HTML/XML document processing), is that relative references are relative to a base that has something to do with the current working directory of the URL processor. Wrong as that assumption often is, in the case of urlopen() we have no context that would define a base URL. The documented precedent is that the 'file' scheme is assumed, and the implementation, IIRC, runs the relative path through url2pathname, which does very little to it, and then passes it right to open(), so in effect the current working directory is assumed.
For the sake of having a sane policy going forward, I would rather see the behavior expressed in terms governed by standards, which is what I attempted to do. Luckily, the current behavior makes that possible.
There is an issue, though: if disallowed or non-ASCII characters or bytes are in the urlopen() argument, and it's a relative URL, then right now the implementation (I think) passes those characters or bytes through unchanged to the open() call. So if we do anything to them beforehand, such as %-encoding them as I think you were suggesting (see previous email), then for compatibility we'd have to specify that we %-decode them again in a way that results in the original characters/bytes being passed to open().
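
To make the two points above concrete, here is a minimal sketch, not the urllib implementation under discussion. It assumes the modern Python 3 `urllib.parse` module (whose `urljoin` follows RFC 3986 rather than RFC 1808), and the helper names `cwd_base_url`, `resolve`, and `encode_then_decode` are hypothetical, introduced only for illustration. It shows a relative reference being resolved against a `file:` base derived from the current working directory, and a %-encode/%-decode round trip that hands the original characters back.

```python
from pathlib import Path
from urllib.parse import quote, unquote, urljoin


def cwd_base_url():
    """Hypothetical helper: a file: URL for the current working directory."""
    base = Path.cwd().as_uri()
    # A trailing slash makes urljoin treat the base as a directory,
    # so relative references are resolved inside it.
    return base if base.endswith("/") else base + "/"


def resolve(reference, base):
    """Resolve a URI reference against an explicit base URL."""
    # Absolute references come back unchanged; relative ones are merged
    # with the base path.
    return urljoin(base, reference)


def encode_then_decode(raw_path):
    """%-encode a path, then %-decode it again.

    Returns (encoded, decoded). For compatibility, decoded should equal
    the original string that would have been passed to open().
    """
    encoded = quote(raw_path)  # non-ASCII becomes UTF-8 %-escapes, '/' is kept
    decoded = unquote(encoded)
    return encoded, decoded


if __name__ == "__main__":
    print(resolve("data/report.txt", cwd_base_url()))
    print(encode_then_decode("dir/résumé file.txt"))
```

Whether UTF-8 is the right encoding for the %-escapes is itself an assumption here; the round trip only preserves the original characters if encoding and decoding agree on it.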
(This mostly describes current behavior, assuming we can reach agreement that the "C:" in the example above should be treated no differently than "C|").
I have no problem with that. There are no one-letter URL schemata, are there?
There aren't, although in principle I wish the API weren't lenient; people would quickly learn that C:\x\y\z is not a URL, that C:/x/y/z can only be interpreted by the standards in one way (the one they probably don't want), and that what they really need to do is learn to use file:///blahblahblah.
In 4Suite's Ft.Lib.Uri we needed to conduct strictly conformant processing of URI references in our DOM, XPath, XSLT, and HTTP implementations. I found that we could use urllib for hardly anything of this sort without a great deal of working around, or closing up, the holes opened by all these 'conveniences'. Tightening up the conformance issues meant that we needed to help users produce valid URIs from filesystem paths and vice versa. Once again, the core Python libs were of little use -- pathname2url and url2pathname are platform-dependent, and are so full of bugs^H^H^H^Hfeatures that I had to start from scratch and roll my own functions. I think what I've got at this point would make great additions to urllib2, but I'll save them for another day...
At least with all the "OKs" you've given so far, I can submit a patch or three to get some of the documentation updated.
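
As a rough illustration of the drive-letter question, the sketch below treats a leading "C:" or "C|" the same way and turns such a path into a file: URI. It is not the Ft.Lib.Uri code mentioned above, and it is not what urllib does; the names `is_drive_path` and `drive_path_to_file_uri` are hypothetical, and the rule "one letter followed by ':' or '|' and then a slash" is an assumption made for the example.

```python
import re
from urllib.parse import quote

# One ASCII letter, then ':' or '|', then a slash, a backslash, or end of string.
# This pattern is an illustrative assumption, not a rule taken from any RFC.
_DRIVE = re.compile(r"^([A-Za-z])[:|](?=[\\/]|$)")


def is_drive_path(arg):
    """True if arg starts like a DOS/Windows drive spec rather than a URL scheme."""
    return _DRIVE.match(arg) is not None


def drive_path_to_file_uri(path):
    """Convert an absolute drive-letter path to a file: URI (illustrative only)."""
    match = _DRIVE.match(path)
    if match is None:
        raise ValueError("not an absolute drive-letter path: %r" % path)
    # Normalize separators, then %-encode everything except '/'.
    rest = path[match.end():].replace("\\", "/")
    return "file:///" + match.group(1) + ":" + quote(rest)


if __name__ == "__main__":
    for arg in (r"C:\x\y\z", "C|/x/y/z", "http://example.org/"):
        print(arg, is_drive_path(arg))
    print(drive_path_to_file_uri(r"C:\My Docs\a.txt"))
```

A strict API would reject the first two arguments outright; the lenient reading sketched here is only workable because no one-letter URL schemes are registered.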
I must attend to other things right now; will comment on the other issues later.
Take your time. This has been sitting around for many releases - one more or less doesn't matter much in the global flow of things :-)
Heh, agreed. I wish rfc2396bis and IRIs would hurry on through the IETF's machinery. I've only been actively paying attention to the former, but they both have a lot going for them.