Re: URL processing conformance and principles (was Re: urllib.urlopen...)

{I hit send too early; here is the rest}
Mike Brown wrote:
Right. This part of the thread was just about how the argument to urllib.urlopen() should be handled when given as unicode vs str. You seemed to be saying it should be str because a URI is fundamentally bytes and should be analyzed as such, whereas I'm saying no, a URI is fundamentally characters and should be analyzed as such. I mentioned %-encoding and the quirk of the BNF just because those are aspects of the syntax that are byte-oriented and are the source of much confusion, and because they may have influenced your assertion.
Are we in agreement on these points?
I think I have to answer "no". The % notation is not a quirk of the BNF. I.e. when the BNF states that a URI contains %AC (say), this does *not* mean that the actual URI in-memory-or-on-the-wire contains the byte \xAC. The spec actually says that the URI, in memory, on the wire, or on paper, contains the three characters '%', 'A', and 'C'. So usage of that escape mechanism is *not* a result of the BNF notation; it follows from the inherent desire that URIs contain only ASCII characters. URIs that are meant to carry non-ASCII characters have to escape them "somehow", typically using the % notation.
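To make the distinction concrete, here is a minimal sketch. It assumes modern Python 3, where these helpers live in urllib.parse (the thread itself is about the Python 2 urllib). The URI text holds three ASCII characters; the byte \xAC only appears once something unescapes it.

```python
from urllib.parse import quote, unquote_to_bytes

# The escape as it appears in the URI text: three ASCII characters.
escape = "%AC"
assert list(escape) == ["%", "A", "C"]

# Only unescaping produces the single byte 0xAC.
decoded = unquote_to_bytes(escape)

# Escaping the raw byte yields the three-character form again.
escaped = quote(b"\xac")
```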
- A URL/URI consists of a finite sequence of Unicode characters;
No. A URI consists of a finite sequence of characters. Whether they are Unicode or not is not specified. The assumption certainly is that if the characters are coded (i.e. assigned to numbers), those numbers don't have to match Unicode code points at all. A URI that consists of KOI8-R characters would very well be possible.
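As an illustration of why the character encoding matters, the following sketch (again assuming Python 3's urllib.parse, not the urllib of this thread) escapes the same Cyrillic letter under KOI8-R and under UTF-8. The two escaped forms differ, so a consumer cannot recover the character without knowing which encoding was assumed.

```python
from urllib.parse import quote, unquote

letter = "я"  # CYRILLIC SMALL LETTER YA

# Same character, two different percent-encoded spellings.
as_koi8r = quote(letter, encoding="koi8-r")
as_utf8 = quote(letter, encoding="utf-8")

# Each form decodes back correctly only under its own encoding.
back_from_koi8r = unquote(as_koi8r, encoding="koi8-r")
back_from_utf8 = unquote(as_utf8, encoding="utf-8")
```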
- urlopen(), and anything else that takes a URL/URI argument, must accept both str and unicode;
Certainly.
- If given unicode, each character in the string directly represents a character in the URL/URI and needs no interpretation;
No. Only ASCII characters in the string need no interpretation. For non-ASCII characters, urllib needs to assume some escaping mechanism.
- If given str, each byte in the string represents a character in the URL/URI according to US-ASCII interpretation;
Yes, if the bytes are meaningful in ASCII.
- Characters or bytes outside the ASCII range, and even certain characters in the ASCII range, are not permitted in a URL/URI, and thus the interpretation of a string containing them may result in an exception or other unpredictable results.
Yes.
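The points agreed so far could be sketched roughly as below. This is not what urllib does; normalize_url_arg is a hypothetical helper, written for Python 3 (where the two input types are bytes and str), and the choice of UTF-8 percent-encoding for non-ASCII characters is one assumed answer to the "some escaping mechanism" question. It would also be the wrong treatment for a non-ASCII host name.

```python
from urllib.parse import quote

# Reserved characters plus '%', so existing escapes and delimiters survive.
_KEEP = "/:?#[]@!$&'()*+,;=%~"

def normalize_url_arg(url):
    """Hypothetical helper: return an ASCII-only str for a URL argument."""
    if isinstance(url, bytes):
        # Bytes are only meaningful as US-ASCII; anything else raises
        # UnicodeDecodeError.
        url = url.decode("ascii")
    # Assumption: non-ASCII (and other disallowed) characters are escaped
    # as UTF-8 percent-encoded octets.
    return quote(url, safe=_KEEP)
```

For example, normalize_url_arg("http://example.com/a b") escapes the space, while a bytes argument containing \xe9 is rejected rather than guessed at.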
- The urllib, urllib2, and urlparse modules currently do not claim to conform to any particular standards governing the interpretation of URLs; they merely acknowledge that some standards may be applicable. However, the intent is to provide standards-conformant behavior where possible, to the extent that the module APIs overlap with functionality mandated by current standards.
Yes. For input that is out of scope of existing standards, backwards compatibility is desirable, unless there is a strong indication that Python should have raised an exception for this input all along.
When the relevant standards become obsolete due to publication of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396), the implementations *may* be updated accordingly, and users should expect behavior that conforms to either the current or obsoleted standards. Which standards are applicable to a particular implementation should be documented in the module and in its functions & classes where necessary.
Yes.
- urlopen() is documented as accepting a 'url' argument that is the URL of 'a network object' that can be read; a file-like object, based on either a local file or a socket, is normally returned. This 'network object' may be a local file if the 'file' scheme is used or if the URL's scheme component is omitted.
Yes.
If RFC 1808 applies (the current implementation is based largely on this spec, which did not clearly distinguish between a reference and a URI), it is what is defined in the grammar as a URL, and if it is relative (relativeURL in the grammar), it is considered to be relative to a default base URL.
This is troublesome. What is a meaningful base URL? This should be mentioned prominently.
- In urlopen() and the URLOpener classes it depends on, the default base URI is the result of resolving the result of os.getcwd(), converted to a URL by some undocumented means, against the base 'file:///'.
(I don't think this would require a change to the implementation, but it is a principle that should be agreed upon and documented, and perhaps the nuances of getcwd vs getcwdu should be addressed).
Sounds good.
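A sketch of that principle, assuming Python 3's pathlib and urllib.parse rather than the "undocumented means" of the current implementation; default_base_uri and resolve are hypothetical names used only for illustration.

```python
from pathlib import Path
from urllib.parse import urljoin

def default_base_uri():
    """Hypothetical: the current working directory as a 'file' base URI."""
    uri = Path.cwd().as_uri()
    # A trailing slash makes relative references resolve *inside* the
    # directory instead of replacing its last path segment.
    return uri if uri.endswith("/") else uri + "/"

def resolve(reference):
    """Resolve a possibly relative reference against the default base."""
    return urljoin(default_base_uri(), reference)
```

With this, resolve("data/x.txt") yields a file URI under the working directory, while an absolute reference such as "http://example.com/" comes back unchanged.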
- The resolution of URIs having the 'file' scheme is undertaken on the local filesystem according to conventions that should be, but presently aren't, documented. A preferred mapping of filesystem paths to URIs and back should be documented for each platform.
Ok.
- In urlopen(), the processing of a 'url' argument that is syntactically absolute may be nonconformant on platforms that use ":" in their filesystem paths. On such platforms, if the first ":" in what is syntactically an absolute URL/URI appears to be intended for use other than as a scheme component delimiter, the path will be assumed to be relative. Furthermore, on Windows, '\', which is not allowed in a URL, or its equivalent percent-encoded sequence '%5C' (case-insensitive), will be interpreted as a '/' in the URL.
Ok.
(This mostly describes current behavior, assuming we can reach agreement that the "C:" in the example above should be treated no differently than "C|").
I have no problem with that. There are no one-letter URL schemata, are there?
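The heuristic under discussion might look something like the sketch below. It is an illustration only, not urllib's actual logic; looks_like_drive_path is a hypothetical name, and the rule it encodes (a single letter followed by ":" or "|" is a drive, never a scheme) is exactly the assumption being agreed on here.

```python
import re

# One letter, then ':' or '|', then a path separator or end of string.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z][:|](?:[\\/]|$)")

def looks_like_drive_path(arg):
    """Hypothetical check: is the leading 'X:' a drive, not a scheme?"""
    return bool(_DRIVE_PREFIX.match(arg))
```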
I must attend to other things right now; will comment on the other issues later.
Take your time. This has been sitting around for many releases - one more or less doesn't matter much in the global flow of things :-)
Regards, Martin

"Martin v. Löwis" wrote:
If RFC 1808 applies (the current implementation is based largely on this spec, which did not clearly distinguish between a reference and a URI), it is what is defined in the grammar as a URL, and if it is relative (relativeURL in the grammar), it is considered to be relative to a default base URL.
This is troublesome. What is a meaningful base URL? This should be mentioned prominently.
In effect, this is what happens in the current implementation, but I don't think it was ever anyone's intent to think of it in terms of standards-based resolution-to-absolute-form against a base URL, and in any event, it's not as well-documented as it should be. User expectation in most contexts, even when it doesn't apply (as in the most prominent use of relative references: HTML/XML document processing) is that relative references are relative to a base having something to do with the current working directory of the URL processor. Wrong as it often is to make such an assumption, in the case of urlopen() we have no context that would define a base URL. The documented precedent is that the 'file' scheme is assumed, and the implementation, IIRC, is such that the relative path is run through url2pathname which does very little to it, and it is then passed right to open(), so in effect the current working directory is assumed. For the sake of having a sane policy going forward, I would rather see the behavior expressed in terms that would be governed by standards, which is what I attempted to do. Luckily, the behavior is such that it is possible. There is an issue though: if disallowed/non-ASCII characters or bytes are in the urlopen() argument, and it's a relative URL, then right now the implementation is (I think) such that those characters or bytes pass through unchanged to the open() call. So if we do anything to these characters/bytes beforehand, such as %-encoding them as I think you were suggesting (see previous email), then for compatibility we'd have to specify that we're %-decoding them again in a way that results in the original characters/bytes being passed to open().
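The compatibility constraint can be sketched as a round trip. This assumes Python 3, where pathname2url and url2pathname live in urllib.request, and it uses a space as the disallowed character: whatever escaping is applied on the way in has to be undone so that open() would receive the original name.

```python
from urllib.request import pathname2url, url2pathname

# A relative name containing a character that is not allowed in a URL.
relative_path = "my file.txt"

# Escaping on the way in ...
as_url = pathname2url(relative_path)

# ... must be reversed before the name would reach open().
round_tripped = url2pathname(as_url)
```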
(This mostly describes current behavior, assuming we can reach agreement that the "C:" in the example above should be treated no differently than "C|").
I have no problem with that. There are no one-letter URL schemata, are there?
There aren't, although in principle I wish the API weren't lenient; people would quickly learn that C:\x\y\z is not a URL and C:/x/y/z is only allowed by the standards to be interpreted in one way: the one they probably don't want, and what they really need to do is learn to use file:///blahblahblah. In 4Suite's Ft.Lib.Uri we needed to conduct strictly conformant processing of URI references in our DOM, XPath, XSLT, and HTTP implementations. I found that we could use urllib for hardly anything of this sort without a great deal of working around / closing up the holes opened by all these 'conveniences'. Tightening up the conformance issues meant that we needed to help users produce valid URIs from filesystem paths and vice-versa. Once again, the core Python libs were of little use -- pathname2url and url2pathname are platform-dependent, and are so full of bugs^H^H^H^Hfeatures that I had to start from scratch and roll my own functions. I think what I've got at this point would make great additions to urllib2, but I'll save them for another day... At least with all the "OKs" you've given so far, I can submit a patch or three to get some of the documentation updated.
I must attend to other things right now; will comment on the other issues later.
Take your time. This has been sitting around for many releases - one more or less doesn't matter much in the global flow of things :-)
Heh, agreed. I wish rfc2396bis and IRIs would hurry on through the IETF's machinery. I've only been actively paying attention to the former, but they both have a lot going for them.

On Fri, 17 Sep 2004, Mike Brown wrote: [...]
Tightening up the conformance issues meant that we needed to help users produce valid URIs from filesystem paths and vice-versa. Once again, the core Python libs were of little use -- pathname2url and url2pathname are platform-dependent, and are so full of bugs^H^H^H^Hfeatures that I had to start from scratch and roll my own functions. I think what I've got at this point would make great additions to urllib2, but I'll save them for another day...
You must be worn out after those posts :-), but: Would certainly be nice to have some more compliant, perhaps less forgiving functions for those tasks, so +1 for adding your OsPathToUri() and UriToOsPath() somewhere in the stdlib. Maybe urllib2 is as good a place as any. I suppose somebody knowledgeable about both Macs and URIs must volunteer to do the Mac work first, though. John
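For readers wondering what a stricter pair of conversion functions could look like, here is a rough sketch. It is not 4Suite's OsPathToUri()/UriToOsPath(); the names os_path_to_uri and uri_to_os_path are hypothetical, and it leans on Python 3's pathlib and urllib, which postdate this thread.

```python
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname

def os_path_to_uri(path):
    """Hypothetical: turn a filesystem path into an absolute file URI."""
    # absolute() anchors relative paths at the current working directory;
    # as_uri() percent-encodes characters not allowed in a URI.
    return Path(path).absolute().as_uri()

def uri_to_os_path(uri):
    """Hypothetical: strict inverse; rejects anything but local file URIs."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI: %r" % (uri,))
    if parts.netloc not in ("", "localhost"):
        raise ValueError("non-local file URI: %r" % (uri,))
    return url2pathname(parts.path)
```

Unlike the lenient behaviour discussed above, uri_to_os_path("C:/x/y/z") raises ValueError instead of guessing that a filesystem path was meant.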
participants (3): "Martin v. Löwis", John J Lee, Mike Brown