New subject: URL processing conformance and principles (was Re: urllib.urlopen...)

Sept. 16, 2004


      {I hit send too early; here is the rest}

Mike Brown wrote:
...
Right. This part of the thread was just about how the argument to 
urllib.urlopen() should be handled when given as unicode vs str. You seemed to 
be saying it should be str because a URI is fundamentally bytes and should be 
analyzed as such, whereas I'm saying no, a URI is fundamentally characters and 
should be analyzed as such. I mentioned %-encoding and the quirk of the BNF 
just because those are aspects of the syntax that are byte-oriented and are the 
source of much confusion, and because they may have influenced your assertion.
Are we in agreement on these points?
I think I have to answer "no". The % notation is not a quirk of the BNF.
That is, when the BNF states that a URI contains %AC (say), this does
*not* mean that the actual URI, in memory or on the wire, contains the
byte \xAC. The spec says that the URI, in memory, on the wire, or on
paper, contains the three characters '%', 'A', and 'C'. So usage of that
escape mechanism is *not* a result of the BNF notation; it follows from
the inherent desire that URIs contain only ASCII characters. URIs that
contain non-ASCII characters have to escape them "somehow", typically
using the % notation.
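
A small sketch of that distinction, assuming the modern Python 3
urllib.parse module (which postdates this thread) and a made-up example
URI. The URI string holds only ASCII characters; the byte \xAC exists only
after the escape has been decoded.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical URI, used only for illustration.
uri = "http://example.org/a%ACb"

# The URI itself contains the three ASCII characters '%', 'A', 'C'.
escape_is_literal = "%AC" in uri
uri_is_ascii = all(ord(ch) < 128 for ch in uri)

# Decoding the escape is a separate step; only then does byte 0xAC appear.
decoded = unquote_to_bytes(uri)
byte_present_after_decoding = b"\xac" in decoded

print(escape_is_literal, uri_is_ascii, byte_present_after_decoding)
```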
...
-  A URL/URI consists of a finite sequence of Unicode characters;
No. A URI consists of a finite sequence of characters. Whether they
are Unicode or not is not specified. The assumption certainly is that
if the characters are coded (i.e. assigned to numbers), those numbers
don't have to match Unicode code points at all. A URI that consists
of KOI8-R characters would very well be possible.
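
To make the point about coded characters concrete, here is a sketch
(again assuming Python 3's urllib.parse; the sample character is my own
choice). The same abstract character maps to different numbers, and so to
different %-escapes, depending on which coded character set is assumed.

```python
from urllib.parse import quote, unquote

ch = "\u0434"  # CYRILLIC SMALL LETTER DE, chosen only as an example

# The same character, escaped under two different coded character sets.
koi8_form = quote(ch, encoding="koi8-r")
utf8_form = quote(ch, encoding="utf-8")

# The escapes differ, yet each decodes back to the same character,
# provided the reader assumes the same encoding as the writer.
round_trip_koi8 = unquote(koi8_form, encoding="koi8-r")
round_trip_utf8 = unquote(utf8_form, encoding="utf-8")

print(koi8_form, utf8_form)
```

Nothing in the escaped form itself says which encoding was used, which is
why the receiver has to know or assume it.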
...
-  urlopen(), and anything else that takes a URL/URI argument,
    must accept both str and unicode;
Certainly.
...
-  If given unicode, each character in the string directly represents
    a character in the URL/URI and needs no interpretation;
No. Only ASCII characters in the string need no interpretation. For
non-ASCII characters, urllib needs to assume some escaping mechanism.
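
One possible reading of "some escaping mechanism", as a minimal sketch: the
helper name escape_non_ascii and the UTF-8 default are assumptions made
for illustration, not anything urllib is documented to do. ASCII characters
pass through untouched; everything else is %-escaped from its encoded
bytes.

```python
from urllib.parse import quote

def escape_non_ascii(url, encoding="utf-8"):
    """Hypothetical helper: %-escape only the non-ASCII characters."""
    pieces = []
    for ch in url:
        if ord(ch) < 128:
            # ASCII characters need no interpretation.
            pieces.append(ch)
        else:
            # Encode with the assumed charset, then %-escape each byte.
            pieces.append(quote(ch, safe="", encoding=encoding))
    return "".join(pieces)

print(escape_non_ascii("http://example.org/caf\u00e9?q=1"))
```

The choice of encoding changes the result, so any real implementation
would have to document which one it assumes.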
...
-  If given str, each byte in the string represents a character in
    the URL/URI according to US-ASCII interpretation;
Yes, if the bytes are meaningful in ASCII.
...
-  Characters or bytes outside the ASCII range, and even certain
    characters in the ASCII range, are not permitted in a URL/URI,
    and thus the interpretation of a string containing them may
    result in an exception or other unpredictable results.
Yes.
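
A sketch of the "may result in an exception" option. The function name
check_uri_chars is hypothetical, and the rejected set is a simplified
approximation of the characters RFC 2396 excludes (controls, space,
non-ASCII, and a few delimiters); '#', '%', '[' and ']' are deliberately
left alone here because they have syntactic roles.

```python
# Simplified set of ASCII characters treated as not permitted (assumption).
DISALLOWED = set('<>"{}|\\^`')

def check_uri_chars(uri):
    """Hypothetical validator: raise ValueError on a disallowed character."""
    for index, ch in enumerate(uri):
        code = ord(ch)
        # code <= 32 covers control characters and space;
        # code > 126 covers DEL and everything outside ASCII.
        if code <= 32 or code > 126 or ch in DISALLOWED:
            raise ValueError(
                f"character {ch!r} at index {index} is not allowed in a URI"
            )
    return uri

print(check_uri_chars("http://example.org/a%20b"))
```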
...
-  The urllib, urllib2, and urlparse modules currently do not
    claim to conform to any particular standards governing the
    interpretation of URLs; they merely acknowledge that some
    standards may be applicable. However, the intent is to provide
    standards-conformant behavior where possible, to the extent 
    that the module APIs overlap with functionality mandated by
    current standards.
Yes. For input that is out of scope of existing standards, backwards
compatibility is desirable, unless there is a strong indication that
Python should have raised an exception for this input all along.
...
When the relevant standards become obsolete due to publication
    of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396),
    the implementations *may* be updated accordingly, and users
    should expect behavior that conforms to either the current or
    obsoleted standards. Which standards are applicable to a
    particular implementation should be documented in the module
    and in its functions & classes where necessary.
Yes.
...
-  urlopen() is documented as accepting a 'url' argument that is
    the URL of 'a network object' that can be read; a file-like
    object, based on either a local file or a socket, is normally
    returned. This 'network object' may be a local file if the
    'file' scheme is used or if the URL's scheme component is omitted.
Yes.
...
If RFC 1808 applies (the current implementation is based largely
    on this spec, which did not clearly distinguish between a reference
    and a URI), it is what is defined in the grammar as a URL, and
    if it is relative (relativeURL in the grammar), it is considered
    to be relative to a default base URL.
This is troublesome. What is a meaningful base URL? This should be 
mentioned prominently.
...
-  In urlopen() and the URLOpener classes it depends on, the default
    base URI is the result of resolving the result of os.getcwd(),
    converted to a URL by some undocumented means, against the base
    'file:///'.
(I don't think this would require a change to the implementation,
    but it is a principle that should be agreed upon and documented,
    and perhaps the nuances of getcwd vs getcwdu should be addressed).
Sounds good.
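
For illustration, the principle could be sketched as follows in Python 3.
The names default_base_uri and resolve are hypothetical, and pathlib's
as_uri() stands in for the "undocumented means" of turning the working
directory into a URL; this is not a description of what urlopen() does
internally.

```python
import os
from pathlib import Path
from urllib.parse import urljoin

def default_base_uri():
    """Hypothetical: the current working directory as a 'file' URI."""
    base = Path(os.getcwd()).as_uri()
    # A trailing slash makes relative references resolve *inside* the
    # directory rather than replacing its last path segment.
    return base if base.endswith("/") else base + "/"

def resolve(reference):
    """Resolve a possibly relative reference against the default base."""
    return urljoin(default_base_uri(), reference)

print(resolve("data.txt"))
print(resolve("http://example.org/x"))
```

An absolute reference is returned unchanged; only relative ones pick up
the working directory.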
...
-  The resolution of URIs having the 'file' scheme is undertaken on
    the local filesystem according to conventions that should be, but
    presently aren't, documented. A preferred mapping of filesystem
    paths to URIs and back should be documented for each platform.
Ok.
...
-  In urlopen(), the processing of a 'url' argument that is
    syntactically absolute may be nonconformant on platforms
    that use ":" in their filesystem paths. On such platforms, if the
    first ":" in what is syntactically an absolute URL/URI appears to
    be intended for use other than as a scheme component delimiter,
    the path will be assumed to be relative. Furthermore, on Windows,
    '\', which is not allowed in a URL, or its equivalent percent-
    encoded sequence '%5C' (case-insensitive), will be interpreted as
    a '/' in the URL.
Ok.
...
(This mostly describes current behavior, assuming we can reach
    agreement that the "C:" in the example above should be treated
    no differently than "C|").
I have no problem with that. There are no one-letter URL schemata,
are there?
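
A rough sketch of the heuristic under discussion, with hypothetical helper
names and made-up sample paths. It treats a single letter followed by ':'
or '|' as a drive designator rather than a scheme, and maps '\' and '%5C'
(in either case) to '/'. It is not a claim about the current urllib code.

```python
import re

# One ASCII letter followed by ':' or '|' at the very start (assumption:
# no real URL scheme is a single letter).
_DRIVE_PREFIX = re.compile(r"^[A-Za-z][:|]")

def looks_like_drive_path(url):
    """Hypothetical check: does the string start like 'C:' or 'C|'?"""
    return bool(_DRIVE_PREFIX.match(url))

def normalize_windows_separators(url):
    """Hypothetical: treat a backslash and '%5C'/'%5c' as '/'."""
    return re.sub(r"%5[Cc]", "/", url.replace("\\", "/"))

print(looks_like_drive_path("C:/temp/file.txt"))
print(normalize_windows_separators("C:\\temp%5cfile.txt"))
```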
...
I must attend to other things right now; will comment on the other issues 
later.
Take your time. This has been sitting around for many releases - one
more or less doesn't matter much in the global flow of things :-)

Regards,
Martin
