It is my assertion that urlparse is currently broken. Specifically, I think that urlparse breaks an abstraction boundary with ill effect.

In writing a mailclient, I wished to allow my users to specify their imap server as a url, such as 'imap://user:password@host:port/'. Which worked fine. I then thought that the natural extension to support configuration of imapssl would be 'imaps://user:password@host:port/'.... which failed - user:password@host:port got parsed as the *path* of the URL instead of the network location.

It turns out that urlparse keeps a table of url schemes that 'use netloc'... that is to say, that have a 'user:password@host:port' part to their URL. I think this 'special knowledge' about particular schemes 1) breaks an abstraction boundary by having a function whose charter is to pull apart a particularly-formatted string behave differently based on the meaning of the string instead of the structure of it, and 2) fails to be extensible or forward compatible due to hardcoded 'magic' strings - if schemes were somehow 'registerable' as 'netloc using' or not, then this second objection might be nullified, but the first objection would still stand.

So I propose that urlsplit, the main offender, be replaced with something that looks like:

    def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
        """Parse a URL into 5 components:
        <scheme>://<netloc>/<path>?<query>#<fragment>
        Return a 5-tuple: (scheme, netloc, path, query, fragment).
        Note that we don't break the components up in smaller bits
        (e.g. netloc is a single string) and we don't expand % escapes."""
        key = url, scheme, allow_fragments, default
        cached = _parse_cache.get(key, None)
        if cached:
            return cached
        if len(_parse_cache) >= MAX_CACHE_SIZE:  # avoid runaway growth
            clear_cache()
        if "://" in url:
            uscheme, npqf = url.split("://", 1)
        else:
            uscheme = scheme
            if not uscheme:
                uscheme = default[0]
            npqf = url
        pathidx = npqf.find('/')
        if pathidx == -1:  # not found
            netloc = npqf
            path, query, fragment = default[1:4]
        else:
            netloc = npqf[:pathidx]
            pqf = npqf[pathidx:]
            if '?' in pqf:
                path, qf = pqf.split('?', 1)
            else:
                path, qf = pqf, ''.join(default[3:5])
            if ('#' in qf) and allow_fragments:
                query, fragment = qf.split('#', 1)
            else:
                query, fragment = default[3:5]
        tuple = (uscheme, netloc, path, query, fragment)
        _parse_cache[key] = tuple
        return tuple

Note that I'm not sold on the _parse_cache, but I'm assuming it was there for a reason so I'm leaving that functionality as-is.

If this isn't the right forum for this discussion, or the right place to submit code, please let me know. Also, please cc: me directly on responses as I'm not subscribed to the firehose that is python-dev.

--pj
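The failure Paul describes is easy to reproduce with a toy model of the scheme-table behavior he is criticizing. This is an illustrative sketch only, not the actual urlparse source; the table contents here are made up:

```python
# Toy model of the behavior being criticized: only schemes listed in a
# hardcoded table get a netloc split out; everything else lands in path.
# (Illustrative sketch -- not the real urlparse code.)
uses_netloc = ['http', 'https', 'ftp', 'imap']   # note: no 'imaps'

def old_style_split(url):
    scheme, rest = url.split(':', 1)
    if scheme in uses_netloc and rest.startswith('//'):
        netloc, slash, path = rest[2:].partition('/')
        return (scheme, netloc, slash + path)
    # unknown scheme: the whole remainder is treated as the path
    return (scheme, '', rest)
```

With this model, 'imap://user:password@host:port/' yields a proper netloc, while the structurally identical 'imaps://...' URL dumps everything into the path slot, which is exactly the surprise described above.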
On Tue, Nov 22, 2005, Paul Jimenez wrote:
If this isn't the right forum for this discussion, or the right place to submit code, please let me know. Also, please cc: me directly on responses as I'm not subscribed to the firehose that is python-dev.
This is the right forum for discussion. You should post your patch to SourceForge *before* starting a discussion on python-dev, including a link to the patch in your post. It is not essential, but it is certainly a courtesy to subscribe to python-dev for the duration of the discussion; you can feel free to filter threads you're not interested in.

--
Aahz (aahz@pythoncraft.com)           <*>         http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait until you hire an amateur."  --Red Adair
Paul Jimenez wrote:
So I propose that urlsplit, the main offender, be replaced with something that looks like:
def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
+1 in principle. You should probably do a global _parse_cache and add 'is not None' after 'if cached'.
On Tue, 2005-11-22 at 23:04 -0600, Paul Jimenez wrote:
It is my assertion that urlparse is currently broken. Specifically, I think that urlparse breaks an abstraction boundary with ill effect.
In writing a mailclient, I wished to allow my users to specify their imap server as a url, such as 'imap://user:password@host:port/'. Which worked fine. I then thought that the natural extension to support
FWIW, I have a small addition related to this that I think would be
handy to add to the urlparse module. It is a pair of functions
"netlocparse()" and "netlocunparse()" that is for parsing and unparsing
"user:password@host:port" netloc's.
Feel free to use/add/ignore it...
http://minkirri.apana.org.au/~abo/projects/osVFS/netlocparse.py
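The code at that URL is not reproduced in this thread. Purely as illustration, such a pair of functions might look like the following; the names match Donovan's, but the splitting rules are my assumption, not his actual code:

```python
def netlocparse(netloc):
    """Split 'user:password@host:port' into (user, password, host, port).
    Missing pieces come back as '' (hypothetical sketch; the real
    netlocparse.py may behave differently)."""
    user = password = port = ''
    if '@' in netloc:
        # rsplit so a ':' or '@' inside the password doesn't confuse us
        userinfo, hostport = netloc.rsplit('@', 1)
        if ':' in userinfo:
            user, password = userinfo.split(':', 1)
        else:
            user = userinfo
    else:
        hostport = netloc
    if ':' in hostport:
        host, port = hostport.rsplit(':', 1)
    else:
        host = hostport
    return (user, password, host, port)

def netlocunparse(parts):
    """Inverse of netlocparse: reassemble (user, password, host, port)."""
    user, password, host, port = parts
    out = host
    if port:
        out = '%s:%s' % (out, port)
    if user or password:
        userinfo = user
        if password:
            userinfo = '%s:%s' % (user, password)
        out = '%s@%s' % (userinfo, out)
    return out
```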
--
Donovan Baarda
On 11/22/05, Paul Jimenez
It is my assertion that urlparse is currently broken. Specifically, I think that urlparse breaks an abstraction boundary with ill effect.
IIRC I did it this way because the RFC about parsing urls specifically prescribed it had to be done this way. Maybe there's a newer RFC with different rules?
In writing a mailclient, I wished to allow my users to specify their imap server as a url, such as 'imap://user:password@host:port/'. Which worked fine. I then thought that the natural extension to support configuration of imapssl would be 'imaps://user:password@host:port/'.... which failed - user:password@host:port got parsed as the *path* of the URL instead of the network location.

It turns out that urlparse keeps a table of url schemes that 'use netloc'... that is to say, that have a 'user:password@host:port' part to their URL. I think this 'special knowledge' about particular schemes 1) breaks an abstraction boundary by having a function whose charter is to pull apart a particularly-formatted string behave differently based on the meaning of the string instead of the structure of it
I disagree. You have to know what the scheme means before you can parse the rest -- there is (by design!) no standard parsing for anything that follows the scheme and the colon. I don't even think that you can trust that if the colon is followed by two slashes that what follows is a netloc for all schemes. But if there's an RFC that says otherwise I'll gladly concede; urlparse's main goal in life is to be RFC compliant. Is your opinion based on an RFC?
and 2) fails to be extensible or forward compatible due to hardcoded 'magic' strings - if schemes were somehow 'registerable' as 'netloc using' or not, then this objection might be nullified, but the previous objection would still stand.
I think it is reasonable to propose an extension whereby one can register a parser (or parsing flags like uses_netloc) for a specific scheme, presuming there won't be conflicting registrations (which should only happen if two independently developed libraries have a different use for the same scheme -- a failure of standardization).
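As a rough sketch of what such a registration hook might look like (a hypothetical API, not an actual stdlib proposal):

```python
# Hypothetical extension point: third-party code declares how a new
# scheme should be parsed instead of relying on a frozen built-in table.
uses_netloc = ['http', 'https', 'ftp', 'imap']   # illustrative contents

def register_scheme(scheme, has_netloc=True):
    """Declare whether <scheme>://... carries a netloc component."""
    if has_netloc:
        if scheme not in uses_netloc:
            uses_netloc.append(scheme)
    elif scheme in uses_netloc:
        uses_netloc.remove(scheme)

register_scheme('imaps')   # an SSL IMAP library could do this at import time
```

A conflicting-registration check (e.g. raising if two libraries register the same scheme with different flags) could be layered on top of this.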
So I propose that urlsplit, the main offender, be replaced with something that looks like:
def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
Since you don't present your new code in diff format, could you explain in English how what it does differs from the original? Or perhaps you could present some unit tests (doctest would be ideal) showing the desired behavior of the proposed code (I understand from later posts that it may have some bugs). (For example, why add the default parameter?)
Note that I'm not sold on the _parse_cache, but I'm assuming it was there for a reason so I'm leaving that functionality as-is.
There's also a special case for http; given that the code is rather general and hence slow, it makes sense that it attempts some optimizations, and removing these might cause a nasty surprise for some users.
If this isn't the right forum for this discussion, or the right place to submit code, please let me know.
Please do submit patches to SF if you want them to be discussed.
Also, please cc: me directly on responses as I'm not subscribed to the firehose that is python-dev.
ACK.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
IIRC I did it this way because the RFC about parsing urls specifically prescribed it had to be done this way.
That was true as of RFC 1808 (1995-1998), although the grammar actually allowed for a more generic interpretation. Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular expression for parsing URI 'references' (a formal abstraction introduced in 2396) into 5 components (not six, since 'params' were moved into 'path' and eventually became an option on every path segment, not just the end of the path). The 5 components are: scheme, authority (formerly netloc), path, query, fragment.

Parsing could result in some components being undefined, which is distinct from being empty (e.g., 'mailto:foo@bar?' would have an undefined authority and fragment, and a defined, but empty, query).

RFC 3986 / STD 66 (2005-) did not change the regular expression, but makes several references to these '5 major components' of a URI, and says that these components are scheme-independent; parsers that operate at the generic syntax level "can parse any URI reference into its major components. Once the scheme is determined, further scheme-specific parsing can be performed on the components."
You have to know what the scheme means before you can parse the rest -- there is (by design!) no standard parsing for anything that follows the scheme and the colon.
Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI references can be interpreted as having the 5 components, it was made explicit in RFC 3986 / STD 66.
I don't even think that you can trust that if the colon is followed by two slashes that what follows is a netloc for all schemes.
You can.
But if there's an RFC that says otherwise I'll gladly concede; urlparse's main goal in life is to be RFC compliant.
Its intent seems to be to split a URI into its major components, which are now by definition scheme-independent (and have been, implicitly, for a long time), so the function shouldn't distinguish between schemes.

Do you want to keep returning that 6-tuple, or can we make it return a 5-tuple? If we keep returning 'params' for backward compatibility, then that means the 'path' we are returning is not the 'path' that people would expect (they'll have to concatenate path+params to get what the generic syntax calls a 'path' nowadays). It's also deceptive because params are now allowed on all path segments, and the current function only takes them from the last segment.

Also for backward compatibility, should an absent component continue to manifest in the result as an empty string? I think a compliant parser should make a distinction between absent and empty (it could make a difference, in theory). If a regular expression were used for parsing, it would produce None for absent components and empty-string for empty ones. I implemented it this way in 4Suite's Ft.Lib.Uri and it works nicely.

Mike
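For reference, the regular expression Mike describes (it appears in RFC 3986, Appendix B) can be applied directly. A minimal sketch that, like Ft.Lib.Uri, distinguishes absent (None) from empty ('') components:

```python
import re

# The pattern from RFC 3986, Appendix B. Groups 2/4/5/7/9 hold the five
# major components: scheme, authority, path, query, fragment.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def urisplit(uriref):
    """Split any URI reference into its 5 scheme-independent components.
    Absent components come back as None, empty ones as ''."""
    m = URI_RE.match(uriref)
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
```

For 'mailto:foo@bar?' this yields ('mailto', None, 'foo@bar', '', None): the authority and fragment are absent, while the query is present but empty. The scheme table disappears entirely, so 'imaps' URLs parse their authority just like 'imap' ones.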
OK, you've convinced me. But for backwards compatibility (until Python 3000), a new API should be designed. We can't change the old API in an incompatible way. Please submit complete code + docs to SF. (If you think this requires much design work, a PEP may be in order, but I think that given the new RFCs it's probably straightforward enough not to require that.)
--Guido
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
participants (6)
- Aahz
- Donovan Baarda
- Guido van Rossum
- John J Lee
- Mike Brown
- Paul Jimenez