[Python-Dev] urlparse brokenness
Paul Jimenez
pj at place.org
Wed Nov 23 06:04:55 CET 2005
It is my assertion that urlparse is currently broken. Specifically, I
think that urlparse breaks an abstraction boundary with ill effect.
In writing a mail client, I wished to allow my users to specify their
IMAP server as a URL, such as 'imap://user:password@host:port/', which
worked fine. I then thought that the natural extension to support
configuration of imapssl would be 'imaps://user:password@host:port/',
which failed: user:password@host:port got parsed as the *path* of
the URL instead of the network location. It turns out that urlparse
keeps a table of URL schemes that 'use netloc', that is to say,
that have a 'user:password@host:port' part to their URL. I think this
'special knowledge' about particular schemes 1) breaks an abstraction
boundary by having a function whose charter is to pull apart a
particularly-formatted string behave differently based on the meaning of
the string instead of the structure of it and 2) fails to be extensible
or forward compatible due to hardcoded 'magic' strings. If schemes were
somehow 'registerable' as 'netloc using' or not, then the second objection
might be nullified, but the first objection would still stand.
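
The table-driven behaviour described above can be illustrated with a small
standalone sketch. This is not the actual urlparse source: the name
split_with_table and the contents of USES_NETLOC are assumptions made up
for the illustration, and the function only separates scheme, netloc and
the remainder.

```python
# Hypothetical illustration of a scheme table; not the real urlparse code.
USES_NETLOC = {"http", "https", "ftp", "imap"}  # assumed table contents


def split_with_table(url):
    """Return (scheme, netloc, rest); '//' is honored only for listed schemes."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        # No scheme separator at all: treat the whole string as the remainder.
        return "", "", url
    netloc = ""
    if scheme in USES_NETLOC and rest.startswith("//"):
        # The network location runs from after '//' to the next '/'.
        end = rest.find("/", 2)
        if end == -1:
            end = len(rest)
        netloc, rest = rest[2:end], rest[end:]
    return scheme, netloc, rest


print(split_with_table("imap://user:password@host:143/"))
print(split_with_table("imaps://user:password@host:993/"))
```

With this table, the imap URL yields a netloc of 'user:password@host:143',
while the structurally identical imaps URL yields an empty netloc and
leaves '//user:password@host:993/' in the remainder, purely because
'imaps' is missing from the set.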
So I propose that urlsplit, the main offender, be replaced with something
that looks like:
def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    key = url, scheme, allow_fragments, default
    cached = _parse_cache.get(key, None)
    if cached:
        return cached
    if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
        clear_cache()
    if "://" in url:
        uscheme, npqf = url.split("://", 1)
    else:
        uscheme = scheme
        if not uscheme:
            uscheme = default[0]
        npqf = url
    # netloc runs up to the first '/'; the rest is path?query#fragment
    pathidx = npqf.find('/')
    if pathidx == -1: # not found
        netloc = npqf
        path, query, fragment = default[2:5]
    else:
        netloc = npqf[:pathidx]
        path = npqf[pathidx:]
        query, fragment = default[3:5]
        # split off the fragment first so '/path#frag' works without a query
        if allow_fragments and '#' in path:
            path, fragment = path.split('#', 1)
        if '?' in path:
            path, query = path.split('?', 1)
    result = (uscheme, netloc, path, query, fragment)
    _parse_cache[key] = result
    return result
Note that I'm not sold on the _parse_cache, but I'm assuming it was there
for a reason so I'm leaving that functionality as-is.
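
To try the structure-only idea outside the urlparse module, here is a
cut-down, self-contained sketch. It leaves out the cache and the 'default'
tuple, and the name split_by_structure is made up for the example, so treat
it as an approximation of the proposal rather than the proposed patch.

```python
def split_by_structure(url, default_scheme=""):
    """Structure-only split into (scheme, netloc, path, query, fragment)."""
    if "://" in url:
        scheme, rest = url.split("://", 1)
    else:
        scheme, rest = default_scheme, url
    # Everything up to the first '/' is the network location.
    slash = rest.find("/")
    if slash == -1:
        return scheme, rest, "", "", ""
    netloc, rest = rest[:slash], rest[slash:]
    # Peel the fragment off first, then the query.
    rest, _, fragment = rest.partition("#")
    path, _, query = rest.partition("?")
    return scheme, netloc, path, query, fragment


for url in ("imap://user:password@host:143/",
            "imaps://user:password@host:993/INBOX?unseen#top"):
    print(split_by_structure(url))
```

Both URLs come back with 'user:password@host:port' in the netloc slot,
whatever the scheme happens to be.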
If this isn't the right forum for this discussion, or the right place to
submit code, please let me know. Also, please cc: me directly on responses
as I'm not subscribed to the firehose that is python-dev.
--pj