[Python-Dev] urlparse brokenness

Wed Nov 23 06:04:55 CET 2005

It is my assertion that urlparse is currently broken.  Specifically, I 
think that urlparse breaks an abstraction boundary with ill effect.

In writing a mail client, I wished to allow my users to specify their
IMAP server as a URL, such as 'imap://user:password@host:port/', which
worked fine. I then thought that the natural extension to support
configuration of IMAP over SSL would be 'imaps://user:password@host:port/',
which failed: 'user:password@host:port' got parsed as the *path* of
the URL instead of the network location. It turns out that urlparse
keeps a table of URL schemes that 'use netloc', that is to say,
that have a 'user:password@host:port' part to their URL. I think this
'special knowledge' about particular schemes 1) breaks an abstraction
boundary by having a function whose charter is to pull apart a
particularly-formatted string behave differently based on the meaning of
the string instead of the structure of it, and 2) fails to be extensible
or forward compatible due to hardcoded 'magic' strings. If schemes were
somehow 'registerable' as 'netloc using' or not, then the second objection
might be nullified, but the first would still stand.
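
To make the objection concrete, here is a minimal sketch contrasting the two
approaches. It is not the actual urlparse source: USES_NETLOC,
split_by_scheme_table, split_by_structure and the helper _take_netloc are
hypothetical names, the scheme list is an assumed subset chosen for
illustration, and the host and port values are made up.

```python
# Hypothetical, simplified illustration; not the real urlparse code.
USES_NETLOC = ['ftp', 'http', 'imap']  # assumed subset of "known" schemes


def _take_netloc(rest):
    """Split '//netloc/path...' into (netloc, '/path...')."""
    end = rest.find('/', 2)
    if end == -1:  # no path after the netloc
        end = len(rest)
    return rest[2:end], rest[end:]


def split_by_scheme_table(url):
    """Recognise a netloc only when the scheme is listed in the table."""
    scheme, rest = url.split(':', 1)
    netloc = ''
    if scheme in USES_NETLOC and rest[:2] == '//':
        netloc, rest = _take_netloc(rest)
    return scheme, netloc, rest


def split_by_structure(url):
    """Recognise a netloc for any scheme, based on the '//' syntax alone."""
    scheme, rest = url.split(':', 1)
    netloc = ''
    if rest[:2] == '//':
        netloc, rest = _take_netloc(rest)
    return scheme, netloc, rest


print(split_by_scheme_table('imap://user:password@host:143/'))
# ('imap', 'user:password@host:143', '/')
print(split_by_scheme_table('imaps://user:password@host:993/'))
# ('imaps', '', '//user:password@host:993/')
print(split_by_structure('imaps://user:password@host:993/'))
# ('imaps', 'user:password@host:993', '/')
```

The table-driven version gives a different answer for two strings with the
same structure, purely because one scheme name is missing from the list.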

So I propose that urlsplit, the main offender, be replaced with something
that looks like:

def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    key = url, scheme, allow_fragments, default
    cached = _parse_cache.get(key, None)
    if cached:
        return cached
    if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
        clear_cache()

    if "://" in url:
        uscheme, npqf = url.split("://", 1)
    else:
        uscheme = scheme
        if not uscheme:
            uscheme = default[0]
        npqf = url
    pathidx = npqf.find('/')
    if pathidx == -1:  # not found
        netloc = npqf
        path, query, fragment = default[2:5]  # default is (scheme, netloc, path, query, fragment)
    else:
        netloc = npqf[:pathidx]
        pqf = npqf[pathidx:]
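        # Split off the fragment first, then the query, so that a URL
        # carrying only one of the two is still handled correctly.
        path, query, fragment = pqf, default[3], default[4]
        if allow_fragments and '#' in path:
            path, fragment = path.split('#', 1)
        if '?' in path:
            path, query = path.split('?', 1)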
    result = (uscheme, netloc, path, query, fragment)  # avoid shadowing builtin tuple
    _parse_cache[key] = result
    return result

Note that I'm not sold on the _parse_cache, but I'm assuming it was there
for a reason so I'm leaving that functionality as-is.

If this isn't the right forum for this discussion, or the right place to 
submit code, please let me know.  Also, please cc: me directly on responses
as I'm not subscribed to the firehose that is python-dev.

  --pj