[Python-Dev] urlparse brokenness

Mon Nov 28 03:24:12 CET 2005

On 11/22/05, Paul Jimenez <pj at place.org> wrote:
>
> It is my assertion that urlparse is currently broken.  Specifically, I
> think that urlparse breaks an abstraction boundary with ill effect.

IIRC I did it this way because the RFC about parsing urls specifically
prescribed it had to be done this way. Maybe there's a newer RFC with
different rules?

> In writing a mailclient, I wished to allow my users to specify their
> imap server as a url, such as 'imap://user:password@host:port/'. Which
> worked fine. I then thought that the natural extension to support
> configuration of imapssl would be 'imaps://user:password@host:port/'....
> which failed - user:password@host:port got parsed as the *path* of
> the URL instead of the network location. It turns out that urlparse
> keeps a table of url schemes that 'use netloc'... that is to say,
> that have a 'user:password at host:port' part to their URL. I think this
> 'special knowledge' about particular schemes 1) breaks an abstraction
> boundary by having a function whose charter is to pull apart a
> particularly-formatted string behave differently based on the meaning of
> the string instead of the structure of it

I disagree. You have to know what the scheme means before you can
parse the rest -- there is (by design!) no standard parsing for
anything that follows the scheme and the colon. I don't even think
that you can trust that if the colon is followed by two slashes that
what follows is a netloc for all schemes.

But if there's an RFC that says otherwise I'll gladly concede;
urlparse's main goal in life is to be RFC compliant. Is your opinion
based on an RFC?
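
To illustrate the point that what follows the colon has no single
shape, here is a small sketch. It assumes the Python 3 urllib.parse
module (the successor of urlparse), whose behavior may differ from
the module discussed in this thread; the URLs are made-up examples.

```python
from urllib.parse import urlsplit

# Hierarchical URL: '//' introduces a network location.
http_parts = urlsplit("http://www.python.org/doc/")

# mailto URL: no '//' and no netloc; everything after the colon is
# scheme-specific and ends up in the path component.
mail_parts = urlsplit("mailto:someone@example.org")

print(http_parts.netloc, http_parts.path)
print(repr(mail_parts.netloc), mail_parts.path)
```

The '@' in the mailto URL is not a userinfo separator; only knowledge
of the scheme tells you that.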

> and 2) fails to be extensible
> or forward compatible due to hardcoded 'magic' strings - if schemes were
> somehow 'registerable' as 'netloc using' or not, then this objection
> might be nullified, but the previous objection would still stand.

I think it is reasonable to propose an extension whereby one can
register a parser (or parsing flags like uses_netloc) for a specific
scheme, presuming there won't be conflicting registrations (which
should only happen if two independently developed libraries have a
different use for the same scheme -- a failure of standardization).
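
A minimal sketch of what such a registration hook could look like.
Everything here is hypothetical: register_scheme, split_netloc and
_scheme_flags are illustrative names, not part of urlparse, and the
splitting is deliberately simplified (no query or fragment handling).

```python
# Hypothetical registry; none of these names exist in urlparse.
_scheme_flags = {}

def register_scheme(scheme, uses_netloc):
    """Record whether `scheme` has a '//netloc' part; reject conflicts."""
    scheme = scheme.lower()
    previous = _scheme_flags.get(scheme)
    if previous is not None and previous != uses_netloc:
        raise ValueError("conflicting registration for scheme %r" % scheme)
    _scheme_flags[scheme] = uses_netloc

def split_netloc(url):
    """Return (scheme, netloc, rest), consulting the registered flag."""
    scheme, sep, rest = url.partition(":")
    if not sep:
        # No colon at all: treat the whole string as scheme-less.
        return "", "", url
    scheme = scheme.lower()
    # Only schemes registered as netloc-using get their '//' interpreted.
    if _scheme_flags.get(scheme, False) and rest.startswith("//"):
        netloc, slash, path = rest[2:].partition("/")
        return scheme, netloc, slash + path
    return scheme, "", rest

register_scheme("imap", True)
register_scheme("imaps", True)

print(split_netloc("imaps://user:password@host:993/"))
print(split_netloc("unknown://left/alone"))
```

Registering the same scheme twice with the same flag is harmless in
this sketch; only a contradictory registration raises ValueError.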

> So I propose that urlsplit, the main offender, be replaced with something
> that looks like:
>
> def urlsplit(url, scheme='', allow_fragments=1, default=('','','','','')):

Since you don't present your new code in diff format, could you
explain in English how what it does differs from the original? Or
perhaps you could present some unit tests (doctest would be ideal)
showing the desired behavior of the proposed code (I understand from
later posts that it may have some bugs). (For example, why add the
default parameter?)
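
For instance, a doctest of roughly this shape would do. This is only
a sketch: it assumes the desired behavior is that an imaps URL splits
the same way an imap URL does, it is written against urlsplit from
the Python 3 urllib.parse module rather than the proposed code, and
the function name and port number are made up for illustration.

```python
from urllib.parse import urlsplit

def imaps_example():
    """Hypothetical holder for a doctest of the desired behavior.

    >>> parts = urlsplit('imaps://user:password@host:993/')
    >>> parts.scheme
    'imaps'
    >>> parts.netloc
    'user:password@host:993'
    >>> parts.path
    '/'
    """
```

Saved to a file, this can be checked with "python -m doctest" followed
by the file name.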

> Note that I'm not sold on the _parse_cache, but I'm assuming it was there
> for a reason so I'm leaving that functionality as-is.

There's also a special case for http; given that the code is rather
general and hence slow, it makes sense that it attempts some
optimizations, and removing these might cause a nasty surprise for
some users.

> If this isn't the right forum for this discussion, or the right place to
> submit code, please let me know.

Please do submit patches to SF if you want them to be discussed.

> Also, please cc: me directly on responses
> as I'm not subscribed to the firehose that is python-dev.

ACK.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)