[Python-Dev] urlparse brokenness
mike at skew.org
Mon Nov 28 06:07:08 CET 2005
Guido van Rossum wrote:
> IIRC I did it this way because the RFC about parsing urls specifically
> prescribed it had to be done this way.
That was true as of RFC 1808 (1995-1998), although the grammar actually
allowed for a more generic interpretation.
Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular
expression for parsing URI 'references' (a formal abstraction introduced in
2396) into 5 components (not six, since 'params' were moved into 'path'
and eventually became an option on every path segment, not just the end
of the path). The 5 components are:
scheme, authority (formerly netloc), path, query, fragment.
Parsing could result in some components being undefined, which is distinct
from being empty (e.g., 'mailto:foo@bar?' would have an undefined authority
and fragment, and a defined, but empty, query).
RFC 3986 / STD 66 (2005-) did not change the regular expression, but makes
several references to these '5 major components' of a URI, and says that these
components are scheme-independent; parsers that operate at the generic syntax
level "can parse any URI reference into its major components. Once the scheme
is determined, further scheme-specific parsing can be performed on the
components."
> You have to know what the scheme means before you can
> parse the rest -- there is (by design!) no standard parsing for
> anything that follows the scheme and the colon.
Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI
references can be interpreted as having the 5 components; this was made explicit
in RFC 3986 / STD 66.
> I don't even think
> that you can trust that if the colon is followed by two slashes that
> what follows is a netloc for all schemes.
> But if there's an RFC that says otherwise I'll gladly concede;
> urlparse's main goal in life is to be RFC compliant.
Its intent seems to be to split a URI into its major components, which are now
by definition scheme-independent (and have been, implicitly, for a long time),
so the function shouldn't distinguish between schemes.
Do you want to keep returning that 6-tuple, or can we make it return a
5-tuple? If we keep returning 'params' for backward compatibility, then that
means the 'path' we are returning is not the 'path' that people would expect
(they'll have to concatenate path+params to get what the generic syntax calls
a 'path' nowadays). It's also deceptive because params are now allowed on all
path segments, and the current function only takes them from the last segment.
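To make the params problem concrete, here is a small sketch. It assumes the
modern spelling of the module (urllib.parse in Python 3, rather than the
urlparse module under discussion) and a made-up example.org URL with a
parameter on two path segments. Only the parameter on the last segment ends
up in 'params'; the earlier one stays buried in 'path'.

```python
from urllib.parse import urlparse

# Hypothetical URL: parameters on two different path segments.
url = "http://example.org/b;x=1/c;y=2"

parts = urlparse(url)
print(parts.path)    # /b;x=1/c  (the first parameter is still in the path)
print(parts.params)  # y=2       (only the last segment's parameter is split off)

# What the generic syntax calls the 'path' has to be reassembled by hand.
generic_path = parts.path + (";" + parts.params if parts.params else "")
print(generic_path)  # /b;x=1/c;y=2
```

Note that the reassembly is itself lossy: a path ending in a bare ';' and one
with no ';' at all both come back with params == ''.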
Also for backward compatibility, should an absent component continue to
manifest in the result as an empty string? I think a compliant parser should
make a distinction between absent and empty (it could make a difference, in
If a regular expression were used for parsing, it would produce None for
absent components and empty-string for empty ones. I implemented it this
way in 4Suite's Ft.Lib.Uri and it works nicely.
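A minimal sketch of the regular-expression approach, for illustration. The
pattern is the one given in RFC 3986, Appendix B; the function name
split_uri_ref is made up here, and this is not claimed to be the code in
Ft.Lib.Uri.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REF = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri_ref(uri_ref):
    """Hypothetical helper: (scheme, authority, path, query, fragment).

    Absent components come back as None, empty ones as ''.
    """
    m = URI_REF.match(uri_ref)  # every part is optional, so this always matches
    # Groups 2, 4, 5, 7 and 9 are the five major components.
    return m.group(2, 4, 5, 7, 9)

print(split_uri_ref("mailto:foo@bar?"))
# ('mailto', None, 'foo@bar', '', None)
print(split_uri_ref("mailto:foo@bar"))
# ('mailto', None, 'foo@bar', None, None)
```

The two results differ only in the query: '' when the '?' is present with
nothing after it, None when there is no '?' at all. The path is never None,
since the generic syntax always defines a path, possibly an empty one.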