[Python-Dev] urlparse brokenness

Guido van Rossum guido at python.org
Mon Nov 28 15:53:35 CET 2005


OK, you've convinced me. But for backwards compatibility (until Python
3000), a new API should be designed. We can't change the old API in an
incompatible way. Please submit complete code + docs to SF. (If you
think this requires much design work, a PEP may be in order but I
think that given the new RFCs it's probably straightforward enough to
not require that.)

--Guido

On 11/27/05, Mike Brown <mike at skew.org> wrote:
> Guido van Rossum wrote:
> > IIRC I did it this way because the RFC about parsing urls specifically
> > prescribed it had to be done this way.
>
> That was true as of RFC 1808 (1995-1998), although the grammar actually
> allowed for a more generic interpretation.
>
> Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular
> expression for parsing URI 'references' (a formal abstraction introduced in
> 2396) into 5 components (not six, since 'params' were moved into 'path'
> and eventually became an option on every path segment, not just the end
> of the path). The 5 components are:
>
>   scheme, authority (formerly netloc), path, query, fragment.
>
> Parsing could result in some components being undefined, which is distinct
> from being empty (e.g., 'mailto:foo@bar?' would have an undefined authority
> and fragment, and a defined, but empty, query).
>
> RFC 3986 / STD 66 (2005-) did not change the regular expression, but makes
> several references to these '5 major components' of a URI, and says that these
> components are scheme-independent; parsers that operate at the generic syntax
> level "can parse any URI reference into its major components. Once the scheme
> is determined, further scheme-specific parsing can be performed on the
> components."
>
> > You have to know what the scheme means before you can
> > parse the rest -- there is (by design!) no standard parsing for
> > anything that follows the scheme and the colon.
>
> Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI
> references can be interpreted as having the 5 components; it was made explicit
> in RFC 3986 / STD 66.
>
> > I don't even think
> > that you can trust that if the colon is followed by two slashes that
> > what follows is a netloc for all schemes.
>
> You can.
>
> > But if there's an RFC that says otherwise I'll gladly concede;
> > urlparse's main goal in life is to be RFC compliant.
>
> Its intent seems to be to split a URI into its major components, which are now
> by definition scheme-independent (and have been, implicitly, for a long time),
> so the function shouldn't distinguish between schemes.
>
> Do you want to keep returning that 6-tuple, or can we make it return a
> 5-tuple? If we keep returning 'params' for backward compatibility, then that
> means the 'path' we are returning is not the 'path' that people would expect
> (they'll have to concatenate path+params to get what the generic syntax calls
> a 'path' nowadays). It's also deceptive because params are now allowed on all
> path segments, and the current function only takes them from the last segment.
>
> Also for backward compatibility, should an absent component continue to
> manifest in the result as an empty string? I think a compliant parser should
> make a distinction between absent and empty (it could make a difference, in
> theory).
>
> If a regular expression were used for parsing, it would produce None for
> absent components and empty-string for empty ones. I implemented it this
> way in 4Suite's Ft.Lib.Uri and it works nicely.
>
> Mike
>

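For illustration, here is a minimal sketch of the regular-expression split
described in the quoted message. The pattern is the one given in RFC 3986,
Appendix B; the names URI_REF and split_uri_ref are hypothetical and are not
part of urlparse or 4Suite. It only shows how absent components come back as
None while empty ones come back as '':

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups 2, 4, 5, 7 and 9 hold scheme, authority, path, query and fragment.
URI_REF = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)

def split_uri_ref(uri_ref):
    """Split a URI reference into its 5 major components.

    Hypothetical helper: returns (scheme, authority, path, query, fragment).
    A component that is absent is None; one that is present but empty is ''.
    The path is always defined, though it may be empty.
    """
    # Every part of the pattern is optional, so the match always succeeds.
    match = URI_REF.match(uri_ref)
    return match.group(2, 4, 5, 7, 9)

# The example from the message: authority and fragment are undefined,
# the query is defined but empty.
print(split_uri_ref('mailto:foo@bar?'))
```

Because the split never looks at the scheme, the same function handles
'http://example.org/a;x/b;y?q=1#top' and 'mailto:foo@bar?' alike; any
scheme-specific parsing would happen afterwards, on the individual components.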

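The point about 'params' can be sketched the same way. This example assumes
Python 3, where the urlparse module discussed here lives on as urllib.parse;
generic_path is a hypothetical helper, not an existing API. It shows that
params are taken from the last path segment only, and that the caller has to
rejoin path and params to get the 'path' of the generic syntax:

```python
from urllib.parse import urlparse

def generic_path(url):
    """Rebuild the generic-syntax path from urlparse's 6-tuple.

    Hypothetical helper. It rejoins 'path' and 'params' with ';'.
    Caveat: an empty params field (a trailing ';') cannot be told apart
    from an absent one, so that ';' is not restored.
    """
    parts = urlparse(url)
    if parts.params:
        return parts.path + ';' + parts.params
    return parts.path

parts = urlparse('http://example.org/a;x/b;y')
# Only ';y' on the last segment is split off; ';x' stays inside the path.
print(parts.path, parts.params)
print(generic_path('http://example.org/a;x/b;y'))
```
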
--
--Guido van Rossum (home page: http://www.python.org/~guido/)

