[Python-Dev] urlparse.urlunsplit should be smarter about +

Tue May 11 06:55:30 CEST 2010

On Mon, May 10, 2010 at 05:56:29PM +0900, Stephen J. Turnbull wrote:
> Senthil Kumaran writes:
> 
>  > I should have said, 'treatment of urls with authority' and 'treatment
>  > of urls without authority' in terms of parsing and joining is as per
>  > RFC.  How it is doing practically is by maintaining a list of urls
>  > with known scheme names which use_netloc.
> 
> Why do that if you can get better behavior based purely on syntactic
> analysis?

For plain parsing and splitting, the purely syntactic behaviour is
good enough. I agree with your comments and your restatement of the RFC
rules in the previous emails.

The problem, as we know, comes while unparsing and joining. (I have
also not yet looked at the relative URL joining behaviour, where
redundant /'s can be ignored.)

As you may already know, when the data is

ParseResult(scheme='file', netloc='', path='/tmp/junk.txt', params='',
query='', fragment='')

you might expect the output to be file:///tmp/junk.txt, and the
original URL may well have been the same.

But for:
ParseResult(scheme='x', netloc='', path='/y', params='', query='',
fragment='')

One can expect a valid output to be: x:/y
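
To make the two cases concrete, here is a minimal sketch of how
urlunsplit treats them. It assumes Python 3, where the module lives at
urllib.parse (it is urlparse in Python 2), and it peeks at the
undocumented module-level list uses_netloc:

```python
try:
    from urllib.parse import urlunsplit, uses_netloc  # Python 3
except ImportError:
    from urlparse import urlunsplit, uses_netloc      # Python 2

# 'file' is listed in uses_netloc, so an empty netloc still yields "//".
file_url = urlunsplit(('file', '', '/tmp/junk.txt', '', ''))

# 'x' is not listed, so the empty netloc is dropped when unsplitting.
x_url = urlunsplit(('x', '', '/y', '', ''))

print('file' in uses_netloc, file_url)
print('x' in uses_netloc, x_url)
```

The only thing separating the two results is membership in
uses_netloc; the tuples themselves carry the same empty netloc.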

Your suggestion of differentiating netloc/authority by '' versus None
seems a good one to analyze.
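
A rough sketch of how I read that idea, not existing urlparse
behaviour: the helper name unsplit_none_aware is hypothetical, and I am
assuming None means "no authority component" while '' means "authority
present but empty":

```python
def unsplit_none_aware(scheme, netloc, path, query='', fragment=''):
    """Hypothetical unsplit: netloc=None omits "//", netloc='' keeps it."""
    url = path
    if netloc is not None:
        # Authority is present (possibly empty), so always emit "//".
        url = '//' + netloc + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

print(unsplit_none_aware('file', '', '/tmp/junk.txt'))  # empty authority
print(unsplit_none_aware('x', None, '/y'))              # no authority
print(unsplit_none_aware('x', '', '/y'))                # empty authority
```

With this, the caller's data decides whether "//" appears, and no
per-scheme list is consulted. The sketch does no validation (for
instance, it does not check that the path starts with "/" when an
authority is present).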

Also, by keeping a registry of valid schemes, are you not proposing
something very similar to uses_netloc? But with a different API to
handle parsing based on registry values. Is my understanding of your
proposal correct?
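
To illustrate why I see the two as similar: uses_netloc already acts
as a small scheme registry. The sketch below (Python 3 import assumed,
with a Python 2 fallback; unsplit_as_registered is a hypothetical
helper, and uses_netloc is an undocumented module-level list) registers
a scheme temporarily and shows that the unsplit result changes:

```python
try:
    from urllib.parse import urlunsplit, uses_netloc  # Python 3
except ImportError:
    from urlparse import urlunsplit, uses_netloc      # Python 2

def unsplit_as_registered(parts, scheme):
    """Unsplit parts while scheme is temporarily listed in uses_netloc."""
    added = scheme not in uses_netloc
    if added:
        uses_netloc.append(scheme)
    try:
        return urlunsplit(parts)
    finally:
        # Restore the module-level list so other callers are unaffected.
        if added:
            uses_netloc.remove(scheme)

parts = ('x', '', '/y', '', '')
before = urlunsplit(parts)                   # 'x' not registered
during = unsplit_as_registered(parts, 'x')   # 'x' registered
print(before, during)
```

Mutating a module-level list like this is a hack for demonstration
only; a real registry API would presumably expose this per scheme in a
cleaner way.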

FWIW, I looked at the history of the uses_netloc list and it seems that
it has been there since the first version, when the urlparse module
followed different RFC specs for different protocols (telnet, sip etc.),
so any changes should be incorporated carefully so as not to break
existing solutions.

-- 
Senthil