[Python-Dev] Finally switch urllib.parse to RFC3986 semantics?

Senthil Kumaran orsenthil at gmail.com
Wed Mar 16 05:03:00 CET 2011


Nick Coghlan wrote:
> 
> Backwards compatible with *what* though?

I meant the parsing 'behavior'.

> For the decimal module, we treat deviations from spec as bug fixes and
> update accordingly, even if this changes behaviour.
> 
> For URL parsing, the spec has changed (6 years ago!), but we still
> don't provide a spec-conformant implementation, even via a flag or new
> function.

If I understand correctly, by a spec-conformant implementation you mean
one whose parsed components use the same terminology (and show the same
behavior) as written in RFC 3986.

As in this example from the RFC, which labels the components of a URI:


         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

If I pass the same URLs to urlparse at the moment, I get:

>>> urlparse('foo://example.com:8042/over/there?name=ferret#nose')
ParseResult(scheme='foo', netloc='example.com:8042', path='/over/there?name=ferret#nose', params='', query='', fragment='')
>>> urlparse('urn:example:animal:ferret:nose')
ParseResult(scheme='urn', netloc='', path='example:animal:ferret:nose', params='', query='', fragment='')

The first result is because we still have the "old" scheme-specific parsing
behavior: foo is an unrecognized scheme, so everything after the netloc is
classified under path. With a recognized scheme name, the parsing behaviour
would match the expectation.

- A change to this would break compatibility with the older parsing behavior.
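
For comparison, here is a small sketch of the same URL with a scheme the
module does recognize. The choice of 'http' is only an illustration, not
something taken from the RFC example:

```python
from urllib.parse import urlparse

# Same URL as in the RFC example, but with 'http' (a scheme urlparse
# recognizes) substituted for 'foo'. The substitution is an assumption
# made purely for illustration.
result = urlparse('http://example.com:8042/over/there?name=ferret#nose')

# query and fragment are split out of the path for a recognized scheme
print(result.path, result.query, result.fragment)
```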

Another point to note is naming. We use 'netloc' loosely as a part name,
whereas 'authority' is the correct term, and the authority component in turn
has sub-parts.

- I think it is good to change this and adopt the RFC terminology more rigorously.
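
To make the sub-parts concrete: RFC 3986 defines authority as
[ userinfo "@" ] host [ ":" port ]. A rough sketch of splitting it along
those lines follows. The name split_authority is hypothetical, nothing like
it exists in urllib.parse, and it does no validation:

```python
def split_authority(authority):
    """Split an authority string into (userinfo, host, port).

    Hypothetical helper for illustration only; missing parts are None.
    """
    # userinfo is everything before the last '@', if there is one
    userinfo, at, hostport = authority.rpartition('@')
    if not at:
        userinfo = None
    if hostport.startswith('['):
        # IP-literal such as [::1]; the colons inside belong to the host
        host, _, rest = hostport.partition(']')
        host += ']'
        port = rest[1:] if rest.startswith(':') else None
    else:
        host, colon, port = hostport.partition(':')
        if not colon:
            port = None
    return userinfo, host, port


print(split_authority('example.com:8042'))
```

The existing result objects already expose some of this through the
username, hostname and port attributes, so this is mostly a question of
naming.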

I am +1 on any helpful improvement we can make to this module. But we have
often seen that even the slightest change in parsing behavior causes harm
and brings us more bug reports.

A new function that provides this behavior is also a good idea.
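
As a minimal sketch of what such a function might look like, the regular
expression from Appendix B of RFC 3986 already splits a URI reference into
the five generic components without any scheme-specific rules. The names
rfc3986_split and SplitURI are made up here and are not a proposal for the
actual API:

```python
import re
from collections import namedtuple

# Regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

# Hypothetical result type using the RFC's component names.
SplitURI = namedtuple('SplitURI', 'scheme authority path query fragment')


def rfc3986_split(uri):
    """Split uri into the generic components; absent parts are None."""
    m = _URI_RE.match(uri)
    # Groups 2, 4, 5, 7 and 9 hold the components per Appendix B.
    return SplitURI(*m.group(2, 4, 5, 7, 9))


print(rfc3986_split('foo://example.com:8042/over/there?name=ferret#nose'))
print(rfc3986_split('urn:example:animal:ferret:nose'))
```

Since it would be a separate function, the existing urlparse behavior stays
untouched for anyone who depends on it.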

-- 
Senthil

