[Python-Dev] Path object design

Mon Nov 6 12:04:28 CET 2006

Martin:
> It still should be possible to come up with examples for these as
> well, no? For example, if you pass a relative URI as the base
> URI, what would you like to see happen?

Until two days ago I didn't even realize that was an incorrect
use of urljoin.  I can't be the only one.  Hence, raise an
exception - just like 4Suite's Uri.py does.

> That's true. Actually, it's probably not true; it will only get fixed
> if some volunteer contributes a fix.

And it's not I.  A true fix is a lot of work.  I would rather use Uri.py,
now that I see it handles everything I care about, and then some.
Eg, file name <-> URI conversion.

> So do you think this patch meets your requirements?

# new
>>> uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
>>>

# existing
>>> urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
>>>

No.  That was the first thing I tried.  Also found

>>> urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
>>> uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
>>>

I reported these on the  patch page.  Nothing else strange
came up, but I did only try http urls and not the others.

My "requirements", meaning my vague, spur-of-the-moment thoughts
without any research or experimentation to determing their validity,
are different than those for Python.

My real requirements are met by the existing code.

My imagined ones include support for edge cases, the idna
codec, unicode, and real-world use on a variety of OSes.

4Suite's Uri.py seems to have this.  Eg, lots of edge-case
code like

    # On Windows, ensure that '|', not ':', is used in a drivespec.
    if os.name == 'nt' and scheme == 'file':
        path = path.replace(':','|',1)

Hence the uriparse.py patch does not meet my hypothetical
requirements .

Python's requirements are probably to get closer to the spec.
In which case yes, it's at least as good as and likely generally
better than the existing module, modulo a few API naming debates
and perhaps some rough edges which will be found when put into use.

And perhaps various arguments about how bug compatible it should be
and if the old code should be available as well as the new one,
for those who depend on the existing 1808-allowed implementation
dependent behavior.

For those I have not the experience to guide me and no care to push
the debate.  I've decided I'm going to experiment using 4Suite's Uri.py
for my code because it handles things I want which are outside of
the scope of uriparse.py

> This topic (URL parsing) is not only inherently difficult to
> implement, it is just as tedious to review. Without anybody
> reviewing the contributed code, it's certain that it will never
> be incorporated.

I have a different opinion.

Python's url manipulation code is a mess.  urlparse, urllib, urllib2.
Why is "urlencode" part of urllib and not urllib2?  For that matter,
urllib is labeled 'Open an arbitrary URL' and not 'and also do
manipulations on parts of URLs."

I don't want to start fixing code because doing it the way I want to
requires a new API and a much better understanding of the RFCs
than I care about, especially since 4Suite and others have already
done this.

Hence I would say to just grab their library.  And perhaps update the
naming scheme.

Also, urlgrabber and pycURL are better for downloading arbitrary
URIs.  For some definitions of "better".

                Andrew
                dalke at dalkescientific.com