[Python-bugs-list] [ python-Bugs-450225 ] urljoin fails RFC tests

SourceForge.net noreply@sourceforge.net
Thu, 12 Jun 2003 00:24:59 -0700


Bugs item #450225, was opened at 2001-08-11 22:10
Message generated for change (Comment added) made by bcannon
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=450225&group_id=5470

Category: Python Library
>Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Aaron Swartz (aaronsw)
Assigned to: Brett Cannon (bcannon)
Summary: urljoin fails RFC tests

Initial Comment:
I've put together a test suite for Python's urlparse 
module, based on the tests in Appendix C of 
RFC 2396 (the URI RFC). They're available at:

http://lists.w3.org/Archives/Public/uri/2001Aug/0013.html

The major problem seems to be that it treats 
queries and parameters as special components 
(not just normal parts of the path), making this 
related to:

http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=210834
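
To make "special components" concrete, here is a minimal sketch of how the
parser splits the RFC base URL that appears later in this thread. It assumes
Python 3, where the module lives at urllib.parse (in the Python 2 of this
report it is the top-level urlparse module); the variable names are only for
illustration.

```python
from urllib.parse import urlparse

# The base URL from Appendix C of RFC 2396, as used in the tests below.
parts = urlparse("http://a/b/c/d;p?q")
print(parts.path)    # path, with the ;params of the last segment removed
print(parts.params)  # parameters split off the last path segment
print(parts.query)   # query string, without the leading '?'

# The two relative references at the heart of this bug have an empty path:
# the parser files their content under query or params instead.
rel_query = urlparse("?y")
rel_params = urlparse(";x")
print(repr(rel_query.path), rel_query.query)
print(repr(rel_params.path), rel_params.params)
```

Because both references parse to an empty path, how urljoin treats the
"no path" case decides the outcome.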

----------------------------------------------------------------------

>Comment By: Brett Cannon (bcannon)
Date: 2003-06-12 00:24

Message:
Logged In: YES 
user_id=357491

Since there is a chance that this might break code that depends on 
urljoin behaving like RFC 1808 instead of RFC 2396, and 2.3 has 
already hit beta, I am going to wait for 2.4 before I deal with this.

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2003-05-11 17:35

Message:
Logged In: YES 
user_id=357491

mbrierst is right.  From C.1 of RFC 2396 (with http://a/b/c/d;p?q as the 
base):

    ?y            =  http://a/b/c/?y
    ;x            =  http://a/b/c/;x

And notice how this contradicts RFC 1808 (with <URL:http://a/b/c/d;p?q#f> as 
the base):

    ?y         = <URL:http://a/b/c/d;p?y>
    ;x         = <URL:http://a/b/c/d;x>

So there is clearly a conflict here.  Since RFC 2396 says "it revises and 
replaces the generic definitions in RFC 1738 and RFC 1808" (where "generic" 
just means the actual syntax), RFC 2396's resolution rules should take 
precedence.
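
The disagreement can be tabulated in a few lines. This is only an
illustrative sketch: the expected strings are copied from the two RFC
excerpts above (plus the uncontroversial 'g' case from the test suite), and
the dictionary and function names are made up for this example.

```python
# Expected results of joining against the base http://a/b/c/d;p?q
# (RFC 1808 adds a #f fragment to its base, which does not affect these).
RFC1808_EXPECTED = {
    "?y": "http://a/b/c/d;p?y",
    ";x": "http://a/b/c/d;x",
    "g": "http://a/b/c/g",
}
RFC2396_EXPECTED = {
    "?y": "http://a/b/c/?y",
    ";x": "http://a/b/c/;x",
    "g": "http://a/b/c/g",
}

def conflicting_refs(first, second):
    """Return the relative references on which the two tables disagree."""
    return sorted(ref for ref in first
                  if ref in second and first[ref] != second[ref])

for ref in conflicting_refs(RFC1808_EXPECTED, RFC2396_EXPECTED):
    print(ref, RFC1808_EXPECTED[ref], RFC2396_EXPECTED[ref])
```

Only the two path-less references show up; every reference that carries a
path segment resolves identically under both RFCs.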

Now the issue is whether applying the patch is the right thing to do (I am 
setting aside whether the patch itself is correct; I have not tested it yet).  
This shouldn't break anything, since the whole point of urlparse.urljoin is to 
give users an abstracted way to build URIs without having to worry about all 
of these rules.  So I say that it should be changed.

Fred, do you mind if I reassign this patch to myself and deal with it?

----------------------------------------------------------------------

Comment By: Michael Stone (mbrierst)
Date: 2003-02-03 13:02

Message:
Logged In: YES 
user_id=670441

The two failing tests could not pass because RFC 1808 and RFC 2396 seem to conflict when a relative URI is given as just ;y or just ?y.

RFC 2396 claims to update RFC 1808, so presumably it describes the correct behavior.  The patch in this message (I can't upload it on SourceForge here for some reason) brings urljoin's behavior in line with RFC 2396, and changes the appropriate test cases.  I think if you apply this patch this bug can be closed.  Let me know what you think.


Index: python/dist/src/Lib/urlparse.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/urlparse.py,v
retrieving revision 1.39
diff -c -r1.39 urlparse.py
*** python/dist/src/Lib/urlparse.py	7 Jan 2003 02:09:16 -0000	1.39
--- python/dist/src/Lib/urlparse.py	3 Feb 2003 20:51:08 -0000
***************
*** 157,169 ****
      if path[:1] == '/':
          return urlunparse((scheme, netloc, path,
                             params, query, fragment))
!     if not path:
!         if not params:
!             params = bparams
!             if not query:
!                 query = bquery
          return urlunparse((scheme, netloc, bpath,
!                            params, query, fragment))
      segments = bpath.split('/')[:-1] + path.split('/')
      # XXX The stuff below is bogus in various ways...
      if segments[-1] == '.':
--- 157,165 ----
      if path[:1] == '/':
          return urlunparse((scheme, netloc, path,
                             params, query, fragment))
!     if not (path or params or query):
          return urlunparse((scheme, netloc, bpath,
!                            bparams, bquery, fragment))
      segments = bpath.split('/')[:-1] + path.split('/')
      # XXX The stuff below is bogus in various ways...
      if segments[-1] == '.':
Index: python/dist/src/Lib/test/test_urlparse.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/test/test_urlparse.py,v
retrieving revision 1.11
diff -c -r1.11 test_urlparse.py
*** python/dist/src/Lib/test/test_urlparse.py	6 Jan 2003 20:27:03 -0000	1.11
--- python/dist/src/Lib/test/test_urlparse.py	3 Feb 2003 20:51:12 -0000
***************
*** 54,59 ****
--- 54,63 ----
              self.assertEqual(urlparse.urlunparse(urlparse.urlparse(u)), u)
  
      def test_RFC1808(self):
+         # updated by RFC 2396
+ #        self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
+ #        self.checkJoin(RFC1808_BASE, ';x', 'http://a/b/c/d;x')
+ 
          # "normal" cases from RFC 1808:
          self.checkJoin(RFC1808_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC1808_BASE, 'g', 'http://a/b/c/g')
***************
*** 61,74 ****
          self.checkJoin(RFC1808_BASE, 'g/', 'http://a/b/c/g/')
          self.checkJoin(RFC1808_BASE, '/g', 'http://a/g')
          self.checkJoin(RFC1808_BASE, '//g', 'http://g')
-         self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
          self.checkJoin(RFC1808_BASE, 'g?y', 'http://a/b/c/g?y')
          self.checkJoin(RFC1808_BASE, 'g?y/./x', 'http://a/b/c/g?y/./x')
          self.checkJoin(RFC1808_BASE, '#s', 'http://a/b/c/d;p?q#s')
          self.checkJoin(RFC1808_BASE, 'g#s', 'http://a/b/c/g#s')
          self.checkJoin(RFC1808_BASE, 'g#s/./x', 'http://a/b/c/g#s/./x')
          self.checkJoin(RFC1808_BASE, 'g?y#s', 'http://a/b/c/g?y#s')
-         self.checkJoin(RFC1808_BASE, ';x', 'http://a/b/c/d;x')
          self.checkJoin(RFC1808_BASE, 'g;x', 'http://a/b/c/g;x')
          self.checkJoin(RFC1808_BASE, 'g;x?y#s', 'http://a/b/c/g;x?y#s')
          self.checkJoin(RFC1808_BASE, '.', 'http://a/b/c/')
--- 65,76 ----
***************
*** 103,111 ****
      def test_RFC2396(self):
          # cases from RFC 2396
  
!         ### urlparse.py as of v 1.32 fails on these two
!         #self.checkJoin(RFC2396_BASE, '?y', 'http://a/b/c/?y')
!         #self.checkJoin(RFC2396_BASE, ';x', 'http://a/b/c/;x')
  
          self.checkJoin(RFC2396_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC2396_BASE, 'g', 'http://a/b/c/g')
--- 105,113 ----
      def test_RFC2396(self):
          # cases from RFC 2396
  
!         # conflict with RFC 1808, tests commented out there
!         self.checkJoin(RFC2396_BASE, '?y', 'http://a/b/c/?y')
!         self.checkJoin(RFC2396_BASE, ';x', 'http://a/b/c/;x')
  
          self.checkJoin(RFC2396_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC2396_BASE, 'g', 'http://a/b/c/g')

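As a rough model of what the patch above changes, the sketch below
reimplements only the empty-path branch, working on plain
(path, params, query) triples. The function names are hypothetical; scheme,
netloc and fragment handling, the absolute-path branch, and the '.'/'..'
cleanup are all left out, so this is not the actual urlparse.py code.

```python
def merge(bpath, path):
    # Same idea as the 'segments' line in the diff: drop the last segment
    # of the base path, then append the segments of the relative path.
    return "/".join(bpath.split("/")[:-1] + path.split("/"))

def resolve_old(base, rel):
    """Pre-patch branch: an empty path always keeps the base path."""
    bpath, bparams, bquery = base
    path, params, query = rel
    if not path:
        if not params:
            params = bparams
            if not query:
                query = bquery
        return (bpath, params, query)
    return (merge(bpath, path), params, query)

def resolve_new(base, rel):
    """Post-patch branch: only a completely empty reference keeps the base."""
    bpath, bparams, bquery = base
    path, params, query = rel
    if not (path or params or query):
        return (bpath, bparams, bquery)
    return (merge(bpath, path), params, query)

BASE = ("/b/c/d", "p", "q")          # from http://a/b/c/d;p?q
print(resolve_old(BASE, ("", "", "y")), resolve_new(BASE, ("", "", "y")))
print(resolve_old(BASE, ("", "x", "")), resolve_new(BASE, ("", "x", "")))
```

In this model the old branch reproduces the RFC 1808 answers (d;p?y and d;x),
while the new branch falls through to the segment merge and yields the
RFC 2396 answers (/b/c/?y and /b/c/;x).
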

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-03-22 21:34

Message:
Logged In: YES 
user_id=44345

added Aaron's RFC 2396 tests to test_urlparse.py
version 1.4 - the two failing tests are commented out



----------------------------------------------------------------------

Comment By: Jon Ribbens (jribbens)
Date: 2002-03-18 06:22

Message:
Logged In: YES 
user_id=76089

I think it would be better btw if '..' components taking 
you 'off the top' were stripped. RFC 2396 says this is 
valid behaviour, and it's what 'real' browsers do.

i.e.
  http://a/b/ + ../../../d == http://a/d
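
A minimal sketch of that stripping rule, operating on an already merged
absolute path. The function name is hypothetical and this is not what
urlparse.py does; it also drops empty segments, which is a simplification.

```python
def strip_excess_dotdot(path):
    """Collapse '.' and '..' segments, discarding any '..' that would
    climb above the root instead of keeping it in the result."""
    segments = path.split("/")
    # A path ending in '/', '.' or '..' names a directory: keep the slash.
    trailing_slash = segments[-1] in ("", ".", "..")
    out = []
    for seg in segments:
        if seg in ("", "."):
            continue            # nothing to add
        if seg == "..":
            if out:
                out.pop()       # step up one level
            continue            # already at the root: silently drop it
        out.append(seg)
    result = "/" + "/".join(out)
    if trailing_slash and out:
        result += "/"
    return result

# Base path /b/ merged with ../../../d gives /b/../../../d
print(strip_excess_dotdot("/b/../../../d"))
```

With this rule the example above comes out as /d, i.e. http://a/d, rather
than keeping the leftover '..' segments in the path.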


----------------------------------------------------------------------

Comment By: Aaron Swartz (aaronsw)
Date: 2001-11-05 10:34

Message:
Logged In: YES 
user_id=122141

Oops, meant to attach it...

----------------------------------------------------------------------

Comment By: Aaron Swartz (aaronsw)
Date: 2001-11-05 10:30

Message:
Logged In: YES 
user_id=122141

Sure, here they are:



import urlparse

base = 'http://a/b/c/d;p?q'

assert urlparse.urljoin(base, 'g:h') == 'g:h'
assert urlparse.urljoin(base, 'g') ==   'http://a/b/c/g'
assert urlparse.urljoin(base, './g') == 'http://a/b/c/g'
assert urlparse.urljoin(base, 'g/') ==  'http://a/b/c/g/'
assert urlparse.urljoin(base, '/g') ==  'http://a/g'
assert urlparse.urljoin(base, '//g') == 'http://g'
assert urlparse.urljoin(base, '?y') ==  'http://a/b/c/?y'
assert urlparse.urljoin(base, 'g?y') == 'http://a/b/c/g?y'
assert urlparse.urljoin(base, '#s') ==  'http://a/b/c/d;p?q#s'
assert urlparse.urljoin(base, 'g#s') == 'http://a/b/c/g#s'
assert urlparse.urljoin(base, 'g?y#s') == 'http://a/b/c/g?y#s'
assert urlparse.urljoin(base, ';x') == 'http://a/b/c/;x'
assert urlparse.urljoin(base, 'g;x') ==  'http://a/b/c/g;x'
assert urlparse.urljoin(base, 'g;x?y#s') == 'http://a/b/c/g;x?y#s'
assert urlparse.urljoin(base, '.') ==  'http://a/b/c/'
assert urlparse.urljoin(base, './') ==  'http://a/b/c/'
assert urlparse.urljoin(base, '..') ==  'http://a/b/'
assert urlparse.urljoin(base, '../') ==  'http://a/b/'
assert urlparse.urljoin(base, '../g') ==  'http://a/b/g'
assert urlparse.urljoin(base, '../..') ==  'http://a/'
assert urlparse.urljoin(base, '../../') ==  'http://a/'
assert urlparse.urljoin(base, '../../g') ==  'http://a/g'

assert urlparse.urljoin(base, '') == base

assert urlparse.urljoin(base, '../../../g')    ==  'http://a/../g'
assert urlparse.urljoin(base, '../../../../g') ==  'http://a/../../g'

assert urlparse.urljoin(base, '/./g') ==  'http://a/./g'
assert urlparse.urljoin(base, '/../g')         ==  'http://a/../g'
assert urlparse.urljoin(base, 'g.')            ==  'http://a/b/c/g.'
assert urlparse.urljoin(base, '.g')            ==  'http://a/b/c/.g'
assert urlparse.urljoin(base, 'g..')           ==  'http://a/b/c/g..'
assert urlparse.urljoin(base, '..g')           ==  'http://a/b/c/..g'

assert urlparse.urljoin(base, './../g')        ==  'http://a/b/g'
assert urlparse.urljoin(base, './g/.')         ==  'http://a/b/c/g/'
assert urlparse.urljoin(base, 'g/./h')         ==  'http://a/b/c/g/h'
assert urlparse.urljoin(base, 'g/../h')        ==  'http://a/b/c/h'
assert urlparse.urljoin(base, 'g;x=1/./y')     ==  'http://a/b/c/g;x=1/y'
assert urlparse.urljoin(base, 'g;x=1/../y')    ==  'http://a/b/c/y'

assert urlparse.urljoin(base, 'g?y/./x')       ==  'http://a/b/c/g?y/./x'
assert urlparse.urljoin(base, 'g?y/../x')      ==  'http://a/b/c/g?y/../x'
assert urlparse.urljoin(base, 'g#s/./x')       ==  'http://a/b/c/g#s/./x'
assert urlparse.urljoin(base, 'g#s/../x')      ==  'http://a/b/c/g#s/../x'
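
A run of bare asserts stops at the first failure. A table-driven variant,
sketched below, reports every failing case instead. It assumes Python 3,
where urljoin lives in urllib.parse; the helper name and the short CASES
list (a few pairs copied from the list above) are only for illustration.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (relative reference, expected result) pairs taken from the tests above.
CASES = [
    ("g", "http://a/b/c/g"),
    ("/g", "http://a/g"),
    ("//g", "http://g"),
    ("../g", "http://a/b/g"),
]

def failing_cases(join, base, cases):
    """Return (reference, expected, actual) for each case that fails."""
    failures = []
    for ref, expected in cases:
        actual = join(base, ref)
        if actual != expected:
            failures.append((ref, expected, actual))
    return failures

for ref, expected, actual in failing_cases(urljoin, BASE, CASES):
    print("%r: expected %r, got %r" % (ref, expected, actual))
```

Passing the join function in as an argument makes it easy to run the same
table against a patched implementation.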



----------------------------------------------------------------------

Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2001-11-05 10:05

Message:
Logged In: YES 
user_id=3066

This looks like it's probably related to #478038; I'll try to
tackle them together.  Can you attach your tests to the bug
report on SF?  Thanks!

----------------------------------------------------------------------
