[New-bugs-announce] [issue37969] urllib.parse functions reporting false equivalent URIs
Géry
report at bugs.python.org
Wed Aug 28 10:54:51 EDT 2019
New submission from Géry <gery.ogam at gmail.com>:
The Python library documentation of the `urllib.parse.urlunparse <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunparse>`_ and `urllib.parse.urlunsplit <https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlunsplit>`_ functions states:
This may result in a slightly different, but equivalent URL, if the URL that was parsed originally had unnecessary delimiters (for example, a ? with an empty query; the RFC states that these are equivalent).
So with the <http://example.com/?> URI::
>>> import urllib.parse
>>> urllib.parse.urlunparse(urllib.parse.urlparse("http://example.com/?"))
'http://example.com/'
>>> urllib.parse.urlunsplit(urllib.parse.urlsplit("http://example.com/?"))
'http://example.com/'
But `RFC 3986 <https://tools.ietf.org/html/rfc3986?#section-6.2.3>`_ states the exact opposite:
Normalization should not remove delimiters when their associated component is empty unless licensed to do so by the scheme specification. For example, the URI "http://example.com/?" cannot be assumed to be equivalent to any of the examples above. Likewise, the presence or absence of delimiters within a userinfo subcomponent is usually significant to its interpretation. The fragment component is not subject to any scheme-based normalization; thus, two URIs that differ only by the suffix "#" are considered different regardless of the scheme.
So maybe `urllib.parse.urlunparse` ∘ `urllib.parse.urlparse` and `urllib.parse.urlunsplit` ∘ `urllib.parse.urlsplit` are not supposed to be used for `syntax-based normalization <https://tools.ietf.org/html/rfc3986?#section-6>`_ of URIs. But still, both `urllib.parse.urlparse` or `urllib.parse.urlsplit` lose the "delimiter + empty component" information of the URI string, so they report false equivalent URIs::
>>> import urllib.parse
>>> urllib.parse.urlparse("http://example.com/?") == urllib.parse.urlparse("http://example.com/")
True
>>> urllib.parse.urlsplit("http://example.com/?") == urllib.parse.urlsplit("http://example.com/")
True
P.-S. — Is there a syntax-based normalization function of URIs in the Python library?
----------
components: Library (Lib)
messages: 350663
nosy: Jeremy.Hylton, maggyero, orsenthil
priority: normal
severity: normal
status: open
title: urllib.parse functions reporting false equivalent URIs
type: behavior
versions: Python 3.7
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37969>
_______________________________________
More information about the New-bugs-announce
mailing list