A urlparse regression in minor version
Hello Python-Committers,
In https://bugs.python.org/issue27657, I introduced a regression in a minor release. The original patch to parsing logic of URL, cleaned up a lot of corner cases (https://github.com/python/cpython/pull/661) and I felt good about the change. However, the mistake was with the backport.
Demo:
$ ./python Python 3.8.0 (default, Feb 10 2020, 06:15:43) [GCC 9.2.1 20191008] on linux Type "help", "copyright", "credits" or "license" for more information.
$ ./python Python 3.8.1+ (heads/3.8:b086ea5edc, Feb 10 2020, 06:15:44) [GCC 9.2.1 20191008] on linux Type "help", "copyright", "credits" or "license" for more information.
When I read an associated bug report against a user of the software (like this: https://github.com/mozilla/bleach/issues/503) - I feel that this was a mistake. The change of test-suite in minor versions should have alerted me, but I seem to have missed it.
I am planning to revert this change in 3.8.2 and 3.7.7 Should I highlight this in any documentation? Thoughts and opinions?
Thank you, Senthil
I'll let others voice their opinions but my intuition for 3.8.x is to leave your patch be. True, it should not have been backported but it was, and it was already released as part of 3.8.1 and now 3.8.2rc1.
The user will have to special-case the change in behavior anyway. I feel like it is easier for them to do so by only special-casing "sys.version_info < (3, 8, 1)". The alternative if you revert your patch will be to special case both "sys.version_info < (3, 9)" and "sys.version_info == (3, 8, 1)" (let alone 3.8.2rc1). That is worse I think.
So, the only thing to do is to clearly document that there was a change, not in 3.9 but in 3.8.1 and forward.
Agreed that it's a bit of a mess for 3.7 since that was way later in the release cadence with only 3.7.6 getting the change. I will let Ned decide here.
- Ł
Thanks for sharing your thoughts.
I was originally thinking that if the problem is contained only to 3.8.1 and 3.7.6 and, reverted 3.8.2. If revert is not an option, then, I can add either of these in the documentation as the recommendation to deal with this problem.
- Special case the "sys.version_info < (3, 8, 1)" for urlparse behavior.
- Recommend making sure that urlparse is used on URL with the scheme. If the URL had a scheme ("http" or anything supported), the parsing behavior remains consistent.
It is only with schema-less URL this backport broke the parsing and upset maintainers of certain utilities (correctly).
On Tue, Feb 11, 2020 at 5:26 AM Łukasz Langa <lukasz@langa.pl> wrote:
On Thu, Feb 13, 2020 at 1:57 AM Antoine Pitrou <antoine@python.org> wrote:
This. Because the bug-fix was a bringing in backward-incompatible change that certain users are not happy with. https://github.com/mozilla/bleach/issues/503
No. The change introduced (bug-fix) was helpful in terms of consistent parsing and further improvement from maintainers pov. A safer fix might essentially mean revert to the old behavior.
The original bug(s) were about parsing expectations of an ill-defined 'scheme-less' URLs. Developers who relied on older behavior are surprised.
On Feb 11, 2020, at 05:26, Łukasz Langa <lukasz@langa.pl> wrote:
I'll let others voice their opinions but my intuition for 3.8.x is to leave your patch be. True, it should not have been backported but it was, and it was already released as part of 3.8.1 and now 3.8.2rc1.
I don’t think you should worry about 3.8.2rc1. That’s not a release version so shouldn’t impact the decision IMHO.
Whether 3.8.1 is important enough not to revert this change is an RM decision.
-Barry
Hi Łukasz,
As we have to a decision here, my vote is to revert the patch in 3.8.2 and 3.7.7 I have gone back-and-forth with this thinking, and it seems revert might address some definite complaints we have got. The problem is contained to single version, and users can upgrade to the next one.
Thanks, Senthil
On Fri, Feb 14, 2020 at 8:14 AM Łukasz Langa <lukasz@langa.pl> wrote:
On Feb 15, 2020, at 09:00, Senthil Kumaran <senthil@uthcode.com> wrote:
On Fri, Feb 14, 2020 at 8:14 AM Łukasz Langa <lukasz@langa.pl> wrote:
Ned, what are you doing with this for 3.7.7? Reverting?
Ugh!
As others have noted, urlparse is a big can of worms. I am certainly not a subject expert but, from some investigation and thinking about it, it seems to me that we kinda brought this on ourselves by allowing the scheme part (e.g. "https:" or "ftp:" or "any-old-scheme:" etc) of the urlstring parameter to be optional:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
therby introducing the ambiguity of whether a string like "localhost:80" denotes a relative url with a user-defined scheme of "localhost" and a path of "80" (as it now does with the changes for bpo-27657 introduced in 3.8.1 and 3.7.6):
urlparse("localhost:80") ParseResult(scheme='localhost', netloc='', path='80', params='', query='', fragment='')
or denotes a relative url with no scheme and a path of "localhost:80" (as happened in previous releases):
urlparse("localhost:80") ParseResult(scheme='', netloc='', path='localhost:80', params='', query='', fragment='')
With an explicit scheme, in either case you get what you would expect - an absolute url:
urlparse("http://localhost:80") ParseResult(scheme='http', netloc='localhost:80', path='', params='', query='', fragment='')
AFAICT the intent of the original RFCs was to require an explicit scheme in a urlstring, thus avoiding any ambiguity. But the now universal practice of web browsers supplying a default http: or https: scheme for (partial) urls typed into a location bar has understandably changed user expectations to often be that schemes are optional when the scheme is clear in context.
So it seems to me that there is no one obviously correct behavior here. Judging from the comments and the reports of broken packages, many users are clearly used to using urlparse with schemeless urlstrings even if they aren't truly conformant URLs and even with the at first glance unintuitive way they were parsed by urlparse; for example, there is this snippet in the third-party requests package:
# urlparse is a finicky beast, and sometimes decides that there isn't a
# netloc present. Assume that it's being over-cautious, and switch netloc
# and path if urlparse decided there was no netloc.
if not netloc:
netloc, path = path, netloc
OTOH, there are also undoubtedly users who want a urlparser that more strictly parses schemeless URLs, which is now the behavior as of 3.8.1 and 3.7.6, again, even if the new behavior is also unintuitive.
I don't see how we can satisfy both use cases without changing the API somehow. And there may be other use cases.
The good news is that, AFAICT from a quick survey, the change didn't affect urllib.urlopen or thrid-party urllib3 or requests. But from the "me-toos" on the bpo issue and the PR, it's clear that we broke stuff downstream and it seems that most of those users are waiting for a resolution from us and likely would prefer to stick to the previous behavior.
So my take is that we should revert the 3.7 changes (bpo-27657 / PR 16837 / 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f ). Senthil, please go ahead and do so for the 3.7 branch. Thanks!
While it's not my call, I think we should also revert for 3.8.2.
For 3.9.0, I recommend we reconsider this change (temporarily reverting it) and consider whether an API change to accommodate the various use cases would be better; perhaps something like adding a new parameter to urlparse to indicate whether urlstrings should be parsed like webbrowser "urls" (and defining exactly what that means) and also review the many remaining open urlparse bpo issues to look for commonalities. (Perhaps that could be a post-3.9 GsoC project?)
Thoughts?
In any case, ugh!
-- Ned Deily nad@python.org -- []
On Sun, Feb 16, 2020 at 2:20 AM Ned Deily <nad@python.org> wrote:
For 3.9. - I am ready to defend the patch even at the cost of the breaking of the parsing of undefined behavior. We should keep it. The patch simplifies a lot of corner cases and fixes the reported bugs. We don't guarantee backward compatibility between major versions, so I assume users will be careful when relying upon this undefined behavior and will take corrective action on their side before upgrading to 3.9.
We want patch releases to be backward compatible. That was the user-complaint.
Thanks, Senthil
FWIW, I agree with Senthil here. A slight behaviour change in 3.9 is fine, especially in an area where the "right" semantics are not immediately obvious. What we want to avoid is breaking behaviour changes in bugfix releases.
Regards
Antoine.
Le 16/02/2020 à 13:13, Senthil Kumaran a écrit :
On Feb 16, 2020, at 07:21, Antoine Pitrou <antoine@python.org> wrote:
I agree totally that we don't want to break behavior in bugfix releases and I have no problem with making breaking changes in feature releases (3.9.0) as warranted.
My point was that, after looking at this a bit, it seems to me that making this change does not address some of the underlying problems with the urlparse API and that it makes things *much* worse for the many users who are understandably expecting urlparse to sanely handle schemaless urlstrings, the most commonly seen urls format today.
Note that we strongly imply that we sanely handle them by offering the "scheme=" paramater to urlparse. Another example: prior to 3.7.6 and 3.8.1:
urlparse("www.google.com:8080", scheme="http") ParseResult(scheme='http', netloc='', path='www.google.com:8080', params='', query='', fragment='')
That isn't what users would expect; what they would expect is how things work with an explicit scheme (note the swapping of netloc and path).
urlparse("https://www.google.com:8080", scheme="http") ParseResult(scheme='https', netloc='www.google.com:8080', path='', params='', query='', fragment='')
But at least there is a relatively simple workaround that users have discovered as witnessed by the requests code snippet I cited earlier: use the path field if netloc is empty.
Now with the change in 3.8.1 and 3.7.6, the behavior is very different and pretty useless even with an explicit scheme="http" parameter:
urlparse("www.google.com:8080", scheme="http") ParseResult(scheme='www.google.com', netloc='', path='8080', params='', query='', fragment='')
i.e. www.google.com://8000
While that may be what strict adherence to the RFC dictates, most users aren't going to expect or desire results like that. So while the change may fix some cases, it's only making matters worse. What kind of workaroud do you use for that result?
In another open issue concerning a different urlparse issue, Victor noted that (4 months ago) "there are 124 open issues with "urllib" in their title and 12 open issues with "urlparse" in their title" and hit a bit of a dead end with a proposed fix.
https://bugs.python.org/issue36338#msg355322
Rather than continuing this change in 3.9 introducing yet another, even more unexpected behavior, I think we should first try to address what appears to me to be the (a?) root cause issue: urlparse's API is not suited for parsing both strictly RFC-compliant URLs (which are clearly not well-understood) *and* today's schemeless URLs as have evolved over the years to become the most commonly encountered form of URL. Users want and need both. The merged change makes the previous situation worse, IMHO.
Le 16/02/2020 à 13:13, Senthil Kumaran a écrit :
-- Ned Deily nad@python.org -- []
On Feb 16, 2020, at 10:20, Ned Deily <nad@python.org> wrote:
Rather than continuing this change in 3.9 introducing yet another, even more unexpected behavior, I think we should first try to address what appears to me to be the (a?) root cause issue: urlparse's API is not suited for parsing both strictly RFC-compliant URLs (which are clearly not well-understood) *and* today's schemeless URLs as have evolved over the years to become the most commonly encountered form of URL. Users want and need both. The merged change makes the previous situation worse, IMHO.
ISTM that the tension between doing the right thing and keeping backward compatibility should be explored through the addition of a new function with the intended semantics, or at least a new parameter (keyword-only?) that controls the behavior. I don’t like the latter as much, but if you really want to keep a single functional interface, that would be a way to do it.
I don’t agree that it’s obviously okay to introduce a backward incompatible default behavior in 3.9.
Cheers, -Barry
I have created the PRs for the revert in 3.8.2 and 3.7.7
https://github.com/python/cpython/pull/18525 (3.8) - Acceptance definitely depends on the RM ( Łukasz' s call) and I will support it. https://github.com/python/cpython/pull/18526 (3.7) - Let this be reverted only if 3.8 gets reverted.
@Ned -
Note that we strongly imply that we sanely handle them by offering the "scheme=" parameter to urlparse.
It is a good idea. I will dig deeper with the help of test cases for 3.8. ( At the moment, given the tests cases the only test case which obviously fails the expectation in 'localhost:8000') But it seems that in the bug report, it was discussed and analyzed before the call was taken to simplify the logic of parsing support. So, for 3.9 use case, we can revisit this decision separately and i will account for the points that you have brought up.
If do not end up reverting in 3.8.2 - at the moment, my approach will be to document the behavior and encourage users to make sure that URL's have a scheme for the valid parsing behavior between 3.7, 3.8, and 3.9
On Sun, Feb 16, 2020 at 4:21 AM Antoine Pitrou <antoine@python.org> wrote:
Thank you all. This is reverted in 3.8.2 and 3.7.7 - with NEWS explaining the behavior clearly. For the points that Ned brought up for 3.9, I will continue in the bug report https://bugs.python.org/issue27657 This discussion can be considered complete and closed.
Thanks for the support.
On Sun, Feb 16, 2020 at 12:13 PM Łukasz Langa <lukasz@langa.pl> wrote:
participants (5)
-
Antoine Pitrou
-
Barry Warsaw
-
Ned Deily
-
Senthil Kumaran
-
Łukasz Langa