On Sun, Jul 21, 2019 at 2:50 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Jul 21, 2019, at 14:13, Barry <barry@barrys-emacs.org> wrote:
>>> On 21 Jul 2019, at 19:03, Steven D'Aprano <steve@pearwood.info> wrote:
>>> On Sun, Jul 21, 2019 at 08:48:49AM +0100, Barry Scott wrote:
>>> I took a very quick look at bpo30500 and was struck by the comment
>>> that the code was working on a URL that had not been validated.

The more worrisome comment is the one from Victor Stinner, where he did not know whether a change was intentional or not (https://bugs.python.org/msg296422). That showed a real disconnect between the code and the spec. A parsing library should be able to help with that, provided it is implemented correctly. So I think we are in agreement in that regard.

>>> Validation of the URL would reject the URL before the parsing happens
>>> in this case. Was that the case?
>> Sorry, can you elaborate on that? How do you validate a URL without
>> attempting to parse it? You're surely not talking about looking it up in
>> a whitelist are you?
> I was thinking about ensuring that the characters in the URL are from the subset that is allowed. \n is not allowed, for example. Yes, I agree you have to try to parse it.

For a spec that has different sets of restricted characters for different parts, that kind of prevalidation doesn’t seem like it would get you very far. At least a priori, if there are attacks that involve using illegal characters in the netloc or the path or the scheme or whatever, they could just as easily be characters that are legal elsewhere in the URL as characters that happen not to be legal anywhere.
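
A minimal sketch of that limitation, under stated assumptions: the character set below is only an approximation of the unreserved and reserved characters of RFC 3986 plus "%", and has_only_url_chars is a hypothetical helper, not part of urllib or any other library. The second URL is purely illustrative.

```python
import string

# Assumption: approximates the characters RFC 3986 allows somewhere in a URI
# (unreserved + gen-delims + sub-delims + "%" for percent-encoding).
URL_CHARS = frozenset(
    string.ascii_letters + string.digits
    + "-._~"          # unreserved punctuation
    + ":/?#[]@"       # gen-delims
    + "!$&'()*+,;="   # sub-delims
    + "%"
)


def has_only_url_chars(url: str) -> bool:
    """Hypothetical prevalidation: True if every character is legal
    *somewhere* in a URL, regardless of which component it sits in."""
    return all(ch in URL_CHARS for ch in url)


# A newline is not legal anywhere, so a whole-string check rejects it.
print(has_only_url_chars("http://example.com/\nHost: evil"))  # False

# "#" and "@" are each legal in some component, so this string passes,
# even though "#" appearing before "@" is exactly the kind of input that
# only component-aware parsing can judge.
print(has_only_url_chars("http://127.0.0.1#@evil.com/"))  # True
```

The check catches characters that are illegal everywhere, but it has nothing to say about a character that is legal in one component and misplaced in another.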

Yes, absolutely. For simple cases, I just don't see how str.split or re.match can be brushed aside. But for more complex cases, proper parsing will certainly help with validation.
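
As a rough sketch of where I would draw that line (get_scheme and naive_host are hypothetical helpers written for this message, and the regex is my own approximation of the scheme syntax, not code from any library):

```python
import re
from urllib.parse import urlsplit

# Assumption: a scheme is a letter followed by letters, digits, "+", "." or "-".
SCHEME_RE = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def get_scheme(url):
    """Simple case: re.match is enough to pull the scheme off the front."""
    m = SCHEME_RE.match(url)
    return m.group(1).lower() if m else None


def naive_host(url):
    """Naive case: take whatever sits between '//' and the next '/'."""
    return url.split("//", 1)[1].split("/", 1)[0]


print(get_scheme("HTTP://example.com/"))       # http
print(naive_host("http://example.com/a/b"))    # example.com

# Once userinfo and a port show up, the split-based helper returns the
# whole authority rather than the host, while a real parser separates them.
url = "http://user@example.com:8080/path"
print(naive_host(url))                         # user@example.com:8080
print(urlsplit(url).hostname)                  # example.com
```

The split and regex versions are fine as long as the input stays within the shape they were written for; the last two lines show how quickly that stops being true.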

FYI, my current proof-of-concept parser is at ~300 lines of code, including debugging trace support. Other than performance (which I don't intend to tackle in my library very soon), are there any other concerns that I have missed? At the moment, I am still of the opinion that the goal raised in this thread is very attainable, and should be considered.