URLs/URIs + pathlib.Path + literal syntax = ?

The 'Working with Path objects: p-strings?' thread spawned a discussion about how URLs (or, more generally, URIs) and Paths should work together. I suggest we move that discussion to this new thread. The concept is 'explained' below in this email and its quotes, but a little bit of discussion happened in the other thread too.
While I think that the decisions about p-strings (or a-strings for addresses, or whatever they should be) should keep URIs in mind, it is premature to add the Path+URI fusion to the stdlib. I agree with Paul Moore that this URL stuff should be on PyPI first. It could even be a library that monkey-patches pathlib to accept URIs, or a URI library that instantiates Path objects when appropriate. Then there could be a smooth transition into the stdlib some day.
See all the stuff below:
On Tue, Mar 29, 2016 at 11:44 AM, Sven R. Kunze <srkunze@mail.de> wrote:
Again, that you say you thought about it too perhaps means it's worth discussing :).
Just for the record: "Path" might not be the most correct wording. There is a "file://" scheme which identifies locally located files. So, paths are basically a subset of URLs, functionality-wise. Thus, a better/more generic name would be "URL", "URI", "Link" or the like, in order to avoid confusing later generations. However, I think I could live with Path.
Yes, these are concerns that should be considered if/when deciding whether to make URI/URLs a subclass of Path or the other way around, or something else. Anyway, since Path(...) already instantiates different subclasses based on the situation, having it instantiate a URI in some cases would not be completely unnatural.
As suggested by Stephen, I've been looking into RFC 3986 as a whole, and instantiating both URIs and fs paths from p-strings does not seem completely impossible. Some points below (you can skip past them if you have to, there's more general discussion at the end):
- Only some URIs (or even URLs) can be reliably distinguished from file paths. However, those that contain '://' could be automatically turned into URI objects by p-strings [or Path(...)]. I suspect that would cover the majority of use cases. (The unambiguous cases would be exactly those URIs that contain an 'authority' component -- these always begin with 'scheme://' while others don't.)
- If we want to allow URIs without an 'authority' component, like 'mailto:someone@domain.com', they should be explicitly instantiated as URI objects.
- Some terminology: There are indeed 'URIs' and 'relative references'. Relative references are essentially the URI equivalent of relative paths. Then there are 'URI references', which can be either 'URIs' or 'relative references' (kinda like general paths, which can be absolute or relative, as in pathlib).
- Instantiating relative URI references with Path(...) or p-strings may cause issues, because they might get turned into Windows paths and the like. It does seem like this could be worked around, for instance by making another class like "RelativePath" or "RelativeRef", but there are some questions about when/how these should be instantiated. This may lead to a need for slight backwards incompatibilities if implemented within pathlib.
- "Queries" like '?this=that' after the path component have a special role in URIs, but in file system paths they can be parts of the file (or even directory) name. This might again be ambiguous when using relative paths / references. This could perhaps be dealt with by requiring more explicit handling when joining relative paths / references together.
- "Fragments" like '#what'. This is essentially the same issue as with queries above and should be solved the same way. Anyway, both may be present at the same time.
- '..' and '.' in relative paths / references. In URIs, there's a difference between 'scheme://foo/bar/' and 'scheme://foo/bar'. Merging the relative reference './baz' to the former gives 'scheme://foo/bar/baz', while merging it to the latter gives 'scheme://foo/baz' (the last segment of the base path is dropped before merging). I kinda wish the same thing was the standard with filesystem paths too.
- Percent encoding of URIs: quite obvious -- should not be done before it is unambiguous that we are dealing with a URI. Perhaps it should be done only when the resource is accessed or when the URI is exported to a plain str or bytes etc. I suppose this is a matter of what we would want in the repr.
- I may still have missed or forgotten something.
So, also with paths, especially relative ones, a library should "resist the temptation to guess", and carry around all the information until the context becomes unambiguous. For instance, when merging a relative reference with an explicit URI, the ambiguities about ?query and #fragment and about resolving the merged path disappear.
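The 'scheme://' heuristic from the first point could be sketched roughly like this. It is only an illustration: the helper name looks_like_authority_uri is made up here, and the pattern simply follows the RFC 3986 grammar for a scheme (a letter followed by letters, digits, '+', '-' or '.').

```python
import re

# RFC 3986 scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ).
# Requiring "://" right after it selects URIs that carry an authority component.
_SCHEME_AUTHORITY = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def looks_like_authority_uri(address: str) -> bool:
    """Hypothetical helper: True if the string starts with 'scheme://'."""
    return _SCHEME_AUTHORITY.match(address) is not None


for candidate in (
    "https://mysite.com/somepage.html",  # has an authority: unambiguous
    "mailto:someone@domain.com",         # URI without authority: not detected
    "/etc/hosts",                        # plain POSIX path
    "C:\\Users\\me",                     # Windows path with a drive letter
):
    print(candidate, looks_like_authority_uri(candidate))
```

Anything the helper rejects would still have to be instantiated explicitly as a URI, as the second point suggests.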
Another thought: requesting URLs. Basically the same as p'/etc/hosts'.write_text(secret). It's really important to have a dead simple library which is able to work with URLs. So, if I could do:
Good idea. When I suggested extending Paths (and p-strings) to work with URLs, I indeed meant that it would be an instance of (a subclass of) Path, so that you do the same as with filesystem path objects:
p'https://mysite.com/somepage.html'.read_text()
or
(p'https://mysite.com' / page).read_text()
But who knows what we might end up with if we go down this path. And I mean a metaphorical path here, not necessarily Path :). Whatever it is, it probably can't be added to the stdlib right away. Still, we could take some measures regarding the language and stdlib now, to prepare for the future.
-Koos

On 29/03/2016 16:42, Koos Zevenhoven wrote:
Yes, but then there is a scope problem: are we providing just the parsing, or also convenience methods to access the resource? E.g., you suggested:
url('http://foo.com').get()
For an ftp URL, what would you do? Ssh? Why would Path have them and not Http? Why http and not ftp? Why ftp and not mailto:? And if we do implement get() for http, then urllib? Or request? But then what about http 2? What about asyncio? This needs to be sorted out first. Although I do think URLs are very important, as I'm a web dev, integrating p"http://foo.com".get() seems dangerous. We don't know how the web is going to move, and it's moving fast, while the stdlib is slow.

On Tue, Mar 29, 2016 at 6:07 PM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
That's the nice thing about URIs. They tell you if it's http or ftp, so the library can decide how to do a read_text() or whatever, if it is something that makes sense for that kind of URI. Already with filesystem Paths you have situations where you can't do some operation, for instance because of permissions, even if the Path points to something that exists, and still those methods exist. They will just fail. That's life. Trying to read_text() on a mailto URI should fail too.
And if we do implement get() for http, then urllib? Or request? But then what about http 2? What about asyncio?
Yes, the asyncio / blocking io is a whole other issue. In fact, I started a thread about that almost a year ago, but I think my timing was really bad, since the async/await PEP 492 was just about to be accepted.
I completely agree it's important to try to look into the future. However, as long as we believe the meaning of read_text() or get() will not change, how much harm can we do? I'm sure reading a text file or query from a URL is not going to disappear any time soon. - Koos
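A rough sketch of the "let the scheme decide, and fail where it makes no sense" idea from this reply. Everything here is an assumption made for illustration: the READERS table, the helper names and the choice of NotImplementedError are not part of pathlib or of any proposal in this thread, and the HTTP branch just uses the blocking urllib.request.urlopen.

```python
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import urlopen


def _read_file(address: str) -> str:
    # No scheme: treat the address as an ordinary file-system path.
    return Path(address).read_text()


def _read_http(address: str) -> str:
    # Blocking fetch; falling back to UTF-8 is an assumption of this sketch.
    with urlopen(address) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Hypothetical dispatch table: scheme -> function that knows how to read text.
READERS = {"": _read_file, "http": _read_http, "https": _read_http}


def read_text(address: str) -> str:
    """Read text from an address, or fail if the scheme cannot be read."""
    scheme = urlsplit(address).scheme
    try:
        reader = READERS[scheme]
    except KeyError:
        raise NotImplementedError(
            f"read_text() is not supported for {scheme!r} addresses"
        ) from None
    return reader(address)
```

Note that a Windows path such as 'C:\data.txt' would be misread as having the scheme 'c' here, which is exactly the kind of ambiguity discussed above.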

On 03/29/2016 07:42 AM, Koos Zevenhoven wrote:
Pathlib is already complicated; unless we would be doing the same types of operations, and have the same mental model, there would be no point in trying to support URIs with Pathlib. If (a big if) we do add URI support, there would be no danger of mixing file paths with URI paths as you must specify which one you want:
- WindowsPath
- PosixPath
- UriPath
--
~Ethan~

On Tue, Mar 29, 2016 at 6:31 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
Agreed. I think the main benefits besides flexibility would indeed be to be able to use the same mental model (and added syntax?) with Paths and URIs as long as you do operations that make sense with the different 'addresses'. If you need something more specific, the subclasses can still add whatever features make sense for the given scheme/protocol. - Koos

That's some post. Thanks a lot for collecting all that stuff. On 29.03.2016 16:42, Koos Zevenhoven wrote:
:)
You are right, of course. I thought too narrowly in this case.
Well done.
I agree. 'http' and 'https' would make the majority of schemes, when it comes to the Web. 'ftp', 'ssh' and 'mailto' might follow.
All of this makes me think that it MIGHT be better to leave the decision of whether it's a *real path*, a *URL* or a *URL path* to the user. Not sure if we can handle this lazily but I CAN imagine some "confusion" *sounding like PEP 428*. If we can find an unambiguous solution, that'll be awesome and would simplify a lot.
Good point. URIs should be able to handle both inputs. So, we would need to decide on a canonical form.
Another point for "let the user decide".
Ah, I see. Well, that's one approach. On the other hand, I can imagine a lot of people willing to do a "PUT", "DELETE" or "POST" (and the rather unknown other ones). It seems to me that a one-to-one mapping would be easier here instead of retrofitting. Although read_text might come in handy as an alias for "GET". :) That is when you don't care if you read locally or remotely. So, I can see room for this.
But who knows what we might end up with if we go down this path. And I mean a metaphorical path here, not necessarily Path :).
Let's see where this path leads us. ;) Best, Sven

On Tue, Mar 29, 2016 at 6:54 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Yeah, if subclassing Path, when the URI object gets instantiated, then the object can have all kinds of specialized methods and behavior. I mean, when the address string starts with 'https://' [*], it can instantiate a HttpsURI object or something, which provides Path stuff like .read_text() as well as more https-specific methods. [*] "https://" does not make a whole lot of sense in (the beginning of) a file-system path, even if pathlib.Path currently pretends that it does.
Yes :). - Koos

On Tue, Mar 29, 2016, at 12:44, Koos Zevenhoven wrote:
What about a read_json? Should the underlying request set Accept types accordingly (read_json demands text/json or application/json, read_text demands text/plain? what if I want HTML? Need another method for that?) What if content-transfer-encoding isn't set, or the data contains bytes that aren't valid for it?

On Wed, Mar 30, 2016 at 9:17 AM, Michael Selik <mike@selik.org> wrote:
How is that "even worse"? It's the exact same thing. (You might need "mkdir -p" to make this work, as it'll need to create more than one directory.) You have a directory called "ftp:", and inside that, a directory called "www.example.com".
rosuav@sikorsky:~/tmp$ mkdir -p ftp://www.example.com
rosuav@sikorsky:~/tmp$ find .
.
./ftp:
./ftp:/www.example.com
rosuav@sikorsky:~/tmp$
But this is a case where the weird edge cases can be dealt with specially, IMO. There are a *lot* of programs which cannot easily handle a file name that begins with a hyphen - the solution is to force a different interpretation, either with a double-hyphen "end of options", or by using "./-rf" as the file name. The same could be done here; in the extremely rare situation where you actually want to start your path with "http:" and then another directory, you have three options:
1) Path("./http://www.example.com")
2) Path("http:/www.example.com")
3) Path("file://http://www.example.com")
For scripts that need 100% dependable parsing, the third option will be guaranteed to work. For normal usage, compressing the double slash to a single one will have the right effect AND canonicalize the name. This should be safe.
ChrisA
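For reference, this is how today's pathlib already treats options 1 and 2. The sketch uses PurePosixPath so that it behaves the same on every platform and touches no files; it says nothing about how a future URI-aware Path would behave.

```python
from pathlib import PurePosixPath

option_1 = PurePosixPath("./http://www.example.com")
option_2 = PurePosixPath("http:/www.example.com")

# The leading "./" is dropped and the double slash collapses to a single one,
# leaving a relative path made of the directories "http:" and "www.example.com".
print(option_1)        # http:/www.example.com
print(option_1.parts)  # ('http:', 'www.example.com')
print(option_1 == option_2)  # True
```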

Chris Angelico writes:
No, the third should crap out with a syntax error on the colon, see [1], which does not allow a port spec at all, and RFC 3986, which doesn't allow colon in the host name ([1] references RFC 3986 for the syntax of the host name). Specifying the host to a "file:" URI gives locally-defined behavior (eg, a Windows share), but in the most recent attempt to deal with exactly these issues[1], it is legal. The correct syntaxes per [1] and RFC 3986 are
4) Path("file:///http://www.example.com")
5) Path("file://localhost/http://www.example.com")
6) Path("file://[127.0.0.1]/http://www.example.com")
7) Path("file://[::1]/http://www.example.com")
As far as I can tell the colon in "http:" is RFC 3986-legal, since it has no URI syntactic meaning in the path component.
This isn't as easy as it looks (which is why people are trying to delegate it to something they think of as "simple").
There's an additional problem with trying to cram URIs and Path together, which is that in a file system, "/a/b/symlink/../c" may refer to any file system object depending on symlink's target which is unknown, while as a URI path it refers to whatever "/a/b/c" refers to, and nothing else. (This is the semantic glitch I was thinking of earlier.) This means that URIs can be canonicalized syntactically, while doing so with file system paths is risky.
Footnotes:
[1] https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme-06
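To see where the pieces end up, here is how Python's urllib.parse.urlsplit splits option 3 and two of the corrected forms. urlsplit is a lenient splitter rather than a validating parser, so it does not reject option 3; it just shows that "http:" lands in the authority slot instead of the path.

```python
from urllib.parse import urlsplit

option_3 = urlsplit("file://http://www.example.com")
option_4 = urlsplit("file:///http://www.example.com")
option_5 = urlsplit("file://localhost/http://www.example.com")

# Option 3: "http:" is taken as the authority (host plus an empty port).
print(repr(option_3.netloc), repr(option_3.path))  # 'http:' '//www.example.com'

# Options 4 and 5: empty or explicit host, and the whole name stays in the path.
print(repr(option_4.netloc), repr(option_4.path))  # '' '/http://www.example.com'
print(repr(option_5.netloc), repr(option_5.path))  # 'localhost' '/http://www.example.com'
```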

On Wed, Mar 30, 2016 at 3:06 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Oops, my bad - I forgot about the third slash. It comes to the same thing, though; for most paths, you can deduce that a prefix "http://" implies that it's not a file path, and for the rare case when you do mean that, you can explicitly adorn it. (This is a good reminder that code and specs should not be created by one person firing off an email. This needs someone - preferably multiple people - checking the appropriate specs. Get it right, folks, don't trust me!)
Or there are two operations: canonicalizing by components, and rendering a "true path", which requires file system access (stat every component). ChrisA
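The two operations can be contrasted with existing stdlib functions: os.path.normpath works on the components only, while os.path.realpath consults the file system. A small sketch, assuming a platform where symlinks can be created; the directory names mirror the foo/bar and baz layout used elsewhere in this thread.

```python
import os
import tempfile


def compare_canonicalizations():
    """Return (lexical, resolved) for '<root>/baz/../question.txt'."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        os.makedirs(os.path.join(root, "foo", "bar"))
        # baz -> foo/bar (relative symlink, like "ln -s foo/bar baz")
        os.symlink(os.path.join("foo", "bar"), os.path.join(root, "baz"))

        raw = os.path.join(root, "baz", "..", "question.txt")
        lexical = os.path.normpath(raw)   # collapses "baz/.." textually
        resolved = os.path.realpath(raw)  # follows the symlink first
        return (os.path.relpath(lexical, root), os.path.relpath(resolved, root))


lexical, resolved = compare_canonicalizations()
print(lexical)   # question.txt
print(resolved)  # foo/question.txt on POSIX
```

The two answers differ, which is why purely syntactic canonicalization is risky for file-system paths.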

Chris Angelico writes:
On Wed, Mar 30, 2016 at 3:06 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
That's not to my taste because while you do need to make that choice for filesystem paths, it's always safe (in the sense of "you're buggy, not me!") to canonicalize a URI. Also, RFC 3986 explicitly refuses to require URIs to make any physical sense, while it's an error for a filesystem path to refer to a non-existent object. In other words, URIs are abstract syntax to which you can assign whatever semantics you want (as long as they are compatible with the syntax), while filesystem paths are actual paths in a graph that is instantiated in hardware storage. I suspect you won't have a problem with that distinction, but ISTM that it's the exact opposite of the way the "let's derive Path from str" (or from "BaseString" or whatever) crowd want to think (filesystem paths are a subset of strings).

On Wed, Mar 30, 2016 at 6:42 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Yes, this is true; for a non-file-system URI, the "true path" would simply be identical to the component-based canonicalization.
Hmm. Not sure what you mean by that error - how else could you create a new file, than by identifying a path that does not exist? You're absolutely right that URIs have no specific physical meaning, yet they do have definite semantic meaning; an HTTP client can (and should) resolve relative URLs itself, rather than simply sticking "../../static/main.css" on the end of the current URL.
Even if a file system path is a special form of string, this will work. It's just that the semantics would have to be defined sans canonicalization, with two methods that do that. But I am liking the line of thinking (not posted in this exact thread, so I'm not sure who said it - sorry!) that Paths are as abstract as integers and text strings, and that they should have no particular tie to their mechanical implementation. The string "/home/rosuav/foo.py" should be more like a "path display" (by analogy with lists etc) than an actual Path. ChrisA
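The relative-URL point can be illustrated with urllib.parse.urljoin, which performs the RFC 3986 style resolution that an HTTP client is expected to do. The base URL below is a made-up example; note that urljoin only resolves references for schemes it knows about, such as http.

```python
from urllib.parse import urljoin

# Hypothetical page that links to its stylesheet with a relative reference.
base = "http://www.example.com/a/b/page.html"
reference = "../../static/main.css"

# Resolving the reference against the base drops "page.html" and then
# removes one directory per ".." segment.
print(urljoin(base, reference))  # http://www.example.com/static/main.css

# Naively sticking the reference on the end gives something quite different.
print(base + reference)
```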

On 30 March 2016 at 05:23, Chris Angelico <rosuav@gmail.com> wrote:
I presume you're deliberately ignoring the fact that most paths come from variables, not from literals, and many come from user input (sometimes even unintended user input such as a "*" expanded by the shell)? It's easy enough to rewrite literals to be unambiguous, but in order to do so for arbitrary input you need to basically implement (part of) a URI parser... Paul

On Wed, Mar 30, 2016 at 6:55 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Not quite; what I'm saying is that *any* file path can be made unambiguous by prepending "file:///", thus guaranteeing that it will be parsed correctly. I'm not sure that this is necessarily a good thing, but it's in response to the objection that a magic prefix "http://" would introduce an impossibility. ChrisA

On 30 March 2016 at 11:07, Chris Angelico <rosuav@gmail.com> wrote:
I don't know if that's true for Windows paths. You'd need to switch backslashes to slashes, for a start (and *not* do that on Unix, where backslash is a valid filename character, albeit a silly one to use). And the URL syntax for drive letters is at best inconsistent across applications, and at worst undefined (I don't know if the standards define it, but if they do, I do know that not all applications follow the same rules). Paul

Paul Moore writes:
AFAICS it only matters for the file scheme, for which see https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme. That draft describes how several apps handle drive letters.

On Wed, Mar 30, 2016 at 10:55 AM, Paul Moore <p.f.moore@gmail.com> wrote:
If a user input (or config file) says "http://www.example.com/page.html", what does the user want? The http URL or the relative file-system path "http:/www.example.com/page.html"? How often are users *explicitly* asked for a *file-system path*? And if they notice that 'http://...' does not by default refer to the file system, who will complain? Maybe someone can imagine a realistic situation where interpreting "scheme://..." as a URI causes problems?
- Koos

On Wed, Mar 30, 2016 at 7:06 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote: [...]
Even if correct, these do not refer to "http:/www.example.com", but to "/http:/www.example.com". A URI with a relative path would not make a lot of sense, because its meaning would depend on the context, which is against. Then again, all file system paths are 'relative' with respect to the file system you are working in. Also, while RFC 3986 is not super clear about this, I think '//' inside a URI path component may cause problems. IIUC this leads to a zero-length path segment '' in between the two slashes. It might work though if it just gets passed forward to the file system in the end. I don't know if that can 'officially' be normalized to a single slash though.
"URIs that identify in relation to the end-user's local context should only be used when the context itself is a defining aspect of the resource, such as when an on-line help manual refers to a file on the end-user's file system (e.g., "file:///etc/hosts")." - RFC 3986
As far as I can tell the colon in "http:" is RFC 3986-legal, since it has no URI syntactic meaning in the path component.
That's right; per RFC 3986, colons are allowed in a URI path component, even though they are disallowed in *the first path segment* of a *relative reference*, which I assume is to make relative references unambiguous as *URI references*, which can be URIs or relative references. That is, a URI reference "mailto:email@address.com" is a mailto-URL and not a relative reference equivalent to "./mailto:email@address.com". So basically, if you want to express the (ridiculous) path 'http:/www.example.com' as a relative reference, you'd need to do './http:/www.example.com'.
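Python's urllib.parse.urlsplit happens to draw the same line, which makes for a quick check of the rule. This is just an observation about that one function, not a statement about how a Path/URI class would have to parse its input.

```python
from urllib.parse import urlsplit

as_uri = urlsplit("mailto:email@address.com")
as_relative = urlsplit("./mailto:email@address.com")

# Without the "./" prefix, the text before the first colon is read as a scheme.
print(repr(as_uri.scheme), repr(as_uri.path))            # 'mailto' 'email@address.com'

# With the prefix there is no scheme, and the colon is plain path data.
print(repr(as_relative.scheme), repr(as_relative.path))  # '' './mailto:email@address.com'
```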
This is an interesting issue, because the behavior is not implemented consistently:
k7hoven@pomelo ~ % mkdir -p foo/bar
k7hoven@pomelo ~ % ln -s foo/bar baz
k7hoven@pomelo ~ % cd baz/..
k7hoven@pomelo ~ % cd baz
k7hoven@pomelo ~/baz % cd ..
k7hoven@pomelo ~ % echo "am I in foo/ or in ~/ ?" > baz/../question.txt
k7hoven@pomelo ~ % cat question.txt
cat: question.txt: No such file or directory
k7hoven@pomelo ~ % cat foo/question.txt
am I in foo/ or in ~/ ?
This means that URIs can be canonicalized syntactically, while doing so with file system paths is risky.
And that URI normalization should not be done automatically, especially if it is not clear whether it's a URI or not. Then sometimes you also want to do scheme-specific normalization.
-Koos

participants (10)
- Chris Angelico
- Ethan Furman
- Greg Ewing
- Koos Zevenhoven
- Michael Selik
- Michel Desmoulin
- Paul Moore
- Random832
- Stephen J. Turnbull
- Sven R. Kunze