[Python-ideas] URLs/URIs + pathlib.Path + literal syntax = ?

Koos Zevenhoven k7hoven at gmail.com
Wed Mar 30 08:46:20 EDT 2016


On Wed, Mar 30, 2016 at 7:06 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
[...]
> The correct syntaxes per [1] and RFC 3986 are
>
> 4)  Path("file:///http://www.example.com")
> 5)  Path("file://localhost/http://www.example.com")
> 6)  Path("file://[127.0.0.1]/http://www.example.com")
> 7)  Path("file://[::1]/http://www.example.com")
>

Even if correct, these do not refer to "http:/www.example.com", but to
"/http:/www.example.com". A URI with a relative path would not make a
lot of sense, because its meaning would depend on the context, which
goes against the RFC 3986 guidance quoted below. Then again, all file
system paths are 'relative' with respect to the file system you are
working in.

Also, while RFC 3986 is not super clear about this, I think '//'
inside a URI path component may cause problems. IIUC this leads to a
zero-length path segment '' in between the two slashes. It might work
though if it just gets passed forward to the file system in the
end. I don't know if that can 'officially' be normalized to a single
slash though.
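
As a rough illustration of the two points above, here is a minimal
sketch using only the standard library. The helper name describe() is
made up for this example, and PurePosixPath is used only as a stand-in
for "what a POSIX-style path object does with the string"; it is not a
claim about how any URI-aware Path would behave.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def describe(uri):
    """Split a URI and show what happens to its path component.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    # Splitting on '/' exposes the zero-length segment between the two slashes.
    segments = parts.path.split("/")
    # PurePosixPath collapses the repeated slash into a single one.
    collapsed = str(PurePosixPath(parts.path))
    return parts.netloc, parts.path, segments, collapsed


for uri in ("file:///http://www.example.com",
            "file://localhost/http://www.example.com"):
    print(describe(uri))
```

In both cases the path component starts with a slash, so it is
absolute, and the path object silently drops the empty segment.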

   "URIs that
   identify in relation to the end-user's local context should only be
   used when the context itself is a defining aspect of the resource,
   such as when an on-line help manual refers to a file on the end-
   user's file system (e.g., "file:///etc/hosts")." - RFC 3986

> As far as I can tell the colon in "http:" is RFC 3986-legal, since it
> has no URI syntactic meaning in the path component.

That's right; per RFC 3986, colons are allowed in a URI path
component, even though they are disallowed in *the first path segment*
of a *relative reference*. I assume that restriction exists to keep
*URI references*, which can be either URIs or relative references,
unambiguous. That is, the URI reference "mailto:email@address.com" is
a mailto URL and not a relative reference equivalent to
"./mailto:email@address.com".

So basically, if you want to express the (ridiculous) path
'http:/www.example.com' as a relative reference, you'd need to do
'./http:/www.example.com'.
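
The ambiguity can be seen with urllib.parse, which is just one parser
and not the final word on RFC 3986. In this sketch the base URL and
the mail address are placeholders chosen for the example.

```python
from urllib.parse import urljoin, urlsplit

BASE = "http://example.org/base/"  # placeholder base URL for the example

# Without the leading "./", the text before the first colon is read as a scheme.
as_uri = urlsplit("mailto:user@example.com")
# With the leading "./", the same characters are part of a relative path.
as_relative = urlsplit("./mailto:user@example.com")

print(as_uri.scheme, "|", as_uri.path)
print(repr(as_relative.scheme), "|", as_relative.path)

# Resolving both against a base shows the difference in meaning.
print(urljoin(BASE, "mailto:user@example.com"))
print(urljoin(BASE, "./mailto:user@example.com"))
```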

> This isn't as
> easy as it looks (which is why people are trying to delegate it to
> something they think of as "simple").
>
> There's an additional problem with trying to cram URIs and Path
> together, which is that in a file system, "/a/b/symlink/../c" may
> refer to any file system object depending on symlink's target which is
> unknown, while as an URI path it refers to whatever "/a/b/c" refers
> to, and nothing else.  (This is the semantic glitch I was thinking of
> earlier.)

This is an interesting issue, because the behavior is not implemented
consistently:

 k7hoven@pomelo ~ % mkdir -p foo/bar
 k7hoven@pomelo ~ % ln -s foo/bar baz
 k7hoven@pomelo ~ % cd baz/..
 k7hoven@pomelo ~ % cd baz
 k7hoven@pomelo ~/baz % cd ..
 k7hoven@pomelo ~ % echo "am I in foo/ or in ~/ ?" > baz/../question.txt
 k7hoven@pomelo ~ % cat question.txt
 cat: question.txt: No such file or directory
 k7hoven@pomelo ~ % cat foo/question.txt
 am I in foo/ or in ~/ ?
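
In the session above, the shell's cd treats ".." lexically, while the
redirection to baz/../question.txt is resolved by the file system and
follows the symlink first. The same split shows up in Python between
os.path.normpath (purely textual) and os.path.realpath (follows
symlinks). A minimal sketch, assuming a POSIX system where symlinks
can be created; the function name is made up for the example:

```python
import os
import tempfile


def compare_dotdot_resolution():
    """Return (lexical, physical) results for 'baz/../question.txt'.

    Recreates the foo/bar + baz symlink layout in a temporary directory.
    """
    with tempfile.TemporaryDirectory() as tmp:
        home = os.path.realpath(tmp)  # avoid surprises if tmp is itself a symlink
        os.makedirs(os.path.join(home, "foo", "bar"))
        os.symlink(os.path.join("foo", "bar"), os.path.join(home, "baz"))

        path = os.path.join(home, "baz", "..", "question.txt")
        lexical = os.path.normpath(path)    # cancels 'baz/..' textually
        physical = os.path.realpath(path)   # follows baz to foo/bar, then goes up
        return os.path.relpath(lexical, home), os.path.relpath(physical, home)


print(compare_dotdot_resolution())
```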

> This means that URIs can be canonicalized syntactically, while doing
> so with file system paths is risky.

And that URI normalization should not be done automatically,
especially if it is not clear whether the string is a URI at all.
Sometimes you also want scheme-specific normalization on top of that.
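
To make that concrete, here is a sketch of an explicit, opt-in
normalization step. Both function names and the known_schemes
parameter are assumptions made up for this example.
remove_dot_segments() is a simplified take on the idea in RFC 3986
section 5.2.4, not a full implementation, and no scheme-specific rules
are applied.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Syntactically drop '.' and '..' segments (simplified sketch)."""
    out = []
    segments = path.split("/")
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Pop the previous segment, but never the leading '' of an absolute path.
            if len(out) > 1 or (out and out[0] != ""):
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # keep the trailing slash, e.g. '/a/b/..' becomes '/a/'
    return "/".join(out)


def normalize_if_uri(text, known_schemes=("http", "https")):
    """Normalize only strings that clearly are URIs with a known scheme."""
    parts = urlsplit(text)
    if parts.scheme not in known_schemes:
        return text  # possibly a file system path: leave it untouched
    return urlunsplit((parts.scheme, parts.netloc,
                       remove_dot_segments(parts.path),
                       parts.query, parts.fragment))


print(normalize_if_uri("http://example.com/a/b/symlink/../c"))
print(normalize_if_uri("/a/b/symlink/../c"))
```

The plain path comes back unchanged, since collapsing "symlink/.."
there could point at a different file.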

-Koos

