joining paths without worrying about a leading slash
[this is a continuation of https://bugs.python.org/issue44452] pathlib.Path() has a concatenation operator "/" that allows the right-hand-side argument to be an absolute path, which causes the left-hand-side argument to be ignored:
pathlib.Path('/foo') / '/bar' PosixPath('/bar') pathlib.Path('/var/tmp/instroot') / '/some/path' / '/suffix' PosixPath('/suffix')
This follows the precedent set by os.path.join(), and probably makes sense in the scenario of simulating a user typing 'cd' commands in a shell. But it doesn't work nicely in the case of combining paths from two different "namespaces", where we never want to go "up". For example: a web server takes an URL, strips the host, and wants to look up a file: https://example.com/some/path → "/some/path" → /src/www/root + /some/path → /src/www/root/some/path or we are constructing a container image and need to refer to a file in the container: <container foo> + /etc/shadow → /var/lib/machines/foo + /etc/shadow → /var/lib/machines/foo/etc/shadow To do this kind of operation correctly with pathlib.Path, the user needs to do two operations: verify that the rhs argument contains no '..' [*], and strip leading slashes:
lhs = pathlib.Path('/some/namespace/') rhs = '/some/path/to/add' if '..' in pathlib.Path(rhs).parts: raise ValueError path = lhs / rhs.lstrip('/')
Those last two lines are rather verbose, non-obvious. Also the .lstrip() operation attaches on the right side, but operates on the left side, earlier than the "/", which is overall not very nice. Proposal: add "//"-operator to pathlib.PosixPath() that means "concatenate a rhs path that is underneath the lhs". It would disallow paths with '..', and concatenate paths as relative to the specified lhs:
lhs = pathlib.Path('/some/namespace/') lhs // "a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "/a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "a/../b/c" ValueError: cannot use // with a path with '..' on the right
This would be useful for operations on containers, combining paths from namespaces like fs paths and URL components, looking up files underneath an unpacked archive, etc. [*] Why completely disallow '..' ? Components with '..' cannot be correctly resolved without access to the filesystem, because a component may be a symlink, and then "a/b/../." may not be "a/.", but something completely different. Thus, since the goal is to have a path underneath lhs, I think it's best to forbid '..'. In principle '..' at the beginning can be resolved reliably, by simply ignoring it, '/../../../whatever' is the same as '/whatever/'. But it's a tiny corner case, and I think it's better to disallow that too. Zbyszek
On 27 Jun 2021, at 12:07, Zbigniew Jędrzejewski-Szmek
wrote: [this is a continuation of https://bugs.python.org/issue44452]
pathlib.Path() has a concatenation operator "/" that allows the right-hand-side argument to be an absolute path, which causes the left-hand-side argument to be ignored:
pathlib.Path('/foo') / '/bar' PosixPath('/bar') pathlib.Path('/var/tmp/instroot') / '/some/path' / '/suffix' PosixPath('/suffix')
This follows the precedent set by os.path.join(), and probably makes sense in the scenario of simulating a user typing 'cd' commands in a shell.
But it doesn't work nicely in the case of combining paths from two different "namespaces", where we never want to go "up".
For example: a web server takes an URL, strips the host, and wants to look up a file: https://example.com/some/path → "/some/path" → /src/www/root + /some/path → /src/www/root/some/path
or we are constructing a container image and need to refer to a file in the container: <container foo> + /etc/shadow → /var/lib/machines/foo + /etc/shadow → /var/lib/machines/foo/etc/shadow
To do this kind of operation correctly with pathlib.Path, the user needs to do two operations: verify that the rhs argument contains no '..' [*], and strip leading slashes:
lhs = pathlib.Path('/some/namespace/') rhs = '/some/path/to/add' if '..' in pathlib.Path(rhs).parts: raise ValueError path = lhs / rhs.lstrip('/')
Those last two lines are rather verbose, non-obvious. Also the .lstrip() operation attaches on the right side, but operates on the left side, earlier than the "/", which is overall not very nice.
Proposal:
add "//"-operator to pathlib.PosixPath() that means "concatenate a rhs path that is underneath the lhs". It would disallow paths with '..', and concatenate paths as relative to the specified lhs:
lhs = pathlib.Path('/some/namespace/') lhs // "a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "/a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "a/../b/c" ValueError: cannot use // with a path with '..' on the right
This would be useful for operations on containers, combining paths from namespaces like fs paths and URL components, looking up files underneath an unpacked archive, etc.
[*] Why completely disallow '..' ? Components with '..' cannot be correctly resolved without access to the filesystem, because a component may be a symlink, and then "a/b/../." may not be "a/.", but something completely different. Thus, since the goal is to have a path underneath lhs, I think it's best to forbid '..'. In principle '..' at the beginning can be resolved reliably, by simply ignoring it, '/../../../whatever' is the same as '/whatever/'. But it's a tiny corner case, and I think it's better to disallow that too.
There are two ideas here. 1. Allow Path() to join a pair of absolute paths. 2. Prevent '..' from escaping into the first absolute path. For (1) you can do this today:
root=Path('/var/www') root / y.relative_to('/') PosixPath('/var/www/a/b')
I can think if a number of rules that might apply for (2). (a) raise an error is there is a '..' or '.' in any path component. (b) resolve() '..' and ',' as pathlib already does - I'm not sure that use of the filesystem is needed to validate the use of .. is always needed.
y=Path('/a/b/../v.html') y.relative_to('/') PosixPath('a/b/../v.html') root / y.relative_to('/') PosixPath('/var/www/a/b/../v.html') root / y.resolve().relative_to('/') PosixPath('/var/www/a/v.html')
and show that no escape to root happens:
y=Path('/../a//v.html') root / y.resolve().relative_to('/') PosixPath('/var/www/a/v.html')
Barry
Zbyszek _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/IXYPKV... Code of Conduct: http://python.org/psf/codeofconduct/
"[Python-ideas] Sanitize filename (path part) 2nd try"
https://mail.python.org/archives/list/python-ideas@python.org/thread/LRIKMG3...
"[Python-ideas] Sanitize filename (path part)"
https://mail.python.org/archives/list/python-ideas@python.org/thread/SQH4LPE...
```quote
What does sanitizepart do with a leading slash?
assert os.path.join("a", "/b") == "/b"
A new safejoin() or joinsafe() or join(safe='True') could call
sanitizepart() such that:
assert joinsafe("a\n", "/b") == "a\n/b"
```
On Sun, Jun 27, 2021, 17:15 Barry Scott
On 27 Jun 2021, at 12:07, Zbigniew Jędrzejewski-Szmek
wrote: [this is a continuation of https://bugs.python.org/issue44452]
pathlib.Path() has a concatenation operator "/" that allows the right-hand-side argument to be an absolute path, which causes the left-hand-side argument to be ignored:
pathlib.Path('/foo') / '/bar' PosixPath('/bar') pathlib.Path('/var/tmp/instroot') / '/some/path' / '/suffix' PosixPath('/suffix')
This follows the precedent set by os.path.join(), and probably makes sense in the scenario of simulating a user typing 'cd' commands in a shell.
But it doesn't work nicely in the case of combining paths from two different "namespaces", where we never want to go "up".
For example: a web server takes an URL, strips the host, and wants to look up a file: https://example.com/some/path → "/some/path" → /src/www/root + /some/path → /src/www/root/some/path
or we are constructing a container image and need to refer to a file in the container: <container foo> + /etc/shadow → /var/lib/machines/foo + /etc/shadow → /var/lib/machines/foo/etc/shadow
To do this kind of operation correctly with pathlib.Path, the user needs to do two operations: verify that the rhs argument contains no '..' [*], and strip leading slashes:
lhs = pathlib.Path('/some/namespace/') rhs = '/some/path/to/add' if '..' in pathlib.Path(rhs).parts: raise ValueError path = lhs / rhs.lstrip('/')
Those last two lines are rather verbose, non-obvious. Also the .lstrip() operation attaches on the right side, but operates on the left side, earlier than the "/", which is overall not very nice.
Proposal:
add "//"-operator to pathlib.PosixPath() that means "concatenate a rhs path that is underneath the lhs". It would disallow paths with '..', and concatenate paths as relative to the specified lhs:
lhs = pathlib.Path('/some/namespace/') lhs // "a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "/a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "a/../b/c" ValueError: cannot use // with a path with '..' on the right
This would be useful for operations on containers, combining paths from namespaces like fs paths and URL components, looking up files underneath an unpacked archive, etc.
[*] Why completely disallow '..' ? Components with '..' cannot be correctly resolved without access to the filesystem, because a component may be a symlink, and then "a/b/../." may not be "a/.", but something completely different. Thus, since the goal is to have a path underneath lhs, I think it's best to forbid '..'. In principle '..' at the beginning can be resolved reliably, by simply ignoring it, '/../../../whatever' is the same as '/whatever/'. But it's a tiny corner case, and I think it's better to disallow that too.
There are two ideas here.
1. Allow Path() to join a pair of absolute paths.
2. Prevent '..' from escaping into the first absolute path.
For (1) you can do this today:
root=Path('/var/www') root / y.relative_to('/') PosixPath('/var/www/a/b')
I can think if a number of rules that might apply for (2). (a) raise an error is there is a '..' or '.' in any path component. (b) resolve() '..' and ',' as pathlib already does
- I'm not sure that use of the filesystem is needed to validate the use of .. is always needed.
y=Path('/a/b/../v.html') y.relative_to('/') PosixPath('a/b/../v.html') root / y.relative_to('/') PosixPath('/var/www/a/b/../v.html') root / y.resolve().relative_to('/') PosixPath('/var/www/a/v.html')
and show that no escape to root happens:
y=Path('/../a//v.html') root / y.resolve().relative_to('/') PosixPath('/var/www/a/v.html')
Barry
Zbyszek _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/IXYPKV...
Code of Conduct: http://python.org/psf/codeofconduct/
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/6HR4IA... Code of Conduct: http://python.org/psf/codeofconduct/
On Sun, Jun 27, 2021 at 10:13:25PM +0100, Barry Scott wrote:
On 27 Jun 2021, at 12:07, Zbigniew Jędrzejewski-Szmek
wrote: [this is a continuation of https://bugs.python.org/issue44452]
pathlib.Path() has a concatenation operator "/" that allows the right-hand-side argument to be an absolute path, which causes the left-hand-side argument to be ignored:
pathlib.Path('/foo') / '/bar' PosixPath('/bar') pathlib.Path('/var/tmp/instroot') / '/some/path' / '/suffix' PosixPath('/suffix')
This follows the precedent set by os.path.join(), and probably makes sense in the scenario of simulating a user typing 'cd' commands in a shell.
But it doesn't work nicely in the case of combining paths from two different "namespaces", where we never want to go "up".
For example: a web server takes an URL, strips the host, and wants to look up a file: https://example.com/some/path → "/some/path" → /src/www/root + /some/path → /src/www/root/some/path
or we are constructing a container image and need to refer to a file in the container: <container foo> + /etc/shadow → /var/lib/machines/foo + /etc/shadow → /var/lib/machines/foo/etc/shadow
To do this kind of operation correctly with pathlib.Path, the user needs to do two operations: verify that the rhs argument contains no '..' [*], and strip leading slashes:
lhs = pathlib.Path('/some/namespace/') rhs = '/some/path/to/add' if '..' in pathlib.Path(rhs).parts: raise ValueError path = lhs / rhs.lstrip('/')
Those last two lines are rather verbose, non-obvious. Also the .lstrip() operation attaches on the right side, but operates on the left side, earlier than the "/", which is overall not very nice.
Proposal:
add "//"-operator to pathlib.PosixPath() that means "concatenate a rhs path that is underneath the lhs". It would disallow paths with '..', and concatenate paths as relative to the specified lhs:
lhs = pathlib.Path('/some/namespace/') lhs // "a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "/a/b/c" PosixPath('/some/namespace/a/b/c') lhs // "a/../b/c" ValueError: cannot use // with a path with '..' on the right
This would be useful for operations on containers, combining paths from namespaces like fs paths and URL components, looking up files underneath an unpacked archive, etc.
[*] Why completely disallow '..' ? Components with '..' cannot be correctly resolved without access to the filesystem, because a component may be a symlink, and then "a/b/../." may not be "a/.", but something completely different. Thus, since the goal is to have a path underneath lhs, I think it's best to forbid '..'. In principle '..' at the beginning can be resolved reliably, by simply ignoring it, '/../../../whatever' is the same as '/whatever/'. But it's a tiny corner case, and I think it's better to disallow that too.
There are two ideas here.
1. Allow Path() to join a pair of absolute paths.
2. Prevent '..' from escaping into the first absolute path.
For (1) you can do this today:
root=Path('/var/www') root / y.relative_to('/') PosixPath('/var/www/a/b')
I don't like it because: 1. 'y' must be a Path object too. The most natural use of such concatenation is to have some "root" object which is converted to Path() once, and then passed around to functions. So this would have to look more like this: root / pathlib.Path(y).relative_to('/') 2. It's another non-obvious and verbose workaround 3. It doesn't work in all cases:
pathlib.Path('///a/b/c').relative_to('/') PosixPath('a/b/c') pathlib.Path('//a/b/c').relative_to('/') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib64/python3.9/pathlib.py", line 929, in relative_to raise ValueError("{!r} is not in the subpath of {!r}" ValueError: '//a/b/c' is not in the subpath of '/' OR one path is relative and the other is absolute. pathlib.Path('/a/b/c').relative_to('/') PosixPath('a/b/c')
This is POSIX, btw, see https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag.... 4. It also doesn't handle '..' in any way:
pathlib.Path('/../foo').relative_to('/') PosixPath('../foo')
I can think if a number of rules that might apply for (2). (a) raise an error is there is a '..' or '.' in any path component.
Let's talk about '.' first. '.' is not a problem at all. It simply *doesn't matter* for any correctness or security rules that are normally used. One might want to remove '.' from the string before printing as a prettification operation, but that's all.
(b) resolve() '..' and ',' as pathlib already does
- I'm not sure that use of the filesystem is needed to validate the use of .. is always needed. [...] no escape to root happens:
The main reason is not "not escaping to the root", but resolving the path correctly, in the sense of resolving it the same as kernel would, if we were to access the file: $ ls -l /bin /bin -> usr/bin/ $ ls -ld /bin/../riscv64-linux-gnu drwxr-xr-x 4 root root 4096 Jan 26 05:11 /bin/../riscv64-linux-gnu/ $ ls -ld /riscv64-linux-gnu ls: cannot access '/riscv64-linux-gnu': No such file or directory As you can see, you cannot just collapse '/bin/../' to '/', because '/bin/../' is '/usr' and '/bin/../riscv64-linux-gnu' is '/usr/riscv64-linux-gnu'. Similarly, '/var/lib/machines/rawhide/bin/../riscv64-linux-gnu/' is *not* '/var/lib/machines/rawhide/riscv64-linux-gnu/'. But actually, it also allows "escaping":
pathlib.Path('/root') / pathlib.Path('/../foo').relative_to('/') PosixPath('/root/../foo') ('/root/foo' is the what we would need instead.)
(The operation of resolving a path with '..'-components relative to some root comes up quite often when handling containers. It's a royal pain, because you need to look up one component, then if it is a symlink, resursively look for any '..'-components in the symlink target, and at each step normalize paths relative to the temporary root. Example implementation: https://github.com/systemd/systemd/blob/main/src/basic/fs-util.c#L771 .) Zbyszek
On Sun, Jun 27, 2021 at 09:55:34PM -0400, Wes Turner wrote:
"[Python-ideas] Sanitize filename (path part) 2nd try" https://mail.python.org/archives/list/python-ideas@python.org/thread/LRIKMG3...
"[Python-ideas] Sanitize filename (path part)" https://mail.python.org/archives/list/python-ideas@python.org/thread/SQH4LPE...
```quote What does sanitizepart do with a leading slash?
assert os.path.join("a", "/b") == "/b"
A new safejoin() or joinsafe() or join(safe='True') could call sanitizepart() such that:
assert joinsafe("a\n", "/b") == "a\n/b" ```
Thanks for the links. "sanitizepart()" seems to be about *constructing* a safe filename. It's a different problem and there's a thousand ways to do it. I think the idea with joinsafe() is similar to my idea... But I think the req to disallow '..' is crucial. If we set the requirements as: 1. the resulting path must not be above the lhs arg 2. the operation must be done without actually accessing the fs right now I see the proposed operation that rejects '..' as the best approach. Zbyszek
Here's this, which IIRC I never wrote tests for, which is what needs to be
done to specify the correct behavior:
```python
def pathjoin(*args, **kwargs):
"""
Arguments:
args (list): *args list of paths
if len(args) == 1, args[0] is not a string, and args[0] is
iterable,
set args to args[0].
Basically::
joined_path = u'/'.join(
[args[0].rstrip('/')] +
[a.strip('/') for a in args[1:-1]] +
[args[-1].lstrip('/')])
"""
log.debug('pathjoin: %r' % list(args))
def _pathjoin(*args, **kwargs):
len_ = len(args) - 1
if len_ < 0:
raise Exception('no args specified')
elif len_ == 0:
if not isinstance(args, basestring):
if hasattr(args, '__iter__'):
_args = args
_args
args = args[0]
for i, arg in enumerate(args):
if not i:
yield arg.rstrip('/')
elif i == len_:
yield arg.lstrip('/')
else:
yield arg.strip('/')
joined_path = u'/'.join(_pathjoin(*args))
return sanitize_path(joined_path)
def sanitize_path(path):
# XXX TODO FIXME
if '/../' in path:
raise Exception()
return path
```
https://github.com/westurner/pgs/blob/master/pgs/app.py#L60-L95
On Mon, Jun 28, 2021, 04:09 Zbigniew Jędrzejewski-Szmek
On Sun, Jun 27, 2021 at 09:55:34PM -0400, Wes Turner wrote:
"[Python-ideas] Sanitize filename (path part) 2nd try"
https://mail.python.org/archives/list/python-ideas@python.org/thread/LRIKMG3...
"[Python-ideas] Sanitize filename (path part)"
https://mail.python.org/archives/list/python-ideas@python.org/thread/SQH4LPE...
```quote What does sanitizepart do with a leading slash?
assert os.path.join("a", "/b") == "/b"
A new safejoin() or joinsafe() or join(safe='True') could call sanitizepart() such that:
assert joinsafe("a\n", "/b") == "a\n/b" ```
Thanks for the links. "sanitizepart()" seems to be about *constructing* a safe filename. It's a different problem and there's a thousand ways to do it.
I think the idea with joinsafe() is similar to my idea... But I think the req to disallow '..' is crucial. If we set the requirements as:
1. the resulting path must not be above the lhs arg 2. the operation must be done without actually accessing the fs
right now I see the proposed operation that rejects '..' as the best approach.
Zbyszek
On Mon, Jun 28, 2021 at 10:03:15AM -0400, Wes Turner wrote:
Here's this, which IIRC I never wrote tests for, which is what needs to be done to specify the correct behavior:
```python def pathjoin(*args, **kwargs): """ Arguments: args (list): *args list of paths if len(args) == 1, args[0] is not a string, and args[0] is iterable, set args to args[0].
Basically::
joined_path = u'/'.join( [args[0].rstrip('/')] + [a.strip('/') for a in args[1:-1]] + [args[-1].lstrip('/')]) """ log.debug('pathjoin: %r' % list(args))
def _pathjoin(*args, **kwargs): len_ = len(args) - 1 if len_ < 0: raise Exception('no args specified') elif len_ == 0: if not isinstance(args, basestring): if hasattr(args, '__iter__'): _args = args _args args = args[0] for i, arg in enumerate(args): if not i: yield arg.rstrip('/') elif i == len_: yield arg.lstrip('/') else: yield arg.strip('/') joined_path = u'/'.join(_pathjoin(*args)) return sanitize_path(joined_path)
def sanitize_path(path): # XXX TODO FIXME if '/../' in path: raise Exception() return path ```
https://github.com/westurner/pgs/blob/master/pgs/app.py#L60-L95
Yes, something like that should work, except that sanitize_path() misses leading or trailing '..'. When we do this is an operator, things become even simpler: class PosixPath2(PosixPath): def __floordiv__(self, other): as_path = PosixPath(other) if '..' in as_path.parts: raise ValueError("argument has a component with '..'") if as_path.is_absolute(): other = str(other).lstrip('/') return self / other Zbyszek
On Mon, Jun 28, 2021, 04:09 Zbigniew Jędrzejewski-Szmek
wrote: On Sun, Jun 27, 2021 at 09:55:34PM -0400, Wes Turner wrote:
"[Python-ideas] Sanitize filename (path part) 2nd try"
https://mail.python.org/archives/list/python-ideas@python.org/thread/LRIKMG3...
"[Python-ideas] Sanitize filename (path part)"
https://mail.python.org/archives/list/python-ideas@python.org/thread/SQH4LPE...
```quote What does sanitizepart do with a leading slash?
assert os.path.join("a", "/b") == "/b"
A new safejoin() or joinsafe() or join(safe='True') could call sanitizepart() such that:
assert joinsafe("a\n", "/b") == "a\n/b" ```
Thanks for the links. "sanitizepart()" seems to be about *constructing* a safe filename. It's a different problem and there's a thousand ways to do it.
I think the idea with joinsafe() is similar to my idea... But I think the req to disallow '..' is crucial. If we set the requirements as:
1. the resulting path must not be above the lhs arg 2. the operation must be done without actually accessing the fs
right now I see the proposed operation that rejects '..' as the best approach.
Zbyszek
participants (3)
-
Barry Scott
-
Wes Turner
-
Zbigniew Jędrzejewski-Szmek