[New-bugs-announce] [issue43936] os.path.realpath() normalizes paths before resolving links on Windows
report at bugs.python.org
Sat Apr 24 17:22:05 EDT 2021
New submission from Barney Gale <barney.gale at gmail.com>:
Capturing a write-up by eryksun on GitHub into a new bug.
> `nt._getfinalpathname()` opens a handle to a file/directory with `CreateFileW()` and calls `GetFinalPathNameByHandleW()`. The latter makes a few system calls to get the final opened path in the filesystem (e.g. "\Windows\explorer.exe") and the canonical DOS name of the volume device on which the filesystem is mounted (e.g. "\Device\HarddiskVolume2" -> "\\?\C:") in order to return a canonical DOS path (e.g. "\\?\C:\Windows\explorer.exe").
> Opening a handle with `CreateFileW()` entails first getting a fully-qualified and normalized NT path, which, among other things, entails resolving ".." components naively in the path string. This does not take reparse points such as symlinks and mountpoints into account. The only time Windows parses ".." components in an opened path the way POSIX does is in the kernel when they're in the target path of a relative symlink.
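As a rough, string-only illustration of the naive ".." handling described above (an approximation, not the actual code path inside `CreateFileW()`), `ntpath.normpath()` collapses ".." without consulting the filesystem, so a symlink component is dropped before it could ever be resolved:

```python
import ntpath

# Pure string normalization: no filesystem access, so whether "symlink"
# is really a link never matters. It is removed together with "..".
naive = ntpath.normpath(r"foo\symlink\..\bar")
print(naive)  # foo\bar

# Forward slashes and repeated separators are normalized in the same pass.
mixed = ntpath.normpath("C:/Users//Barney/../spam")
print(mixed)  # C:\Users\spam
```

`ntpath` is importable on every platform, so this runs on non-Windows systems as well.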
> `nt.readlink()` opens a handle to the file with the flag `FILE_FLAG_OPEN_REPARSE_POINT`. If the final path component is a reparse point, it opens it instead of traversing it. Then the reparse point is read with the filesystem control request, `FSCTL_GET_REPARSE_POINT`. System symlinks and mountpoints (`IO_REPARSE_TAG_SYMLINK` and `IO_REPARSE_TAG_MOUNT_POINT`) are the only supported name-surrogate reparse-point types, though `os.stat()` and `os.lstat()` handle all name-surrogate types as 'links'. Moreover, only symlinks get the `S_IFLNK` mode flag in a stat result, because they're the only ones we can create with `os.symlink()` to satisfy the usage `if os.path.islink(src): os.symlink(os.readlink(src), dst)`.
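The copy idiom quoted above can be wrapped as a small helper. This is only a sketch: `copy_link` is a hypothetical name, and passing `target_is_directory` is an assumption about what a caller on Windows would want (the argument is ignored on other platforms).

```python
import os

def copy_link(src, dst):
    """Recreate the symlink `src` at `dst`.

    Returns False without touching `dst` if `src` is not a symlink.
    """
    if not os.path.islink(src):
        return False
    # os.path.isdir() follows the link, so this reports whether the link
    # currently points at a directory.
    os.symlink(os.readlink(src), dst, target_is_directory=os.path.isdir(src))
    return True
```

The link target is copied as stored, so a relative target stays relative and is resolved against the directory of `dst`, not that of `src`.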
> > What would it take to do a POSIX-style "normalize as we resolve",
> > and would we want to? I guess we'd need to call nt._getfinalpathname()
> > on each path component in turn (C:, C:\Users, C:\Users\Barney etc),
> > which from my pretty basic Windows knowledge might be rather slow if
> > that involves file handles.
> You asked, so I decided to write up an outline of what implementing a POSIX-style `realpath()` might look like in Windows. At its core, it's similar to POSIX: lstat(), and, for a symlink, readlink() and recur. The equivalent calls in Windows are the following:
> * `CreateFileW()` (open a handle)
> * `GetFileInformationByHandleEx()`: `FileAttributeTagInfo`
> * `DeviceIoControl()`: `FSCTL_GET_REPARSE_POINT`
> A symlink has the reparse tag `IO_REPARSE_TAG_SYMLINK`.
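A minimal sketch of that core loop, written with the portable `os.lstat()` and `os.readlink()` wrappers instead of the Win32 calls listed above. The function name `resolve_posix_style` and the `max_links` limit are made up for illustration. The Windows-specific points (substitute and mapped drives, rooted targets such as "\spam", "\\?\" prefixes, failing on ".." at a root) are deliberately left out.

```python
import os
import stat

def resolve_posix_style(path, max_links=40):
    """Resolve symlinks component by component (non-strict sketch)."""
    if not os.path.isabs(path):
        # Avoid os.path.abspath(): it would collapse ".." naively.
        path = os.path.join(os.getcwd(), path)
    drive, rest = os.path.splitdrive(path)
    resolved = drive + os.sep
    # Stack of components still to process; the next one is at the end.
    parts = rest.replace("/", os.sep).split(os.sep)
    pending = [c for c in reversed(parts) if c]
    links_seen = 0
    while pending:
        name = pending.pop()
        if name == ".":
            continue
        if name == "..":
            # The components in `resolved` have already been processed,
            # so dropping the last one is correct at this point.
            resolved = os.path.dirname(resolved)
            continue
        candidate = os.path.join(resolved, name)
        try:
            is_link = stat.S_ISLNK(os.lstat(candidate).st_mode)
        except OSError:
            is_link = False  # missing or inaccessible: keep the name as-is
        if not is_link:
            resolved = candidate
            continue
        links_seen += 1
        if links_seen > max_links:
            raise OSError(f"too many symbolic links in {path!r}")
        target = os.readlink(candidate)
        if os.path.isabs(target):
            # Absolute target: parsing restarts at the target's root.
            drive, target = os.path.splitdrive(target)
            resolved = drive + os.sep
        # Queue the target's components ahead of the remaining ones.
        parts = target.replace("/", os.sep).split(os.sep)
        pending.extend(c for c in reversed(parts) if c)
    return resolved
```

With a link "foo/symlink" pointing at "../target/sub", the path "foo/symlink/../bar" comes out as "target/bar", whereas string normalization alone would give "foo/bar".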
> Filesystem mountpoints (aka junctions, which are like Unix bind mountpoints) must be retained in the resolved path in order to correctly resolve relative symlinks such as "\spam" (relative to the resolved device) and "..\..\spam". Anyway, this is consistent with the UNC case, since mountpoints on a remote server can never be resolved (i.e. a final UNC path never resolves mountpoints).
> Here are some of the notable differences compared to POSIX:
> * If the source path is not a "\\?\" verbatim path, `GetFullPathNameW()` must be called initially. However, ".." components in the target path of a relative symlink must be resolved the POSIX way, else symlinks in the target path may be removed incorrectly before their target is resolved (e.g. "foo\symlink\..\bar" incorrectly resolved as "foo\bar"). The opened path is initially normalized as follows:
>   * replace forward slashes with backslashes
>   * collapse repeated backslashes (except the UNC root must have exactly two backslashes)
>   * resolve a relative path (e.g. "spam"), drive-relative path (e.g. "Z:spam"), or rooted path (e.g. "\spam") as a fully-qualified path (e.g. "Z:\eggs\spam")
>   * resolve "." and ".." components in the opened path (naive to symlinks)
>   * strip trailing spaces and dots from the final component (e.g. "C:\spam. . ." -> "C:\spam")
>   * resolve reserved device names in the final component of a non-UNC path (e.g. "C:\nul" -> "\\.\nul")
> * Substitute drives (e.g. created by "subst.exe", or `DefineDosDeviceW`) and mapped drives (e.g. created by "net.exe", or `WNetAddConnection2W`) must be resolved, respectively via `QueryDosDeviceW()` and `WNetGetUniversalNameW()`. Like all DOS 'devices', these drives are implemented as object symlinks (i.e. symlinks in the object namespace, not to be confused with filesystem symlinks). The target path of these drives, however, is not a Device object, but rather a filesystem path on a device that can include any number of path components, some of which may be filesystem symlinks that need to be resolved. Normally when a path is opened, the system object manager reparses all DOS 'devices' to the path of an actual Device object, or a path on a Device object, before the I/O manager's parse routine ever sees the path. Such drives need to be resolved whenever parsing starts or restarts at a drive, but the result can be cached in case multiple filesystem symlinks target the same drive.
>   * Substitute drives can target paths on other substitute drives, so `QueryDosDeviceW()` has to be called in a loop that accumulates the tail path components until it reaches a real device (i.e. a target path that doesn't begin with "\??\").
>   * `WNetGetUniversalNameW()` has to be called after resolving substitute drives. It resolves the underlying UNC path of a mapped drive. The target path of the object symlink that implements a mapped drive is of the form "\Device\<redirector device name>\;<something>\server\share\some\filesystem\path". The "redirector device name" component is usually (post Windows Vista) an object symlink to a path on the system's Multiple UNC Provider (MUP) device, "\Device\Mup". The mapped-drive target path ultimately resolves to a redirected filesystem that's mounted in the MUP device namespace at the "share" name. This is an implementation detail of the filesystem redirector and MUP device, which the Multiple Provider Router (MPR) WNet API encapsulates. For example, for the mapped drive path "Z:\spam\eggs", it returns a UNC path of the form "\\server\share\some\filesystem\path\spam\eggs".
> * A join that tries to resolve ".." against the drive or share root path must fail, whereas this is ignored for the root path in POSIX. For example, `symlink_join("C:\\", "..\\spam")` must fail, since the system would fail an open that tried to reparse that symlink target.
> * At the end, the resolved path should be tested to try to remove "\\?\" if the source path didn't have this prefix. Call `GetFullPathNameW()` to check for a reserved name in the final component and `PathCchCanonicalizeEx()` to check for long-path support. (The latter calls the system runtime library function `RtlAreLongPathsEnabled`, but that's an undocumented implementation detail.)
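The failing-join rule in the list above can be sketched with string operations from `ntpath`. `symlink_join` is the hypothetical helper named in the write-up and `parent_or_fail` is an invented name. The sketch assumes a relative target whose named components are not themselves symlinks. A real implementation would have to check each joined component before applying a later "..".

```python
import ntpath

def parent_or_fail(resolved):
    """Return the parent of `resolved`, failing at a drive or share root."""
    drive, tail = ntpath.splitdrive(resolved)
    if tail in ("", "\\"):
        raise OSError(f"'..' cannot be resolved above the root of {drive!r}")
    return ntpath.dirname(resolved)

def symlink_join(resolved_dir, target):
    """Join a relative symlink target onto an already-resolved directory."""
    target = target.replace("/", "\\")
    if ntpath.splitdrive(target)[0] or target.startswith("\\"):
        # Absolute and rooted targets restart parsing elsewhere; not covered.
        raise ValueError("this sketch only handles relative targets")
    path = resolved_dir
    for name in target.split("\\"):
        if name in ("", "."):
            continue
        if name == "..":
            path = parent_or_fail(path)
        else:
            # Assumption: `name` is not a symlink (see the note above).
            path = ntpath.join(path, name)
    return path
```

Under these assumptions, `symlink_join("C:\\foo\\bar", "..\\..\\spam")` gives "C:\spam", while `symlink_join("C:\\", "..\\spam")` raises `OSError` instead of silently staying at the root as a POSIX join would.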
> `GetFinalPathNameByHandleW()` is not required. Optionally, it can be called for the last valid component if the caller wants a final path with all mountpoints resolved, i.e. add a `final_path=False` option. Of course, a final UNC path must retain mountpoints, so there's nothing we can do in that case. It's fine that this `realpath()` implementation would return a path that contains mountpoints in Windows (as the current implementation also does for UNC paths). They are not symlinks, and this matches the behavior of POSIX.
> I'd include a warning in the documentation that getting a final path via `GetFinalPathNameByHandleW()` in the non-strict case may be dysfunctional. The unresolved tail end of the path may become valid again if a server or device comes back online. If the unresolved part contains symlinks with relative targets such as "\spam" and "..\..\spam", and the `realpath()` call resolved away mountpoints, the remaining path may not resolve correctly against the final path, as compared to how it would resolve against the original path. It definitely will not resolve the same for a rooted target path such as "\spam" if the last resolved reparse point in the original path was a mountpoint, since it will reparse to the root path of the mountpoint device instead of the original opened device, or instead of the last resolved device of a symlink in the path.
components: Library (Lib)
title: os.path.realpath() normalizes paths before resolving links on Windows
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 3.9
Python tracker <report at bugs.python.org>