Mailman 3 Sanitize filename (path part) 2nd try - Python-ideas

Sanitize filename (path part) 2nd try

Steve Jorgensen

May 11, 2020

7:36 a.m.

Based on responses to my previous proposal, I am convinced that it was over-ambitious and not appropriate for inclusion in the Python standard library, so I am starting over with a more narrowly scoped suggestion.

Proposal:

Add a new function (possibly `os.path.sanitizepart`) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters.

When an invalid character is encountered, `ValueError` will be raised in the default case, or the character may be replaced or escaped.

When an invalid name is encountered, `ValueError` will be raised in the default case, or the first character may be replaced, escaped, or prefixed.

Control characters (those in the Unicode general category of "C") are treated as invalid by default.

After applying any transformations, if the result would still be invalid, then an exception is raised.

Proposed function signature: `sanitizepart(name, replace=None, escape=None, prefix=None, flags=0)`

When `replace` is supplied, it is used as a replacement for any invalid characters or for the first character of an invalid name. When `prefix` is not also supplied, this is also used as the replacement for the first character of the name if it is invalid, not simply due to containing invalid characters.

When `escape` is supplied (typically "%") it is used as the escape character in the same way that "%" is used in URL encoding. When a non-ASCII character is escaped, it is represented as a sequence of encoded bytes/octets. When `prefix` is not also supplied, this is also used to escape the first character of the name if it is invalid, not simply due to containing invalid characters.

`replace` and `escape` are mutually exclusive.

When `prefix` is supplied (typically "_"), it is prepended to the name if it is invalid, not simply due to containing invalid characters.

Flags:

- path.PERMIT_RELATIVE (1): Permit relative path values ("." "..")
- path.PERMIT_CTRL (2): Permit characters in the Unicode general category of "C".
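To make the rules above concrete, here is a minimal sketch of how they might fit together. It is only one reading of the proposal, not an existing `os.path` API: the function body, the helper names and the choice of `surrogatepass` for encoding are all assumptions, and it covers only the two checks spelled out above (control characters and "." / "..").

```python
import unicodedata

PERMIT_RELATIVE = 1  # flag values as listed in the proposal
PERMIT_CTRL = 2


def sanitizepart(name, replace=None, escape=None, prefix=None, flags=0):
    """Hypothetical sketch of the proposed function; not a real os.path API."""
    if replace is not None and escape is not None:
        raise ValueError("replace and escape are mutually exclusive")

    def is_invalid_char(ch):
        # General category "C" covers Cc, Cf, Cs, Co and Cn.
        return not (flags & PERMIT_CTRL) and unicodedata.category(ch).startswith("C")

    def is_invalid_name(value):
        return not (flags & PERMIT_RELATIVE) and value in (".", "..")

    def fix_char(ch):
        if replace is not None:
            return replace
        if escape is not None:
            # Assumption: surrogatepass lets lone surrogates be escaped too.
            octets = ch.encode("utf-8", "surrogatepass")
            return "".join(f"{escape}{octet:02X}" for octet in octets)
        raise ValueError(f"invalid character {ch!r} in {name!r}")

    result = "".join(fix_char(ch) if is_invalid_char(ch) else ch for ch in name)

    if is_invalid_name(result):
        if prefix is not None:
            result = prefix + result
        elif replace is not None or escape is not None:
            result = fix_char(result[0]) + result[1:]
        else:
            raise ValueError(f"invalid name {name!r}")

    # After applying any transformations, the result must itself be valid.
    if is_invalid_name(result) or any(is_invalid_char(ch) for ch in result):
        raise ValueError(f"cannot sanitize {name!r}")
    return result
```

Under this reading, `sanitizepart("a\nb", replace="_")` gives `"a_b"`, `sanitizepart("..", prefix="_")` gives `"_.."`, and a plain `sanitizepart("..")` raises `ValueError`.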


Steve Jorgensen

May 2020

8:50 a.m.

Steve Jorgensen wrote: <snip>

...

When escape is supplied (typically "%") it is used as the escape character in the same way that "%" is used in URL encoding. When a non-ASCII character is escaped, it is represented as a sequence of encoded bytes/octets.

I neglected to say that the octet sequence would be for the UTF-8 representation of the non-ASCII character. This is consistent with ECMAScript's `encodeURI` (see https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3). Also, to clarify why this is needed, it is for when there are non-ASCII control characters such as \u2066 (Left-to-Right Isolate) in the given name value and control characters are not being allowed. Other non-ASCII Unicode characters are permitted, so this is not applicable to those.
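As a small illustration of that clarification, the standard library's `urllib.parse.quote` already produces the UTF-8 octet sequence for a non-ASCII character. The snippet below only shows what the escaped form of U+2066 looks like; the variable names are illustrative, and whether the proposed function would reuse `quote` is not something the proposal states.

```python
import unicodedata
from urllib.parse import quote

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE, the example character from the message

category = unicodedata.category(LRI)  # 'Cf', part of general category "C"
octets = LRI.encode("utf-8")          # three UTF-8 octets
escaped = quote(LRI, safe="")         # one %XX group per octet

print(category)  # Cf
print(octets)    # b'\xe2\x81\xa6'
print(escaped)   # %E2%81%A6
```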

Steve Jorgensen

4:39 p.m.

Steve Jorgensen wrote:

...

Based on responses to my previous proposal, I am convinced that it was over-ambitious and not appropriate for inclusion in the Python standard library, so starting over with a more narrowly scoped suggestion.

Proposal:

Add a new function (possibly os.path.sanitizepart) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters.

When an invalid character is encountered, then ValueError will be raised in the default case, or the character may be replaced or escaped.

When an invalid name is encountered, then ValueError will be raised in the default case, or the first character may be replaced, escaped, or prefixed.

Control characters (those in the Unicode general category of "C") are treated as invalid by default.

After applying any transformations, if the result would still be invalid, then an exception is raised.

Proposed function signature: sanitizepart(name, replace=None, escape=None, prefix=None, flags=0)

When replace is supplied, it is used as a replacement for any invalid characters or for the first character of an invalid name. When prefix is not also supplied, this is also used as the replacement for the first character of the name if it is invalid, not simply due to containing invalid characters.

When escape is supplied (typically "%") it is used as the escape character in the same way that "%" is used in URL encoding. When a non-ASCII character is escaped, it is represented as a sequence of encoded bytes/octets. When prefix is not also supplied, this is also used to escape the first character of the name if it is invalid, not simply due to containing invalid characters.

replace and escape are mutually exclusive.

When prefix is supplied (typically "_"), it is prepended to the name if it is invalid, not simply due to containing invalid characters.

Flags:

- path.PERMIT_RELATIVE (1): Permit relative path values ("." "..")
- path.PERMIT_CTRL (2): Permit characters in the Unicode general category of "C".

Somewhere between the 1st and 2nd proposal, I lost track of the system-specificity issue. Even with this more focused proposal, there is the issue of different path separators on Windows vs *nix, so the function needs another argument for that. Presumably, it would have a default of `None` meaning to use the current platform and would have constants for `NIX`, `WIN`, and `GENERAL` where `WIN` and `GENERAL` behave the same, recognizing either "/" or "\" as a file separator character.
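A rough sketch of what that extra argument could look like follows. The constant values, the `contains_separator` helper and the use of `os.name` to pick the default are all assumptions made for illustration; nothing here is settled by the proposal.

```python
import os

# Hypothetical constants; the proposal names them but does not define values.
NIX, WIN, GENERAL = "nix", "win", "general"

# WIN and GENERAL behave the same: both "/" and "\" count as separators.
_SEPARATORS = {NIX: "/", WIN: "/\\", GENERAL: "/\\"}


def contains_separator(name, system=None):
    """Return True if name contains a path separator for the given system."""
    if system is None:
        # None means "use the current platform".
        system = WIN if os.name == "nt" else NIX
    return any(sep in name for sep in _SEPARATORS[system])
```

With these definitions, `contains_separator("a\\b", NIX)` is false while `contains_separator("a\\b", WIN)` is true, which is the platform difference the extra argument is meant to capture.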

Andrew Barnert

5:09 p.m.

On May 11, 2020, at 00:40, Steve Jorgensen <stevej@stevej.name> wrote:

...

Proposal:

Add a new function (possibly `os.path.sanitizepart`) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters.

“Also” in addition to what? Are there other requirements enforced besides these two that aren’t specified anywhere?

If not: the result can contain the path separator, illegal characters that aren’t control characters, nonprinting characters that aren’t control characters, and characters whose bytes (in the filesystem’s encoding) are ASCII control characters?

And it can be a reserved name, or even something like C:; as long as it’s not the Unix . or ..?

What’s the use case where you need to sanitize these things but nothing else? As I said on the previous proposal, I have had a variety of times where I needed to sanitize filenames, but I don’t think this would have been what I wanted for _any_ of them, much less for most.

Are there existing tools, libraries, recommendations, etc. that this is based on, or is it just an educated guess at what’s important? For something that’s meant to go into the stdlib with a name that strongly implies “if you use this, you’re safe from stupid or malicious filenames”, it would be misleading, and possibly dangerous, if it didn’t actually make you safe because it didn’t catch common mistakes/exploits that everyone else considers important to catch. And without any cites to what everyone else considers important, why should anyone trust that this proposal isn’t missing, or getting wrong, anything critical?

Why isn’t this also available in pathlib? Is it the kind of thing you don’t envision high-level pathlib-style code ever needing to do, only low-level os-style code?

...

When `replace` is supplied, it is used as a replacement for any invalid characters or for the first character of an invalid name. When `prefix` is not also supplied, this is also used as the replacement for the first character of the name if it is invalid, not simply due to containing invalid characters.

What’s the use case for separate prefix and replace? Or just for prefix in the first place?

...

When `escape` is supplied (typically "%") it is used as the escape character in the same way that "%" is used in URL encoding.

Why allow other escape strings? Has anyone ever wanted URL-encoding but with some other string in place of %, in this or any other context?

The escape character is not itself escaped?

More generally, what’s the use case for %-encoding filenames like this? Are people expecting it to interact transparently with URLs, so if I save a file “spam\0eggs” in a Python script and then try to browse to “file:///spam\0eggs” in a browser, the browser will convert the \0 character to %00 the same way my Python script did and therefore find the file? If so, doesn’t it need to escape all the same characters that URLs do, not a different set? If not, isn’t using something similar to URL-encoding but not identical just going to confuse people rather than help them?

What happens if you supply a string longer than one character as escape? Or replace or prefix, for that matter?

Overall, it seems like there is a problem to be solved, but I don’t see any reason to be confident that this is the solution for anyone, and if it’s not the solution for _most_ people, adding it to the stdlib will just mean people don’t search for and find the right one, all the while misleading themselves into thinking they’re safe when they’re not, which will make the overall problem worse, not better.

Wes Turner

7:54 p.m.

What does sanitizepart do with newlines \n \r \r\n in filenames? Are these control characters?

What does sanitizepart do with a leading slash?

assert os.path.join("a", "/b") == "/b"

A new safejoin() or joinsafe() or join(safe='True') could call sanitizepart() such that:

assert joinsafe("a\n", "/b") == "a\\n/b"

On Mon, May 11, 2020, 1:11 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

...

On May 11, 2020, at 00:40, Steve Jorgensen <stevej@stevej.name> wrote:

...
Proposal:

Add a new function (possibly `os.path.sanitizepart`) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters.

“Also” in addition to what? Are there other requirements enforced besides these two that aren’t specified anywhere?

If not: the result can contain the path separator, illegal characters that aren’t control characters, nonprinting characters that aren’t control characters, and characters whose bytes (in the filesystem’s encoding) are ASCII control characters?

And it can be a reserved name, or even something like C:; as long as it’s not the Unix . or ..?

What’s the use case where you need to sanitize these things but nothing else? As I said on the previous proposal, I have had a variety of times where I needed to sanitize filenames, but I don’t think this would have been what I wanted for _any_ of them, much less for most.

Are there existing tools, libraries, recommendations, etc. that this is based on, or is it just an educated guess at what’s important? For something that’s meant to go into the stdlib with a name that strongly implies “if you use this, you’re safe from stupid or malicious filenames”, it would be misleading, and possibly dangerous, if it didn’t actually make you safe because it didn’t catch common mistakes/exploits that everyone else considers important to catch. And without any cites to what everyone else considers important, why should anyone trust that this proposal isn’t missing, or getting wrong, anything critical?

Why isn’t this also available in pathlib? Is it the kind of thing you don’t envision high-level pathlib-style code ever needing to do, only low-level os-style code?

...
When `replace` is supplied, it is used as a replacement for any invalid characters or for the first character of an invalid name. When `prefix` is not also supplied, this is also used as the replacement for the first character of the name if it is invalid, not simply due to containing invalid characters.

What’s the use case for separate prefix and replace? Or just for prefix in the first place?

...
When `escape` is supplied (typically "%") it is used as the escape character in the same way that "%" is used in URL encoding.

Why allow other escape strings? Has anyone ever wanted URL-encoding but with some other string in place of %, in this or any other context?

The escape character is not itself escaped?

More generally, what’s the use case for %-encoding filenames like this? Are people expecting it to interact transparently with URLs, so if I save a file “spam\0eggs” in a Python script and then try to browse to “file:///spam\0eggs” in a browser, the browser will convert the \0 character to %00 the same way my Python script did and therefore find the file? If so, doesn’t it need to escape all the same characters that URLs do, not a different set? If not, isn’t using something similar to URL-encoding but not identical just going to confuse people rather than help them?

What happens if you supply a string longer than one character as escape? Or replace or prefix, for that matter?

Overall, it seems like there is a problem to be solved, but I don’t see any reason to be confident that this is the solution for anyone, and if it’s not the solution for _most_ people, adding it to the stdlib will just mean people don’t search for and find the right one, all the while misleading themselves into thinking they’re safe when they’re not, which will make the overall problem worse, not better.


Andrew Barnert

9:40 p.m.

On May 11, 2020, at 12:54, Wes Turner <wes.turner@gmail.com> wrote:

...

What does sanitizepart do with newlines \n \r \r\n in filenames? Are these control characters?

>>> import unicodedata
>>> unicodedata.category('\n')
'Cc'
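The same check can be extended to the other newline characters from the question. The helper below is only an illustrative sketch (the name `control_chars` is made up here); it lists every character of a name whose general category starts with "C", which is the rule the proposal uses for "control characters".

```python
import unicodedata


def control_chars(name):
    """Return the characters of name whose general category starts with "C"."""
    return [ch for ch in name if unicodedata.category(ch).startswith("C")]


# Category of each character asked about, plus tab for comparison.
categories = {ch: unicodedata.category(ch) for ch in "\n\r\t"}

print(categories)               # {'\n': 'Cc', '\r': 'Cc', '\t': 'Cc'}
print(control_chars("a\r\nb"))  # ['\r', '\n']
```

So under the proposal's default rule, "\n", "\r" and "\r\n" would all be treated as invalid.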

Barry Scott

7:59 p.m.

...

On 11 May 2020, at 18:09, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

More generally, what’s the use case for %-encoding filenames like this? Are people expecting it to interact transparently with URLs, so if I save a file “spam\0eggs” in a Python script and then try to browse to “file:///spam\0eggs” in a browser, the browser will convert the \0 character to %00 the same way my Python script did and therefore find the file?

No.

The \0 can never be part of a valid file in Unix, macOS or Windows.

Barry

Andrew Barnert

9:38 p.m.

On May 11, 2020, at 12:59, Barry Scott <barry@barrys-emacs.org> wrote:

...

...
On 11 May 2020, at 18:09, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

More generally, what’s the use case for %-encoding filenames like this? Are people expecting it to interact transparently with URLs, so if I save a file “spam\0eggs” in a Python script and then try to browse to “file:///spam\0eggs” in a browser, the browser will convert the \0 character to %00 the same way my Python script did and therefore find the file?

No.

The \0 can never be part of a valid file in Unix, macOS or Windows.

Of course. Which is exactly the kind of thing this sanitize function is meant for. Hence my question: if my Python script is sanitizing all filenames with this function with escape='%', is the expectation that it’ll actually give me something that can be used if I paste the same thing into a browser and let it url-escape a file URL? If so, will that actually work? If not, what _is_ the intended use for this option?

Barry Scott

8:32 a.m.

...

On 11 May 2020, at 22:38, Andrew Barnert <abarnert@yahoo.com> wrote:

On May 11, 2020, at 12:59, Barry Scott <barry@barrys-emacs.org> wrote:

...
...
On 11 May 2020, at 18:09, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:

More generally, what’s the use case for %-encoding filenames like this? Are people expecting it to interact transparently with URLs, so if I save a file “spam\0eggs” in a Python script and then try to browse to “file:///spam\0eggs” in a browser, the browser will convert the \0 character to %00 the same way my Python script did and therefore find the file?

No.

The \0 can never be part of a valid file in Unix, macOS or Windows.

Of course. Which is exactly the kind of thing this sanitize function is meant for.

Hence my question: if my Python script is sanitizing all filenames with this function with escape='%', is the expectation that it’ll actually give me something that can be used if I paste the same thing into a browser and let it url-escape a file URL? If so, will that actually work? If not, what _is_ the intended use for this option?

I misunderstood; I thought you were saying that using escaping allowed bad chars to work.

Barry

Steven D'Aprano

10:52 a.m.

On Mon, May 11, 2020 at 08:59:42PM +0100, Barry Scott wrote:

...

The \0 can never be part of a valid file in Unix, macOS or Windows.

There are a few file systems which accept NULs in file names, such as HFS and HFS+ and (I think) Joliet. HFS+ volumes include a special directory called the metadata directory, in the volume's root directory, called "\0\0\0\0HFS+ Private Data".

https://developer.apple.com/library/archive/technotes/tn/tn1150.html#HFSPlus...

I don't know how complete HFS+ support is on Linux or Windows, but in principle any OS that supports HFS+ or (maybe) Joliet could have files with NULs.

Remember that NULs may be legal next time you are stress testing your file IO code *wink*

--
Steven

Antoine Pitrou

1:11 p.m.

On Wed, 13 May 2020 20:52:38 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...

I don't know how complete HFS+ support is on Linux or Windows, but in principle any OS that supports HFS+ or (maybe) Joliet could have files with NULs.

Remember that NULs may be legal next time you are stress testing your file IO code *wink*

NULs may be theoretically legal on your filesystem of choice, but standard POSIX and Windows APIs don't let you pass filenames with NULs in them correctly. So the point is moot.

If you know of a system function which accepts filenames with embedded NULs (which probably means it also takes the filename length as a separate parameter), I'd be curious to know about it.

Regards

Antoine.
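For what it is worth, the same limitation is visible from Python: CPython refuses a path containing an embedded NUL before the operating system is ever asked. The helper below is just a small illustrative probe (the name `rejects_nul` is made up), not part of any proposal.

```python
import os


def rejects_nul(func):
    """Return True if func refuses a path with an embedded NUL up front."""
    try:
        func("spam\0eggs")
    except ValueError:
        # CPython raises ValueError for an embedded null before any OS call.
        return True
    except OSError:
        # The call reached the operating system, so the NUL was passed on.
        return False
    return False


print(rejects_nul(os.stat))  # True
print(rejects_nul(open))     # True
```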

Chris Angelico

1:18 p.m.

On Wed, May 13, 2020 at 11:13 PM Antoine Pitrou <solipsis@pitrou.net> wrote:

...

On Wed, 13 May 2020 20:52:38 +1000 Steven D'Aprano <steve@pearwood.info> wrote:

...
I don't know how complete HFS+ support is on Linux or Windows, but in principle any OS that supports HFS+ or (maybe) Joliet could have files with NULs.

Remember that NULs may be legal next time you are stress testing your file IO code *wink*

NULs may be theoretically legal on your filesystem of choice, but standard POSIX and Windows APIs don't let you pass filenames with NULs in them correctly. So the point is moot.

If you know of a system function which accepts filenames with embedded NULs (which probably means it also takes the filename length as a separate parameter), I'd be curious to know about it.

I'm very curious to know if the ancient MS-DOS functions that take FCBs would be able to handle NULs, since they work with a fixed length filename.

ChrisA

Eryk Sun

3:31 p.m.

On 5/13/20, Antoine Pitrou <solipsis@pitrou.net> wrote:

...

If you know of a system function which accepts filenames with embedded NULs (which probably means it also takes the filename length as a separate parameter), I'd be curious to know about it.

Windows is layered over the base NT system, which uses counted strings and a root object namespace that reserves only the path separator, backslash. Null characters are allowed, at least as far as the object manager cares, but using them is a bad idea, if only because such names aren't generally accessible in Windows. But let's look at an example just for kicks.

When the object manager parses a path up to a Device object (e.g. "\Device\NamedPipe"), the I/O manager takes over parsing the remaining path, which calls the device driver's IRP_MJ_CREATE routine with the remaining path. Whether or not a name with nulls is allowed depends on the device driver -- or a filesystem driver if the device is mounted. Almost all filesystem drivers reject a component name that contains nulls as invalid.

One exception is the named-pipe filesystem (NPFS). NPFS doesn't disallow any characters. It even allows backslash in pipe names since it doesn't support subdirectories, and if you check via os.listdir('//./pipe'), you should see several Winsock pipes with backslash in their name.

Creating a pipe with nulls in its name is impossible via WINAPI CreateNamedPipeW. It requires native NtCreateNamedPipeFile, with the name passed in an OBJECT_ATTRIBUTES record [1]. This system function is undocumented, but just to show that it's possible in principle, I created a pipe named "spam\x00eggs". We can query the name via GetFileInformationByHandleEx: FileNameInfo [2], which returns a counted string:

>>> GetFileInformationByHandleEx(h, FileNameInfo)
'\\spam\x00eggs'

The name is in the root path of the device, but we don't get the fully-qualified name "\\Device\\NamedPipe\\spam\x00eggs". WINAPI GetFinalPathNameByHandleW [3] can figure this out, at least for the native NT path (from NtQueryObject). However, it works with null-terminated strings, so the pipe name gets truncated as "spam":

>>> flags = VOLUME_NAME_NT | FILE_NAME_OPENED
>>> GetFinalPathNameByHandle(h, flags)
'\\Device\\NamedPipe\\spam'

[1]: https://docs.microsoft.com/en-us/windows/win32/api/ntdef/ns-ntdef-_object_at...
[2]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_n...
[3]: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfin...

Steve Jorgensen

8:58 p.m.

Andrew Barnert wrote:

...

On May 11, 2020, at 00:40, Steve Jorgensen stevej@stevej.name wrote:

...
Proposal: Add a new function (possibly os.path.sanitizepart) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters. “Also” in addition to what? Are there other requirements enforced besides these two that aren’t specified anywhere?

Sorry that was not clear. In addition to ensuring that it is a single part, meaning that it contains no path separators.

Steve Jorgensen

9:12 p.m.

Andrew Barnert wrote:

...

On May 11, 2020, at 00:40, Steve Jorgensen stevej@stevej.name wrote:

...
Proposal: Add a new function (possibly os.path.sanitizepart) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters. <snip> If not: the result can contain the path separator, illegal characters that aren’t control characters, nonprinting characters that aren’t control characters, and characters whose bytes (in the filesystem’s encoding) are ASCII control characters? And it can be a reserved name, or even something like C:; as long as it’s not the Unix . or ..?

Are there non-printing characters outside of those in the Unicode general category of "C" that make sense to omit? There are combining characters and such that do not have glyphs but are visible in the sense that they modify the glyphs displayed for the characters that they combine with. Regarding names like "C:", you are absolutely right to point that out. When the platform is Windows, certainly, "<letter>:" should not be allowed, and perhaps colon should not be allowed at all. I'll need to research that a bit. This matters because if the path part is used without explicit "./" prefixed to it, then it will refer to a root path, so same problem as allowing a name starting with "/" in *NIX. That should be unconditionally disallowed in the case of WIN or GENERAL systems.

Andrew Barnert

10:04 p.m.

...

On May 11, 2020, at 14:18, Steve Jorgensen <stevej@stevej.name> wrote:

Andrew Barnert wrote:

...
...
On May 11, 2020, at 00:40, Steve Jorgensen stevej@stevej.name wrote: Proposal: Add a new function (possibly os.path.sanitizepart) to sanitize a value for use as a single component of a path. In the default case, the value must also not be a reference to the current or parent directory ("." or "..") and must not contain control characters. <snip> If not: the result can contain the path separator, illegal characters that aren’t control characters, nonprinting characters that aren’t control characters, and characters whose bytes (in the filesystem’s encoding) are ASCII control characters? And it can be a reserved name, or even something like C:; as long as it’s not the Unix . or ..?

Are there non-printing characters outside of those in the Unicode general category of "C" that make sense to omit?

Off the top of my head, everything in the Z category (like U+2029 PARAGRAPH SEPARATOR) is non-printable, and makes sense to sanitize.

Meanwhile, what about invalid characters being smuggled through str by surrogateescape? I don’t know if those are printable, or what category they are… or whether you want to sanitize them, for that matter, so I have no idea if this rule does the right thing or not.

More generally, we shouldn’t be relying on what respondents know off the top of their heads in the first place for something that people are going to rely on for security/safety purposes.
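The category questions raised here can at least be checked directly with `unicodedata`. The snippet is only a lookup, not a recommendation about what to sanitize; the variable names are illustrative.

```python
import unicodedata

para_sep = "\u2029"  # PARAGRAPH SEPARATOR
# An undecodable byte smuggled into str by the surrogateescape error handler.
smuggled = b"\xff".decode("utf-8", errors="surrogateescape")

print(unicodedata.category(para_sep))  # Zp
print(unicodedata.category(" "))       # Zs
print(ascii(smuggled))                 # '\udcff'
print(unicodedata.category(smuggled))  # Cs
```

So a rule based on general category "C" would catch surrogate-escaped bytes (category Cs) but not U+2029 (category Zp). Note also that the Z category includes the ordinary space (Zs), so rejecting all of Z would reject names containing spaces.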

...

Regarding names like "C:", you are absolutely right to point that out. When the platform is Windows, certainly, "<letter>:" should not be allowed, and perhaps colon should not be allowed at all. I'll need to research that a bit. This matters because if the path part is used without explicit "./" prefixed to it, then it will refer to a root path,

The name `C:spam` means spam in the current directory for the C drive—which isn’t the same as the current working directory unless C is the current working drive, but it’s definitely not (in general) the same as the root. And what about all the other questions I asked? Most importantly, you need to clarify what the use case is, and why this proposal meets it. Otherwise, it sounds more like a trap to make people think their code is safe when it isn’t, not a fix for the real problem.
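The `C:spam` behaviour described above can be seen from any platform by using `ntpath`, the Windows flavour of `os.path`. This only shows how the standard library parses such a name; the "uploads" directory is a made-up example.

```python
import ntpath  # Windows path rules, importable on any platform

drive, tail = ntpath.splitdrive("C:spam")
is_absolute = ntpath.isabs("C:spam")
joined = ntpath.join("uploads", "C:spam")

print(drive, tail)  # C: spam
print(is_absolute)  # False
print(joined)       # C:spam
```

The name is drive-relative rather than absolute, yet joining it onto another directory still discards that directory, which is why a single path component containing a drive prefix is a problem.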

Oleg Broytman

10:48 p.m.

On Mon, May 11, 2020 at 09:12:52PM -0000, Steve Jorgensen <stevej@stevej.name> wrote:

...

When the platform is Windows, certainly, "<letter>:" should not be allowed, and perhaps colon should not be allowed at all.

https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file

Forbidden characters:

chr(0) < > : " / \ | ? *

characters in range from chr(1) through chr(31),

a space or a period at the end of file/directory name.

Forbidden file names (with any extensions):

CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

Oleg.
--
Oleg Broytman https://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.
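A direct transcription of the rules listed above into a check might look like the following. It is a sketch of this list only: the function name and the message strings are made up, "with any extensions" is approximated by looking at the text before the first dot, and real Windows behaviour has further subtleties.

```python
# chr(0) through chr(31) plus the punctuation listed above.
FORBIDDEN_CHARS = set('<>:"/\\|?*') | {chr(i) for i in range(32)}

RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_name_problems(name):
    """Return a list of reasons why name breaks the rules listed above."""
    problems = []
    bad = sorted(set(name) & FORBIDDEN_CHARS)
    if bad:
        problems.append(f"forbidden characters: {bad!r}")
    if name.endswith((" ", ".")):
        problems.append("ends with a space or a period")
    # Approximation of "with any extensions": compare the part before the first dot.
    if name.split(".", 1)[0].upper() in RESERVED_NAMES:
        problems.append("reserved device name")
    return problems


print(windows_name_problems("report.txt"))  # []
print(windows_name_problems("nul.txt"))     # ['reserved device name']
print(windows_name_problems("a:b"))         # ["forbidden characters: [':']"]
```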

Eryk Sun

8:11 a.m.

On 5/11/20, Oleg Broytman <phd@phdru.name> wrote:

...

On Mon, May 11, 2020 at 09:12:52PM -0000, Steve Jorgensen <stevej@stevej.name> wrote:

...
When the platform is Windows, certainly, "<letter>:" should not be allowed, and perhaps colon should not be allowed at all.

The meaning of "<letter>:name" is context dependent. If it occurs at the beginning of a path, it's relative to the working directory on drive "<letter>:", which defaults to the root directory on the drive. For example, if the working directory on drive "X:" is "X:\spam\eggs", then "X:foo" resolves to "X:\spam\eggs\foo". "X:foo" in this context is not a valid component name; it's actually a filepath.

Otherwise "<letter>:" is part of an NTFS or ReFS stream path, where ":" is the stream delimiter. To be valid, it needs to be followed by either the name of the stream or the name plus the type, e.g. "filename:streamname" or "filename:streamname:streamtype". Should file streams be supported?

More on File Streams

An open or create will fail as an invalid filename if it uses invalid stream syntax or references a stream type that's unknown, or if the filesystem doesn't support streams and disallows colon in filenames (e.g. FAT32).

The stream name can be empty to indicate an anonymous or default stream, but only if the stream type is specified. For example, in NTFS "filename::$DATA" is the anonymous data stream in a file named "filename". For a regular data file, it's the same as just accessing "filename".

A directory can have named data streams, but it cannot have an anonymous data stream. The default stream in a directory is an index stream named "$I30". The following are equivalent names for a directory in NTFS: "dirname", "dirname::$INDEX_ALLOCATION", and "dirname:$I30:$INDEX_ALLOCATION". But "dirname:$I30" doesn't work because the default stream type is $DATA.

To access a stream in a single-letter filename relative to the current directory, the current directory has to be referenced explicitly via the "." component. For example, "./C:spam" is a stream named "spam" in a file named "C" that's in the current working directory, but "C:spam" is a file named "spam" in the working directory on drive "C:".

...

Forbidden characters:

chr(0) < > : " / \ | ? *

characters in range from chr(1) through chr(31),

See the above discussion regarding ":". An NTFS stream name can include any character except for nul (0), colon, backslash, and slash.

The characters *?"<> are the 5 wildcard characters that almost all NT filesystems disallow in filenames. These are important to disallow because the filesystem driver (in the kernel) is expected to support filtering a directory listing with a wildcard pattern. NT's * and ? wildcards have Unix shell semantics. The other three are DOS_DOT ("), DOS_STAR (<), and DOS_QM (>), which help to emulate MS-DOS behavior.

The vertical bar or pipe (|) has no significance in filepaths, but it's a special shell character that's usually disallowed in filenames. Control characters 1-31 usually are also disallowed. That said, some non-Microsoft filesystems may allow these characters. For example, the VirtualBox shared-folder filesystem allows pipe and control characters in filenames.

...

a space or a period at the end of file/directory name.

Trailing spaces and dots are stripped from the final path component in almost all contexts. The exception is "\\?\" device paths, which are never normalized in an open or create context. For example, creating "\\?\C:\Temp\spam. . . " will name the file "spam. . . " instead of the normal name "spam". The name "spam. . . " will appear in the directory listing, but opening it will require using a "\\?\" device path.

...

Forbidden file names (with any extensions):

CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.

In an attempt to replicate how MS-DOS implemented devices, Windows reserves DOS device names such as "NUL" in the final component of DOS drive-letter paths and relative paths. They are not reserved in the final component of UNC and device paths, though a server may disallow them by policy, as Microsoft's SMB server does.

Matching the device name ignores everything after a trailing colon or dot that follows the name with 0 or more intervening spaces. This is more than ignoring an extension, which is typically taken as the characters following the last dot in a filename.

"CONIN$" and "CONOUT$" are mistakenly excluded from the documented list of reserved DOS device names. Windows has always reserved them as unqualified relative names in a create/open context. Starting with Windows 8, they're reserved exactly the same as the classic DOS device names.

Examples with trailing dots and spaces:

>>> os.getcwd()
'C:\\'
>>> nt._getfullpathname('spam. . . ')
'C:\\spam'
>>> nt._getfullpathname('foo/spam. . . ')
'C:\\foo\\spam'

DOS devices:

>>> nt._getfullpathname('conin$:spam.eggs')
'\\\\.\\conin$'
>>> nt._getfullpathname('foo/conin$ .spam.eggs')
'\\\\.\\conin$'

Non-final component:

>>> nt._getfullpathname('spam. . . /foo')
'C:\\spam. . . \\foo'
>>> nt._getfullpathname('conin$/foo')
'C:\\conin$\\foo'

UNC and device paths:

>>> nt._getfullpathname('//server/share/conin$')
'\\\\server\\share\\conin$'
>>> nt._getfullpathname('//./C:/conin$')
'\\\\.\\C:\\conin$'
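The device-name matching rule described above (name, then optional spaces, then an ignored suffix starting at a colon or dot) can be approximated with a regular expression. This is only a rough sketch for a single final path component: the function name is made up, and actual behaviour differs between Windows versions and path types as noted above.

```python
import re

_DOS_DEVICE = re.compile(
    r"(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9]|CONIN\$|CONOUT\$)"  # device name
    r" *"                                                    # optional spaces
    r"([.:].*)?",                                            # ignored suffix
    re.IGNORECASE | re.DOTALL,
)


def looks_like_dos_device(component):
    """Return True if a final path component would match a DOS device name."""
    return _DOS_DEVICE.fullmatch(component) is not None


print(looks_like_dos_device("conin$:spam.eggs"))   # True
print(looks_like_dos_device("conin$ .spam.eggs"))  # True
print(looks_like_dos_device("null"))               # False
```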

