Sanitize filename (path part)

I believe the Python standard library should include a means of sanitizing a filesystem entry, and this should not be something requiring a 3rd party package.

One of the reasons I think this should be in the standard lib is that it provides a common, simple means for code reviewers and static analysis services such as Veracode to recognize that a value is sanitized in an accepted manner.

What I am envisioning is a function (presumably in `os.path`) with a signature roughly like

{{{
sanitizepart(name, permissive=False, mode=ESCAPE, system=None)
}}}

When `permissive` is `False`, characters that are generally unsafe are rejected. When `permissive` is `True`, only path separator characters are rejected. Generally unsafe characters besides path separators would include things like a leading ".", any non-printing character, any wildcard, piping and redirection characters, etc.

The `mode` argument indicates what to do with unacceptable characters: escape them (`ESCAPE`), omit them (`OMIT`), or raise an exception (`RAISE`). It could also double as an escape-character argument when a string is given. The default escape character should probably be "%" (same as URL encoding).

The `system` argument accepts a combination of bit flags indicating which operating system's rules to apply, or `None` meaning to use rules for the current platform. Systems would probably include `SYS_POSIX`, `SYS_WIN`, and `SYS_MISC`, where miscellaneous means to enforce rules for all commonly used systems. One example of a distinction is that on a POSIX system, backslash characters are not path separators, but on Windows, both forward and backward slashes are path separators.

{{{
from os import path

print(repr(path.sanitizepart('/ABC\\QRS%', system=path.SYS_WIN)))
# => '%2fABC%5cQRS%%'

print(repr(path.sanitizepart('/ABC\\QRS%', True, mode=path.OMIT, system=path.SYS_POSIX)))
# => 'ABC\\QRS%'

print(repr(path.sanitizepart('../AB&CD*\x01\n', system=path.SYS_POSIX)))
# => '%2e.%2fAB%26CD%2a%01%0a'

print(repr(path.sanitizepart('../AB&CD*\x01\n', True, system=path.SYS_POSIX)))
# => '..%2fAB&CD*\x01\n'
}}}
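For concreteness, here is a minimal, purely hypothetical sketch of what the escape mode might do, assuming UTF-8 %XX escapes and an illustrative (not vetted) unsafe-character set; `sanitizepart` does not exist in the stdlib, and every name below is a placeholder:

{{{
# Hypothetical sketch only -- not a proposed implementation, and the "unsafe"
# set below is illustrative rather than a vetted list.  Unlike the examples
# above, this sketch escapes the escape character itself as %25 so results
# are unambiguous.
def sanitizepart_sketch(name, permissive=False, escape='%', separators='/'):
    unsafe = set(separators) | {escape}
    if not permissive:
        # wildcards, piping/redirection characters, quotes, and control characters
        unsafe |= set('*?|<>&;"\'') | {chr(c) for c in range(0x20)}
    out = []
    for i, ch in enumerate(name):
        if ch in unsafe or (not permissive and i == 0 and ch == '.'):
            # %XX-escape each UTF-8 byte of the rejected character
            out.append(''.join('%{:02x}'.format(b) for b in ch.encode('utf-8')))
        else:
            out.append(ch)
    return ''.join(out)

print(sanitizepart_sketch('../AB&CD*\x01\n'))
# -> %2e.%2fAB%26CD%2a%01%0a
}}}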


Steve Jorgensen wrote:
More existing work:
* https://pypi.org/project/sanitize-filename/
* http://detox.sourceforge.net/
* https://sourceforge.net/p/glindra/news/2005/08/glindra-rename--lower--portab...

On May 9, 2020, at 17:35, Steve Jorgensen <stevej@stevej.name> wrote:
I believe the Python standard library should include a means of sanitizing a filesystem entry, and this should not be something requiring a 3rd party package.
One of reasons I think this should be in the standard lib is because that provides a common, simple means for code reviewers and static analysis services such as Veracode to recognize that a value is sanitized in an accepted manner.
This does seem like a good idea. People who do this themselves get it wrong all the time, occasionally with disastrous consequences, so if Python can solve that, that would be great.

But, at least historically, this has been more complicated than what you’re suggesting here. For example, don’t you have to catch things like directories named “Con” or files whose 8.3 representation has “CON” as the 8 part? I don’t think you can hang an entire Windows system by abusing those anymore, but you can still produce filenames that some APIs, and some tools (possibly including Explorer, cmd, powershell, Cygwin, mingw/native shells, Python itself…) can’t access (or can only access if the user manually specified a \\.\ absolute path, or whatever).

Is there an established algorithm/rule that lots of people in the industry trust that Python can just reference, instead of having to research or invent it? Because otherwise, we run the risk of making things worse instead of better.
Maybe it would make more sense to put this in pathlib. Then you construct a PurePath of the appropriate type, and call sanitize() on it (maybe with a flag that ensures that it’s a single path component if you expected it to be one). I think some, but not all, of this logic already exists in pathlib.
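For reference, pathlib already captures some of these per-system rules today; in the sketch below only the final `sanitize()` call is hypothetical:

{{{
from pathlib import PurePosixPath, PureWindowsPath

# Existing behaviour: the Pure* classes already know each system's separators
# and (for Windows) the reserved DOS device names.
print(PureWindowsPath('docs\\con.txt').parts)    # ('docs', 'con.txt') -- '\' is a separator
print(PurePosixPath('docs\\con.txt').parts)      # ('docs\\con.txt',)  -- '\' is an ordinary character
print(PureWindowsPath('con.txt').is_reserved())  # True (deprecated in the newest versions, but it exists)
print(PurePosixPath('con.txt').is_reserved())    # False

# Hypothetical, per the suggestion above -- no such method exists today:
# PureWindowsPath(user_value).sanitize(single_component=True)
}}}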
When `permissive` is `False`, characters that are generally unsafe are rejected. When `permissive` is `True`, only path separator characters are rejected. Generally unsafe characters besides path separators would include things like a leading ".", any non-printing character, any wildcard, piping and redirection characters, etc.
I think neither of these is what I’d usually want. I never want to sanitize just pathsep characters without sanitizing all illegal characters. I do often want to sanitize all illegal characters (just \0 and the path sep on POSIX, a larger set that I don’t know by heart on Windows). I don’t think I’ve ever wanted to sanitize the set of potentially-unsafe characters you’re proposing here.

I have wanted to sanitize (or pop up an “are you sure?” dialog, etc.) a wider range of potentially confusing characters. For example, newlines or Unicode separators can be very confusing in filenames. I’ve used one of those “potentially misleading URL” libs for this even though files and URLs aren’t quite the same and it was definitely overzealous, but if I’m not really confident that someone has thought through the details and widely vetted them, I’d rather have overzealous than underzealous for something like this.

Meanwhile, on POSIX, it’s actually bytes rather than characters that are illegal. Any character that, in the filesystem’s encoding, would contain a \0 or \x2f byte is therefore illegal. Of course in UTF-8, the only such characters are NUL and /, so in scripts I write for my own use on my own systems where I know all the filesystems are UTF-8 I don’t worry about this. But something meant for hardening/verification tools seems like it needs to meet a higher standard and work on more varied systems. And I don’t know how you could even apply the right rule without knowing what the file system encoding is (which means you need the full path, not just the component to be checked) or requiring bytes rather than str (but then it doesn’t work for Windows, and resolving that whole mess gets extra fun, and even on POSIX it’s a lot less common to use).

Speaking of encodings and Windows, isn’t any character not in the user’s OEM code page likely to be confusing? Sure, it’ll work with other Python 3.8 scripts, but it’ll crash or do the wrong thing or display mojibake when used with lots of other tools.
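A byte-level version of that POSIX check might look like the sketch below; note that it bakes in exactly the assumption being questioned here, namely that os.fsencode's idea of the filesystem encoding matches the filesystem the name will actually land on:

{{{
import os

def has_illegal_posix_bytes(name):
    """Check the *encoded* name for NUL and '/' bytes (the POSIX rule)."""
    encoded = os.fsencode(name)   # uses sys.getfilesystemencoding()
    return b'\x00' in encoded or b'/' in encoded

print(has_illegal_posix_bytes('report.txt'))  # False
print(has_illegal_posix_bytes('a/b'))         # True
}}}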
The `mode` argument indicates what to do with unacceptable characters. Escape them (`ESCAPE`), omit them (`OMIT`) or raise an exception (`RAISE`).
What’s the exception, and what attributes does it have? Usually I don’t care too much as long as the traceback/log entry/whatever is good enough for debugging, but for this function, I think I’d often want to be able to programmatically access the character(s) that triggered the error so I can tell the user. Especially if the rule isn’t a fixed, well-known one that you can describe the way Windows Explorer does when you try to use an illegal character.
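One way to make the offending characters programmatically available would be an exception along these lines; the class and its attributes are entirely hypothetical:

{{{
class UnsafeFilenameError(ValueError):
    """Hypothetical exception carrying the characters that failed sanitization."""
    def __init__(self, name, offending):
        self.name = name              # the value that was checked
        self.offending = offending    # e.g. {'/', '\x00'}
        chars = ', '.join(map(repr, sorted(offending)))
        super().__init__('unsafe characters in filename {!r}: {}'.format(name, chars))

try:
    raise UnsafeFilenameError('a/b', {'/'})
except UnsafeFilenameError as exc:
    print(exc.offending)              # {'/'} -- usable in a message to the user
}}}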
This could also double as an escape character argument when a string is given. The default escape character should probably be "%" (same as URL encoding).
But % only makes sense with a specific encoding of the escaped character, which is a totally different encoding than the one used by other escape mechanisms, so how can just an escape string select between them? If I give it \U expecting to get JSON escapes but instead get % escapes with \U in place of %, that won’t be at all useful. In fact, passing any string at all besides % won’t be at all useful, because I don’t think there’s any other escape mechanism with the same rules as %-encoding but a different escape character.

Not only that, but %-encoding doesn’t make sense with a different list of characters to be encoded than the one actually used by URLs, most obviously because % itself is not an unsafe filename character but you’d better be escaping it anyway, or the system is ridiculously easy to break/exploit.

More importantly, what would %-encoding be good for? No other program—including Finder/Explorer, native GUI apps, native shell tools, etc.—will know how to generate the same name from the same user input, much less how to convert it back to something human-readable, etc. Even browsers following file: URLs won’t be able to use these names, and in fact it’ll be pretty confusing that (a) it’s misleadingly close to URL escaping but not the same, and (b) you have to %-escape the %-escaped filename to actually get a usable file URL out of it.
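For comparison, real URL encoding escapes % itself, which is precisely what keeps it reversible:

{{{
from urllib.parse import quote, unquote

encoded = quote('50% off/sale', safe='')
print(encoded)           # 50%25%20off%2Fsale -- note that '%' becomes '%25'
print(unquote(encoded))  # 50% off/sale       -- round-trips cleanly
}}}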

Responding to points individually to avoid confusing multi-topic threads. :) Andrew Barnert wrote: < snip >
Sanitization and validation are not the same thing though. \0 is invalid and will result in an error when passed to a function that attempts to use it to reference a file, so allowing that character to pass through sanitization doesn't constitute an exploitable vulnerability. Having said that, it's usually friendlier to fail sooner rather than later, so maybe it actually does make sense for sanitization to fail for illegal characters as well as for valid, unsafe characters. Hmm. I just realized that ".." and (to a lesser extent) "." are valid path parts but are nevertheless usually not safe to allow.
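For what it's worth, CPython does fail fast on embedded NULs, before anything reaches the OS:

{{{
import os

try:
    os.stat('evil\x00name')
except ValueError as exc:
    print(exc)   # "embedded null byte" (exact wording varies by platform/version)
}}}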

Andrew Barnert wrote:
Yes. I am aware of some of the unsafe names in DOS and older Windows. As I mentioned in my other reply, there is a distinction between the ones that are merely invalid and those that are actually unsafe. In researching existing Linux tools just now, I was reminded that a leading dash is frequently unsafe because many tools will treat an argument starting with dash as an option argument.
An excellent point! I just started digging into that and found references to detox and Glindra. Neither of those seems to be well maintained though. The documentation pages for Glindra no longer exist, and detox is not in standard package repositories for CentOS later than 6 (and only in EPEL for that). Still digging.

Steve Jorgensen wrote:
Extremely apropos to the question of what characters might be problematic and/or unsafe: https://dwheeler.com/essays/fixing-unix-linux-filenames.html

Steve Jorgensen wrote:
That article links to another by the same author that is specific to vulnerabilities caused by file names. https://dwheeler.com/secure-programs/Secure-Programs-HOWTO/file-names.html

FWIW, here are some of the CWE codes for related vulnerabilities/weaknesses in implementations:

CWE-73: External Control of File Name or Path
https://cwe.mitre.org/data/definitions/73.html
CWE-707: Improper Neutralization
https://cwe.mitre.org/data/definitions/707.html
CWE-22: Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')
https://cwe.mitre.org/data/definitions/22.html

The os.path.join docs say: "[...] If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component."
https://docs.python.org/3/library/os.path.html#os.path.join

Because this behavior of os.path.join is documented, it's not a vuln in Python; it's a vuln in every downstream component that (1) uses os.path.join with user-supplied input and (2) doesn't strip a leading '/' from path parts before joining them with os.path.join.

[quoting from "part 2"] What does sanitizepart do with a leading slash?

assert os.path.join("a", "/b") == "/b"

A new safejoin() or joinsafe() or join(safe=True) could call sanitizepart() such that:

assert joinsafe("a\n", "/b") == "a\\n/b"

On Sun, May 10, 2020 at 5:36 AM Steve Jorgensen <stevej@stevej.name> wrote:
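A hypothetical joinsafe along those lines could validate each later component before delegating to os.path.join; the sketch below raises rather than escaping, which is just one of several reasonable behaviours:

{{{
import os

def joinsafe(base, *parts):
    """Hypothetical: join parts under base, refusing anything that could escape it."""
    for part in parts:
        if os.path.isabs(part):
            raise ValueError('absolute component not allowed: {!r}'.format(part))
        if os.sep in part or (os.altsep and os.altsep in part):
            raise ValueError('separator in component: {!r}'.format(part))
        if part in ('', '.', '..'):
            raise ValueError('unsafe component: {!r}'.format(part))
    return os.path.join(base, *parts)

print(joinsafe('uploads', 'report.txt'))   # uploads/report.txt (uploads\report.txt on Windows)
# joinsafe('uploads', '/b')                # -> ValueError
}}}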

(Is it almost always better to just use a hash of the provided filename (maybe in a p/a/ir/tree234 implementation to avoid the max files in a directory limit of whichever filesystem) instead of the user-supplied filename string?) On Mon, May 11, 2020 at 4:48 PM Wes Turner <wes.turner@gmail.com> wrote:
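Storing under a hash of the user-supplied name, sharded into subdirectories, might look like the sketch below; the two-level ab/cd layout is just one common convention, not the tree234 structure mentioned above, and the original name would have to be kept as metadata elsewhere:

{{{
import hashlib
import os

def hashed_storage_path(root, user_filename):
    """Hypothetical: derive a safe on-disk path from an untrusted filename."""
    digest = hashlib.sha256(user_filename.encode('utf-8')).hexdigest()
    # Shard into subdirectories to keep any single directory small.
    return os.path.join(root, digest[:2], digest[2:4], digest)

print(hashed_storage_path('/var/data', '../../etc/passwd'))
# -> /var/data/xx/yy/xxyy... where xxyy... is the hex digest
}}}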

On Sun, 10 May 2020 00:34:43 -0000 "Steve Jorgensen" <stevej@stevej.name> wrote:
I'm not disagreeing.
Okay, now I'm disagreeing. ;-) I know what sanitize means (in English and in the technical sense I believe you intend here), but can you provide some context and actual use cases? Sanitize on input so that your application code doesn't "accidentally" spit out the contents of /etc/shadow? Sanitize on output so that your code doesn't produce syntactically broken links in an HTML document or weird results in an xterm? Sanitize in both directions for safe round tripping to a database server? All of those use cases potentially require separate handling, especially in terms of quoting and escaping.

For another example, suppose I'm writing a command line utility on a POSIX system to compute a hash of the contents of a file. There's nothing wrong with ".profile" as a file name. Why are you rejecting leading "." characters? What about leading "-"s, or embedded "|"s? Yes, certain shells and shell commands can make them "difficult" to deal with in one way or another, but they're not "generally unsafe."

A very, very, very long time ago, we wrote some software for a customer who liked to "edit" our data files to make minor corrections instead of using our software. Our solution was to use "illegal" filenames that the shell rejected, but that an application could access directly anyway. I guess the point is that "sanitize" can mean different things to different parts of a system.

Dan
--
“Atoms are not things.” – Werner Heisenberg
Dan Sommers, http://www.tombstonezero.net/dan

Dan Sommers wrote:
I totally get what you're saying. For the sake of simplicity, I thought that the 2 permissiveness options should be one that only prevents path traversal and one that is extremely conservative, omitting characters that are often safe and appropriate but may be unsafe in some cases. In regard to dot files, those can be safe in some cases, but unsafe in others — writing to configuration files that will be read by shell helpers or editors, for instance.

On 5/10/2020 4:04 PM, Steve Jorgensen wrote:
I totally get what you're saying. For the sake of simplicity, I thought that the 2 permissiveness options should be one that only prevents path traversal and one that is extremely conservative, omitting characters that are often safe and appropriate but may be unsafe in some cases.
In regard to dot files, those can be safe in some cases, but unsafe in others — writing to configuration files that will be read by shell helpers or editors, for instance.
I don't see how it's realistic to come up with a version that would fit in the stdlib, especially when the stdlib itself has no need for it. It seems like this would be best on PyPI, and I understand there's already at least a few examples of that. Eric

Dan Sommers wrote: <snip>
I'm thinking of this specifically in terms of sanitizing input, assuming that later usage of the value might or might not properly protect against potential vulnerabilities. This is also limited to the case where the value is supposed to be a single path referring to an entry within a single directory context.
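For that "one entry inside one directory" case, a common belt-and-braces check, independent of any character-level sanitization, is to resolve the joined path and confirm it is still inside the base directory; a sketch:

{{{
import os

def resolve_within(base_dir, user_name):
    """Hypothetical: resolve user_name under base_dir, rejecting anything that escapes it."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, user_name))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError('{!r} escapes {!r}'.format(user_name, base_dir))
    return candidate

# resolve_within('uploads', 'notes.txt')       # OK
# resolve_within('uploads', '../etc/passwd')   # -> ValueError
}}}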

Steve Jorgensen writes:
This sounds extremely specialized to me. For example, presumably you're not referring to dotted module specifications in Python, but those usually do map to filesystem paths in implementations, and I can imagine vulnerabilities (the one on top of my head requires a fair amount of Python ignorance and environmental serendipity, which sort of proves my point about situation-specificity) using Python module paths as mapped to filesystem paths. ISTM that it might be useful to provide a toolbox for scanning paths with various validation operations, but that it's really up to applications to decide which operations to use and what parameters (eg, evil code point set, bytes vs code points vs code units vs characters), and so on. PyPI seems ideal for that, until it matures more than a discussion on the mailing lists can provide. Steve (T)

On 10 May 2020, at 01:34, Steve Jorgensen <stevej@stevej.name> wrote:
I believe the Python standard library should include a means of sanitizing a filesystem entry, and this should not be something requiring a 3rd party package.
snip

I found that I needed to have code that could tell me if a filename was valid for the OS I'm on. I'm not sure where sanitising would be useful; if invalid, I ask the user to fix it, with suitable feedback in my UI.

There is more than one problem to address.

1. Is the string valid as the path to a filename on this OS and a particular file system?
2. Does this valid path refer to a device and not a file?
3. Does this path meet the security requirements of the application?

(1) is possible to code against specs for Windows, macOS and posix for the default file system. Knowing the exact file system allows further constraints to be checked for.

(2) on posix is usually a check using stat() for the type of the file. On Windows this check is complicated by needing to know the names of all the devices and check for them. There are API calls that allow this list to be determined at runtime. And the parsing rules mean that "COM1" is an RS232 port, as is "c:\windows\com1" and "com1.txt".

(3) needs a threat model to determine which paths are considered a security risk.

Implementing (1) and (2) is doable. (3) might be possible as an API that takes a list of black-listed locations to check for.

I have code for (1) and a weak version of (2) in SCM Workbench. For (3) I have relied on file system permissions to prevent harm.

Windows version (MSDN documents the character set that is allowed):

    __filename_bad_chars_set = set( '\\:/\000?<>*|"' )
    __filename_reserved_names = set( ['nul', 'con', 'aux', 'prn',
        'com1', 'com2', 'com3', 'com4', 'com5', 'com6', 'com7', 'com8', 'com9',
        'lpt1', 'lpt2', 'lpt3', 'lpt4', 'lpt5', 'lpt6', 'lpt7', 'lpt8', 'lpt9',
        ] )

    def isInvalidFilename( filename ):
        name_set = set( filename )
        if len( name_set.intersection( __filename_bad_chars_set ) ) != 0:
            return True

        name = filename.split( '.' )[0]
        if name.lower() in __filename_reserved_names:
            return True

        return False

macOS and Unix version (I only use Unicode input, so I avoid the random-bytes problems):

    __filename_bad_chars_set = set( '/\000' )

    def isInvalidFilename( filename ):
        name_set = set( filename )
        if len( name_set.intersection( __filename_bad_chars_set ) ) != 0:
            return True

        return False

Barry
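A quick illustration of what the Windows version above reports:

{{{
print(isInvalidFilename('report.txt'))   # False
print(isInvalidFilename('com1.txt'))     # True -- reserved device name
print(isInvalidFilename('a<b>.txt'))     # True -- '<' and '>' are in the disallowed set
}}}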

On May 11, 2020, at 13:31, Barry Scott <barry@barrys-emacs.org> wrote:
macOS and Unix version (I only use Unicode input so avoid the random bytes problems):
But that doesn’t avoid the problem. If someone gives you a character whose encoding on the target filesystem includes a null or pathsep byte, your sanitizer will pass it as safe, when it shouldn’t. This isn’t possible on macOS because the OS won’t let you mount any filesystem whose encoding isn’t UTF-8, but it is possible on most other *nixes, and it has been used as an attack in the past. Is it still a realistic problem today? I don’t know. I’m pretty sure the modern versions of Shift-JIS, EUC-*, Big5, and GB can never have continuation bytes below 0x30, but even if I’m right, are these (and UTF-8, of course) the only multi-byte encodings anyone ever uses on Unix filesystems?
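UTF-16 is rarely used as a Unix filesystem encoding, but it shows the mechanism concretely: a perfectly ordinary character can encode to bytes containing NUL or the 0x2f path-separator byte.

{{{
print('A'.encode('utf-16-le'))        # b'A\x00' -- contains a NUL byte
print('\u012f'.encode('utf-16-le'))   # b'/\x01' -- contains the 0x2f path-separator byte
}}}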

Do you have an example that shows an encoding that produces a NUL or pathsep? I'm not aware of any.
This isn’t possible on macOS because the OS won’t let you mount any filesystem whose encoding isn’t UTF-8, but it is possible on most other *nixes, and it has been used as an attack in the past.
Indeed, mounting an NTFS filesystem on Linux now requires the use of the NTFS rules to validate the filename.
Is it still a realistic problem today? I don’t know. I’m pretty sure the modern versions of Shift-JIS, EUC-*, Big5, and GB can never have continuation bytes below 0x30, but even if I’m right, are these (and UTF-8, of course) the only multi-byte encodings anyone ever uses on Unix filesystems?
I suspect that legacy encodings are used in organisations with old data, but I do not have direct experience of this.

Barry

On May 12, 2020, at 01:32, Barry Scott <barry@barrys-emacs.org> wrote:
UTF-1 encodes U+D7FF to the bytes F7 2F C3. BOCU has similar examples. In the other direction, MUTF-8 decodes the bytes C0 80 to U+0000.

There were a number of cross-site scripting and misleading-link attacks abusing (mostly) BOCU in this way, which is part of the reason WHATWG banned them as charsets. Although there were other reasons (they banned stuff like SCSU and CESU-8 and UTF-7 at the same time, and I don’t think any of them have the same problem). And if there were widespread legitimate uses of these codecs, they probably wouldn’t have been banned (see UTF-16LE, which is even easier to exploit this way, but unfortunately way too common).

I don’t think Python comes with codecs for any of these encodings. And I don’t know of anyone who ever used them for filenames. (SCSU was the default fs encoding on Symbian flash memory drives, but again, I don’t think it has this problem.) So this may well not be a practical problem.
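For what it's worth, Python's strict UTF-8 codec rejects the overlong C0 80 form (MUTF-8's encoding of U+0000), so that particular trick cannot slip through a decode step that uses the stdlib codec:

{{{
try:
    b'\xc0\x80'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)   # invalid start byte -- overlong encodings are rejected
}}}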
Is it still a realistic problem today? I don’t know. I’m pretty sure the modern versions of Shift-JIS, EUC-*, Big5, and GB can never have continuation bytes below 0x30, but even if I’m right, are these (and UTF-8, of course) the only multi-byte encodings anyone ever uses on Unix filesystems?
I suspect that legacy encodings are used in organisations with old data, but I do not have direct experience of this.
I have direct experience of some of those East Asian codecs, albeit 15 or so years ago. I’m pretty sure the only ones they used were all safe. I also have experience even further back of mounting drives from Ataris and classic Macs and IBM mainframes and all kinds of other crazy things under Unix, but the filesystem drivers recoded filenames on the fly, along with providing a Unix-style hierarchical filesystem, so user-level code didn’t have to worry about MacKorean or EBCDIC or whatever any more than it had to worry about : as a pathsep and absolute paths being the ones that _don’t_ start with a pathsep and so on. So, based on my experience, it doesn’t seem likely to come up even in shops full of old data. But that experience isn’t worth much…


participants (7)
- Andrew Barnert
- Barry Scott
- Dan Sommers
- Eric V. Smith
- Stephen J. Turnbull
- Steve Jorgensen
- Wes Turner