casefolding in pathlib (PEP 428)

Hey Antoine,

Some of my Dropbox colleagues just drew my attention to the occurrence of case folding in pathlib.py. Basically, case folding as an approach to comparing pathnames is fatally flawed. The issues include:

- most OSes these days allow the mounting of both case-sensitive and case-insensitive filesystems simultaneously
- the case-folding algorithm on some filesystems is burned into the disk when the disk is formatted
- case folding requires domain knowledge, e.g. Turkish dotless I
- normalization is a mess, even on OSX, where it's better defined than elsewhere

One or more of them may reply-all to this message with more details.

--
--Guido van Rossum (python.org/~guido)
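To make the "domain knowledge" bullet concrete, here is a minimal sketch using only Python's built-in, locale-independent str methods. It is an illustration, not anything taken from pathlib: it shows that upper() and lower() disagree about whether dotless "ı" and dotted "i" are the same letter.

```python
# Locale-independent case mapping in Python, applied to Turkish letters.
dotless = "\u0131"   # LATIN SMALL LETTER DOTLESS I
dotted = "i"

# Uppercasing maps both letters to plain "I", so they look "equal"...
same_when_upper = dotless.upper() == dotted.upper()

# ...but lowercasing keeps them apart, so they look "different".
same_when_lower = dotless.lower() == dotted.lower()

print(same_when_upper, same_when_lower)
```

Whichever of the two mappings a comparison picks, it will be wrong for somebody.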

On Thu, Apr 11, 2013 at 02:11:21PM -0700, Guido van Rossum <guido@python.org> wrote:
> - the case-folding algorithm on some filesystems is burned into the disk when the disk is formatted

Into the partition, I guess, not the physical disc?

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

On Fri, Apr 12, 2013 at 09:29:44AM +1200, Robert Collins <robertc@robertcollins.net> wrote:
Ah, I've completely forgotten about that one. I was thinking in terms of filesystems. Thank you for the reminder!

Oleg.
--
Oleg Broytman http://phdru.name/ phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.

On Thu, 11 Apr 2013 14:11:21 -0700 Guido van Rossum <guido@python.org> wrote:
The problem is that:

1. if you always make the comparison case-sensitive, you'll get false negatives
2. if you make the comparison case-insensitive under Windows, you'll get false positives

My assumption was that, globally, the number of false positives in case (2) is much less than the number of false negatives in case (1).

On the other hand, one could argue that all comparisons should be case-sensitive *and* the proper way to test for "identical" paths is to access the filesystem. Which makes me think, perhaps concrete paths should get a "samefile" method as in os.path.samefile().

Hmm, I think I'm tending towards the latter right now.

Regards

Antoine.
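For reference, this is roughly what "access the filesystem" looks like with the existing os.path.samefile(). The file name and the "." spelling below are made up for the illustration; the point is only that two strings which compare unequal can still be reported as the same file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "setup.py")
    with open(target, "w") as f:
        f.write("# placeholder\n")

    # A different spelling of the same path: unequal as a string...
    alias = os.path.join(tmp, ".", "setup.py")
    strings_equal = (alias == target)

    # ...but the filesystem reports that both names reach the same file.
    same = os.path.samefile(alias, target)

print(strings_equal, same)
```

Note that both paths must exist: os.path.samefile() stats them and raises OSError otherwise, so this does not answer the question for files that have not been created yet.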

On Thu, Apr 11, 2013 at 2:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Python on OSX has been using (1) for a decade now without major problems.

Perhaps it would be best if the code never called lower() or upper() (not even indirectly via os.path.normcase()). Then any case-folding and path-normalization bugs are the responsibility of the application, and we won't have to worry about how to fix the stdlib without breaking backwards compatibility if we ever figure out how to fix this (which I somehow doubt we ever will anyway :-).

Some other issues to be mindful of:

- On Linux, paths are really bytes; on Windows (at least NTFS), they are really (16-bit) Unicode; on Mac, they are UTF-8 in a specific normal form (except on some external filesystems).
- On Windows, short names are still supported, making the number of ways to spell the path for any given file even larger.

--
--Guido van Rossum (python.org/~guido)

Le Thu, 11 Apr 2013 15:42:00 -0700, Guido van Rossum <guido@python.org> a écrit :
Ok, I've taken a look at the code. Right now lower() is used for two purposes:

1. comparisons (__eq__ and __ne__)
2. globbing and matching

While (1) could be dropped, for (2) I think we want glob("*.py") to find "SETUP.PY" under Windows. Anything else will probably be surprising to users of that platform.

pathlib is just relying on Python 3's sane handling of unicode paths (thanks to PEP 383). Bytes paths are never used internally.

> - On Windows, short names are still supported, making the number of ways to spell the path for any given file even larger.

They are still supported but I doubt they are still relied on (long filenames appeared in Windows 95!). I think in common situations we can ignore their existence. Specialized tools like Mercurial may have to know that they exist, in order to manage potential collisions (but Mercurial isn't really the target audience for pathlib, and I don't think they would be interested in such an abstraction).

Regards

Antoine.
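The globbing case, where "*.py" should find "SETUP.PY", can be sketched without calling lower() on the names at all. This is only an illustration built on the stdlib fnmatch and re modules, not pathlib's actual code; match_ignoring_case is a hypothetical helper name.

```python
import fnmatch
import re

def match_ignoring_case(name, pattern):
    """Return True if `name` matches the shell-style `pattern`, ignoring case."""
    # fnmatch.translate() turns the glob pattern into a regular expression.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return regex.match(name) is not None

names = ["SETUP.PY", "setup.py", "README", "Setup.pyc"]
matches = [n for n in names if match_ignoring_case(n, "*.py")]
print(matches)
```

re.IGNORECASE has its own idea of case equivalence, so this inherits the same caveats about folding rules raised at the top of the thread; it merely moves the decision into the matcher.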

On 12 April 2013 09:39, Antoine Pitrou <solipsis@pitrou.net> wrote:
If glob("*.py") failed to find SETUP.PY on Windows, that would be a usability disaster. Too many tools still exist that mangle filename case for anything else to be acceptable. For an easy example, the standard Windows ssh client, putty, is distributed as PUTTY.EXE. shutil.which('putty') needs to find that file if it's to be of any practical use.

For comparisons, I think naive Windows users would expect __eq__ comparisons to work case insensitively, but Windows users with any level of understanding of cross-platform portability issues would be comfortable with the idea that this is risky. Having said that, currently there aren't any "pathname comparisons" as such, just string comparisons which "clearly" need application handling.

In all honesty, I don't think that equality comparison for path *objects* (as opposed to "pathnames" as strings) is necessarily even well defined. If someone has two path objects and tries to compare them for equality, my first question would be whether that's really what they want to do... (But case-sensitive comparison, with copious warnings, is probably a reasonable practical compromise).

Paul

On 12 Apr, 2013, at 10:39, Antoine Pitrou <solipsis@pitrou.net> wrote:
Globbing necessarily accesses the filesystem and could in theory do the right thing, except for the minor detail of there not being an easy way to determine if the names in a particular folder are compared case-sensitively or not.
On OSX the kernel will normalize names for you, at least for HFS+, and therefore two names that don't compare equal with '==' can refer to the same file (for example the NFKD and NFKC forms of Löwe). Isn't unicode fun :-)

Ronald

Am 12.04.2013 14:43, schrieb Ronald Oussoren:
Seriously, the OSX kernel normalizes unicode forms? It's a cool feature and makes sense from the user's POV but ... WTF?

Perhaps we should use the platform's API for the job. Does OSX offer an API function to create a case folded and canonical form of a path? Windows has PathCchCanonicalizeEx().

Christian

On 12 Apr, 2013, at 15:00, Christian Heimes <christian@python.org> wrote:
IIRC only for HFS+ filesystems; it is possible to access files on an NFS share where the filename encoding isn't UTF-8.
This would have to be done on a per-path-element basis, because every directory in a file's path could be on a separate filesystem with different conventions (HFS+, HFS+ case sensitive, NFS mounted unix filesystem).

I have found sample code that can determine if a directory is on a case sensitive filesystem (attached to <http://lists.apple.com/archives/darwin-dev/2007/Apr/msg00036.html>; it doesn't work in a 64-bit binary but I haven't checked yet why it doesn't work there). I don't know if there is a function to determine the filesystem encoding. I guess assuming that the special casing is only needed for HFS+ variants could work, but I'd have to test that.

Ronald
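A portable (if crude) alternative to a platform API is to probe a directory directly. The sketch below is not the Apple sample code referenced above; it is an assumption-laden illustration that needs write access to the directory, and probe_case_sensitive is a hypothetical name.

```python
import os
import tempfile

def probe_case_sensitive(directory):
    """Guess whether names in `directory` are compared case-sensitively.

    Creates a temporary file whose name contains lower-case letters and
    checks whether the upper-cased spelling of that name also resolves.
    """
    with tempfile.NamedTemporaryFile(prefix="casetest", dir=directory) as tmp:
        head, tail = os.path.split(tmp.name)
        # If the upper-cased name exists too, lookups ignore case here.
        return not os.path.exists(os.path.join(head, tail.upper()))

result = probe_case_sensitive(tempfile.gettempdir())
print(result)
```

As noted above, the answer only holds for that one directory: every element of a longer path may sit on a different filesystem and would need its own probe.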

Le Fri, 12 Apr 2013 14:43:42 +0200, Ronald Oussoren <ronaldoussoren@mac.com> a écrit :
It's also much less efficient, since you have to stat() every potential match. e.g. when encountering "SETUP.PY", you would have to stat() (or, rather, lstat()) both "setup.py" and "SETUP.PY" to check if they have the same st_ino.
I don't think differently normalized filenames are as common on OS X as differently cased filenames are on Windows, right? Regards Antoine.

On 12 Apr, 2013, at 16:59, Antoine Pitrou <solipsis@pitrou.net> wrote:
I found a way to determine if names in a directory are stored case sensitive, at least on OSX. That way you'd only have to perform one call for the directory, or one call per path element that contains wildcard characters for glob.glob. That API is definitely platform specific.

The problem is more that HFS+ stores names with decomposed characters, which basically means that accents are stored separate from their base characters. In most input the accented character will be one character, and hence a naive comparison like this could fail to match:

.> import os
.> name = input()
.> for fn in os.listdir('.'):
.>     if fn.lower() == name.lower():
.>         print("Found {} in the current directory".format(name))

Ronald
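The failure mode of the naive loop, and one possible way around it, can be shown without touching a filesystem. The sketch below normalizes both names with the stdlib unicodedata module before comparing; same_name is a hypothetical helper, and the choice of NFC plus casefold() is an assumption for the example, not a reproduction of the rules HFS+ itself applies.

```python
import unicodedata

def same_name(a, b):
    """Compare two file names after Unicode normalization and case folding."""
    def key(s):
        folded = unicodedata.normalize("NFC", s).casefold()
        # Case folding can leave the string unnormalized, so normalize again.
        return unicodedata.normalize("NFC", folded)
    return key(a) == key(b)

composed = "L\u00f6we"        # "ö" as a single code point
decomposed = "Lo\u0308we"     # "o" followed by a combining diaeresis

naive = composed.lower() == decomposed.lower()
normalized = same_name(composed, decomposed)
print(naive, normalized)
```

This only makes the comparison less wrong for typed input versus stored names; it still guesses at equivalence instead of asking the filesystem.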

On Fri, Apr 12, 2013 at 1:39 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yeah, I suppose so. But there are more crazy details. E.g. IIRC Windows silently ignores trailing dots in filenames. Do we want "*.py." to match SETUP.PY then?
I suppose that just leaves Unicode normalization, discussed later in the thread.
Actually, I've heard of code that dynamically falls back on short names when paths using long names exceed the system limit for path length (either 256 or 1024 IIRC). But short names pretty much require consulting the filesystem, so we can probably ignore them.

I guess the bottom line is that, no matter how hard pathlib tries, apps cannot always rely on the predictions about filename validity or equivalence made by pathlib -- we'll have to document that it may be wrong, even though we have the moral obligation to make sure that it is right as often as possible.

--
--Guido van Rossum (python.org/~guido)

Guido van Rossum <guido@python.org> writes:
The limit is 260 characters. But longer paths can be handled by prepending \\?\ and using the unicode APIs.

see http://msdn.microsoft.com/en-us/library/aa365247.aspx#maxpath

we have the following code to handle the above insanity:

,----
| def prepend_magic_win32(path):
|     assert isinstance(path, unicode), "path must be of type unicode"
|
|     if path.startswith(u"\\\\"):
|         if path.startswith(u"\\\\?\\"):
|             return path
|         else:
|             return u"\\\\?\\UNC\\" + path[2:]
|     else:
|         return u"\\\\?\\" + path
`----

--
Cheers
Ralf
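The helper above is Python 2 code (it relies on the unicode type). A Python 3 rendering might look like the sketch below; it keeps the same three cases, swaps the assert for a TypeError (a change made for this sketch), and, like the original, assumes the caller passes an absolute path without checking it.

```python
def prepend_magic_win32(path):
    # Return `path` with the Windows extended-length prefix added.
    if not isinstance(path, str):
        raise TypeError("path must be a str")
    if path.startswith("\\\\?\\"):
        # Already in extended-length form: leave it alone.
        return path
    if path.startswith("\\\\"):
        # UNC path: replace the leading two backslashes with the UNC prefix.
        return "\\\\?\\UNC\\" + path[2:]
    # Ordinary drive-letter path.
    return "\\\\?\\" + path

local = prepend_magic_win32("C:\\tmp\\file.txt")
share = prepend_magic_win32("\\\\server\\share\\file.txt")
print(local)
print(share)
```

Since it is pure string manipulation, it runs on any platform, which also means it cannot tell whether the result names anything real.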

On Fri, 12 Apr 2013 19:42:25 +0200 Ralf Schmitt <ralf@systemexit.de> wrote:
Indeed. I thought I might use them by default in pathlib but there are other pains: notably, extended paths (those starting with \\?\) can only be absolute. So pathlib supports *passing* them explicitly (kind of, there are very few tests for them) but it doesn't construct them implicitly.

(as Dirkjan pointed out, Mercurial also has domain-specific code to handle Windows path quirks; this is where I took the idea of having an is_reserved() method for NUL, CON, etc.)

Regards

Antoine.

On 04/12/2013 10:05 AM, Guido van Rossum wrote:
Someone who is fresher than I am at Windows programming should answer this, but AFAICT Win32 provides no API that will tell you if a particular filename / volume is case sensitive. The VOLUME2 structure from GetVolumeInfo doesn't report anything, and FindFirstFileEx provides a special flag for you to tell the OS (!) whether or not you want case-sensitive globbing. The closest I can get with my cursory browsing of MSDN is that you could infer case-sensitivity from the filesystem reported by GetVolumeInfo, but I doubt even that would be perfect.

My only suggestion: lob the problem back into the user's lap, perhaps with something like

    pathlib.cs['/'] = True
    pathlib.cs['/mnt/samba-share'] = False
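One way such a user-maintained table could be consulted is a nearest-ancestor lookup. Everything in this sketch is hypothetical: pathlib has no cs attribute, and the cs dictionary and lookup_case_sensitivity function are invented here purely to show the shape of the idea.

```python
from pathlib import PurePosixPath

# Hypothetical registry: mount point -> "is this tree case-sensitive?"
cs = {
    PurePosixPath("/"): True,
    PurePosixPath("/mnt/samba-share"): False,
}

def lookup_case_sensitivity(path):
    """Return the rule of the nearest registered ancestor of `path`."""
    path = PurePosixPath(path)
    # Check the path itself first, then each parent, nearest first.
    for candidate in [path, *path.parents]:
        if candidate in cs:
            return cs[candidate]
    raise LookupError("no case-sensitivity rule registered for {}".format(path))

on_share = lookup_case_sensitivity("/mnt/samba-share/Docs/report.txt")
on_root = lookup_case_sensitivity("/home/user/setup.py")
print(on_share, on_root)
```

The lookup is purely lexical, so it is only as accurate as the table the user supplies, and it knows nothing about symlinks or nested mounts that were not registered.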
> (long filenames appeared in Windows 95!).
That wasn't their first appearance; I'm pretty sure Windows NT 3.1 supported long filenames in 1992, and though I don't remember specifically it's possible NT 3.1 also supported long and short filenames for the same file. Windows 95 was the first appearance of VFAT, the clever hack adding support for long and short filenames to FAT filesystems. /arry

On 12/04/2013 22:15, Larry Hastings wrote:
I don't have web access at the moment to check but IIRC the GetVolumeInformation call does return an indicator of whether the volume is case-sensitive via the VOLUME_FLAG flag enum. At least, it claims to: I don't have access to a case-sensitive filesystem to check whether it's lying or not. TJG

On Thu, Apr 11, 2013 at 11:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
> Hmm, I think I'm tending towards the latter right now.
You might also want to look at what Mercurial does. As a cross-platform filesystem-oriented tool, it has some interesting issues in the department of casefolding. Cheers, Dirkjan

On 11Apr2013 14:11, Guido van Rossum <guido@python.org> wrote:
| Some of my Dropbox colleagues just drew my attention to the occurrence
| of case folding in pathlib.py. Basically, case folding as an approach
| to comparing pathnames is fatally flawed. The issues include:
|
| - most OSes these days allow the mounting of both case-sensitive and
|   case-insensitive filesystems simultaneously
|
| - the case-folding algorithm on some filesystems is burned into the
|   disk when the disk is formatted
|
| - case folding requires domain knowledge, e.g. turkish dotless I
|
| - normalization is a mess, even on OSX, where it's better defined than elsewhere

Yes, but what's the use case? Specifically, _why_ are you comparing pathnames?

To my mind case folding is just one mode of filename conflict. Surely there are others (forbidden characters in some domains, like colons; names significant only to a certain number of characters; and so forth).

Thus: what specific problem are you case-folding to address?

On a personal basis I would normally address this kind of thing with stat(), avoiding any app knowledge about pathname rules: does this path exist, or are these paths referencing the same file? But of course that doesn't solve the wider issue with Dropbox, where the rules differ per platform and where work can take place disparately on separate hosts.

Imagining Dropbox, I'd guess there's a file tree in the backing store. What is its policy? Does it allow multiple files differing only by case? I can imagine that would be bad when the tree is presented on a case insensitive platform (eg Windows, default MacOSX).

Taking the view that DropBox should avoid that situation (in what are doubtless several forms), does Dropbox pre-emptively prevent making files with specific names based on what is already in the store, or resolve them to the same object (hard link locally, or simply and less confusingly and more portably, diverting opens to the existing name like a CI filesystem would)?

What about offline? That suggests that the forbidden modes should be known to the Dropbox app too. Is this the use case for comparing filenames instead of just doing a stat() to the local filesystem or to the remote backing store (via a virtual stat, as it were)?

What does Dropbox do if the local app is disabled and a user runs riot in the Dropbox directory, making conflicting names: allowed by the local FS but conflicting in the backing store or on other hosts?

What does Dropbox do if a user makes conflicting files independently on different hosts, and then syncs?

I just feel you've got a name conflict issue to resolve (and how that's done is partly just policy), and pathlib which offers some facilities related to that kind of thing. But a mismatch between what you actually need to do and what pathlib offers.

Fixing your problem isn't necessarily a bugfix for pathlib.

I think we need to know the wider context.

Cheers,
--
Cameron Simpson <cs@zip.com.au>

I had a *bad* day. I had to subvert my principles and kowtow to an idiot.
Television makes these daily sacrifices possible. It deadens the inner core
of my being. - Martin Donovan, _Trust_

On Thu, Apr 11, 2013 at 4:09 PM, Cameron Simpson <cs@zip.com.au> wrote:
Um, this isn't about Dropbox. This is a warning against the inclusion of any behavior depending on case folding in pathlib, based on experience with case folding at Dropbox (both in the client and in the server).
Of course.
> Thus: what specific problem are you case-folding to address?
Why Dropbox is folding case really doesn't matter. But we have to deal with it because users expect unreasonable things, such as having two files, "readme" and "README", on a Linux box, and then syncing both files to a box running Windows or OSX. (There are many other edge cases, most not involving Linux at all.) We can't always use os.stat() because some of this logic runs on a box where the files don't exist (e.g. the server, or the Linux box in the above example).
You seem to be completely misunderstanding me. I am not asking for help solving our problem. I am giving advice to avoid baking the wrong solution to this class of problems into a new stdlib module.
You got the basic idea, but we can't just refuse to sync files that might be problematic on some other box. Suppose someone is using Dropbox just as a backup service for their Linux box. They shouldn't have to worry about case clashes not being backed up.
We have lots of different solutions based on the specific situations.
Again, please don't try to solve our problem for us.
My reasoning is as follows. If pathlib supports functionality for checking whether two paths spelled differently point to the same file, users are going to rely on that functionality. But if the implementation is based on knowing how and when to case fold, it will definitely have bugs. So I am proposing to either remove that functionality, or to implement it by consulting the filesystem. Which of course has its own set of issues, for example if the file doesn't exist yet, but there are ways to deal with that too.

--
--Guido van Rossum (python.org/~guido)

On 11Apr2013 16:23, Guido van Rossum <guido@python.org> wrote:
| On Thu, Apr 11, 2013 at 4:09 PM, Cameron Simpson <cs@zip.com.au> wrote:
| > On 11Apr2013 14:11, Guido van Rossum <guido@python.org> wrote:
| > | Some of my Dropbox colleagues just drew my attention to the occurrence
| > | of case folding in pathlib.py. Basically, case folding as an approach
| > | to comparing pathnames is fatally flawed. [...]
| >
| > Yes, but what's the use case? Specifically, _why_ are you comparing pathnames?
|
| Um, this isn't about Dropbox. This is a warning against the inclusion
| of any behavior depending on case folding in pathlib, based on
| experience with case folding at Dropbox (both in the client and in the
| server).

Ah. That wasn't so apparent to me. I took you to have tripped over a specific problem that pathlib appeared to be mis-solving.

I've always viewed path normalisation and its ilk as hazard prone and very context dependent, so I tend not to do it if I can help it.

| You seem to be completely misunderstanding me. I am not asking for
| help solving our problem. I am giving advice to avoid baking the wrong
| solution to this class of problems into a new stdlib module.

Ok, fine.

[...snip lots of stuff now not relevant...]

| My reasoning is as follows. If pathlib supports functionality for
| checking whether two paths spelled differently point to the same file,
| users are going to rely on that functionality. But if the
| implementation is based on knowing how and when to case fold, it will
| definitely have bugs. So I am proposing to either remove that
| functionality, or to implement it by consulting the filesystem. Which
| of course has its own set of issues, for example if the file doesn't
| exist yet, but there are ways to deal with that too.

Personally I'd be for removing it, or making the doco quite blunt about the many possible shortcomings of guessing whether two paths are the same thing without being able to stat() them.

I'll back out now.

Cheers,
--
Cameron Simpson <cs@zip.com.au>

Having been erased,
The document you're seeking
Must now be retyped.
- Haiku Error Messages http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html

On Thu, Apr 11, 2013 at 02:11:21PM -0700, Guido van Rossum <guido@python.org> wrote:
- the case-folding algorithm on some filesystems is burned into the disk when the disk is formatted
Into the partition, I guess, not the physical disc? Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Fri, Apr 12, 2013 at 09:29:44AM +1200, Robert Collins <robertc@robertcollins.net> wrote:
Ah, I've completely forgotten about that one. I was thinking in terms of filesystems. Thank you for reminding! Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Thu, 11 Apr 2013 14:11:21 -0700 Guido van Rossum <guido@python.org> wrote:
The problem is that: - if you always make the comparison case-sensitive, you'll get false negatives - if you make the comparison case-insensitive under Windows, you'll get false positives My assumption was that, globally, the number of false positives in case (2) is much less than the number of false negatives in case (1). On the other hand, one could argue that all comparisons should be case-sensitive *and* the proper way to test for "identical" paths is to access the filesystem. Which makes me think, perhaps concrete paths should get a "samefile" method as in os.path.samefile(). Hmm, I think I'm tending towards the latter right now. Regards Antoine.

On Thu, Apr 11, 2013 at 2:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Python on OSX has been using (1) for a decade now without major problems. Perhaps it would be best if the code never called lower() or upper() (not even indirectly via os.path.normcase()). Then any case-folding and path-normalization bugs are the responsibility of the application, and we won't have to worry about how to fix the stdlib without breaking backwards compatibility if we ever figure out how to fix this (which I somehow doubt we ever will anyway :-). Some other issues to be mindful of: - On Linux, paths are really bytes; on Windows (at least NTFS), they are really (16-bit) Unicode; on Mac, they are UTF-8 in a specific normal form (except on some external filesystems). - On Windows, short names are still supported, making the number of ways to spell the path for any given file even larger. -- --Guido van Rossum (python.org/~guido)

Le Thu, 11 Apr 2013 15:42:00 -0700, Guido van Rossum <guido@python.org> a écrit :
Ok, I've taken a look at the code. Right now lower() is used for two purposes: 1. comparisons (__eq__ and __ne__) 2. globbing and matching While (1) could be dropped, for (2) I think we want glob("*.py") to find "SETUP.PY" under Windows. Anything else will probably be surprising to users of that platform.
pathlib is just relying on Python 3's sane handling of unicode paths (thanks to PEP 383). Bytes paths are never used internally.
- On Windows, short names are still supported, making the number of ways to spell the path for any given file even larger.
They are still supported but I doubt they are still relied on (long filenames appeared in Windows 95!). I think in common situations we can ignore their existence. Specialized tools like Mercurial may have to know that they exist, in order to manage potential collisions (but Mercurial isn't really the target audience for pathlib, and I don't think they would be interested in such an abstraction). Regards Antoine.

On 12 April 2013 09:39, Antoine Pitrou <solipsis@pitrou.net> wrote:
If glob("*.py") failed to find SETUP.PY on Windows, that would be a usability disaster. Too many tools still exist that mangle filename case for anything else to be acceptable. For an easy example, the standard Windows ssh client, putty, is distributed as PUTTY.EXE. shutil.which('putty') needs to find that file if it's to be of any practical use. For comparisons, I think naive Windows users would expect __eq__ comparisons to work case insensitively, but Windows users with any level of understanding of cross-platform portability issues would be comfortable with the idea that this is risky. Having said that, currently there aren't any "pathname comparisons" as such, just string comparisons which "clearly" need application handling. In all honesty, I don't think that equality comparison for path *objects* (as opposed to "pathnames" as strings) is necessarily even well defined. If someone has two path objects and tries to compare them for equality, my first question would be whether that's really what they want to do... (But case-sensitive comparison, with copious warnings, is probably a reasonable practical compromise). Paul

On 12 Apr, 2013, at 10:39, Antoine Pitrou <solipsis@pitrou.net> wrote:
Globbing necessarily accesses the filesystem and could in theory do the right thing, except for the minor detail of there not being an easy way to determine of the names in a particular folder are compared case sensitive or not.
At least for OSX the kernel will normalize names for you, at least for HFS+, and therefore two names that don't compare equal with '==' can refer to the same file (for example the NFKD and NFKC forms of Löwe). Isn't unicode fun :-) Ronald

Am 12.04.2013 14:43, schrieb Ronald Oussoren:
Seriously, the OSX kernel normalizes unicode forms? It's a cool feature and makes sense for the user's POV but ... WTF? Perhaps we should use the platform's API for the job. Does OSX offer an API function to create a case folded and canonical form of a path? Windows has PathCchCanonicalizeEx(). Christian

On 12 Apr, 2013, at 15:00, Christian Heimes <christian@python.org> wrote:
IIRC only for HFS+ filesystems, it is possible to access files on an NFS share where the filename encoding isn't UTF-8.
This would have to be done on a per path element case, because every directory in a file's path could be on a separate filesystem with different conventions (HFS+, HFS+ case sensitive, NFS mounted unix filesystem). I have found sample code that can determine if a directory is on a case sensitive filesystem (attached to <http://lists.apple.com/archives/darwin-dev/2007/Apr/msg00036.html>, doesn't work in a 64-binary but I haven't check yet why is doesn't work there). I don'tknow if there is a function to determine the filesystem encoding, I guess assuming that the special casing is only needed for HFS+ variants could work but I'd have test that. Ronald

Le Fri, 12 Apr 2013 14:43:42 +0200, Ronald Oussoren <ronaldoussoren@mac.com> a écrit :
It's also much less efficient, since you have to stat() every potential match. e.g. when encountering "SETUP.PY", you would have to stat() (or, rather, lstat()) both "setup.py" and "SETUP.PY" to check if they have the same st_ino.
I don't think differently normalized filenames are as common on OS X as differently cased filenames are on Windows, right? Regards Antoine.

On 12 Apr, 2013, at 16:59, Antoine Pitrou <solipsis@pitrou.net> wrote:
I found a way to determine if names in a directory are stored case sensitive, at least on OSX. That way you'd only have to perform one call for the directory, or one call per path element that contains wildcard characters for glob.glob. That API is definitly platform specific.
The problem is more that HFS+ stores names with decomposed characters, which basicly means that accents are stored separate from their base characters. In most input the accented character will be one character, and hence a naieve comparison like this could fail to match: .> name = input() .> for fn in os.listdir('.'): .> if fn.lower() == name.lower(): .> print("Found {} in the current directory".format(name)) Ronald

On Fri, Apr 12, 2013 at 1:39 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yeah, I suppose so. But there are more crazy details. E.g. IIRC Windows silently ignores trailing dots in filenames. Do we want "*.py." to match SETUP.PY then?
I suppose that just leaves Unicode normalization, discussed later in the thread.
Actually, I've heard of code that dynamically falls back on short names when paths using long names exceed the system limit for path length (either 256 or 1024 IIRC). But short names pretty much require consulting the filesystem, so we can probably ignore them. I guess the bottom line is that, no matter how hard pathlib tries, apps cannot always rely on the predictions about filename validity or equivalence made by pathlib -- we'll have to document that it may be wrong, even though we have the moral obligation to make sure that it is right as often as possible. -- --Guido van Rossum (python.org/~guido)

Guido van Rossum <guido@python.org> writes:
The limit is 260 characters. But longer paths can be handled by prepending \\?\ and using the unicode APIs. see http://msdn.microsoft.com/en-us/library/aa365247.aspx#maxpath we have the following code to handle the above insanity: ,---- | def prepend_magic_win32(path): | assert isinstance(path, unicode), "path must be of type unicode" | | if path.startswith(u"\\\\"): | if path.startswith(u"\\\\?\\"): | return path | else: | return u"\\\\?\\UNC\\" + path[2:] | else: | return u"\\\\?\\" + path `---- -- Cheers Ralf

On Fri, 12 Apr 2013 19:42:25 +0200 Ralf Schmitt <ralf@systemexit.de> wrote:
Indeed. I thought I might use them by default in pathlib but there are other pains: notably, extended paths (those starting with \\?\) can only be absolute. So pathlib supports *passing* them explicitly (kind of, there are very few tests for them) but it doesn't constructs them implicitly. (as Dirkjan pointed out, Mercurial also has domain-specific code to handle Windows paths quirks; this is where I took the idea of having a is_reserved() method for NUL, CON, etc.) Regards Antoine.

On 04/12/2013 10:05 AM, Guido van Rossum wrote:
Someone who is fresher than I am at Windows programming should answer this, but AFAICT Win32 provides no API that will tell you if a particular filename / volume is case sensitive. The VOLUME2 structure from GetVolumeInfo doesn't report anything, and FindFirstFileEx provides a special flag for you to tell the OS (!) whether or not you want case-sensitive globbing. The closest I can get with my cursory browsing of MSDN is that you could infer case-sensitivity from the filesystem reported by GetVolumeInfo, but I doubt even that would be perfect. My only suggestion: lob the problem back into the user's lap, perhaps with something like pathlib.cs['/'] = True pathlib.cs['/mnt/samba-share'] = False
(long filenames appeared in Windows 95!).
That wasn't their first appearance; I'm pretty sure Windows NT 3.1 supported long filenames in 1992, and though I don't remember specifically it's possible NT 3.1 also supported long and short filenames for the same file. Windows 95 was the first appearance of VFAT, the clever hack adding support for long and short filenames to FAT filesystems. /arry

On 12/04/2013 22:15, Larry Hastings wrote:
I don't have web access at the moment to check but IIRC the GetVolumeInformation call does return an indicator of whether the volume is case-sensitive via the VOLUME_FLAG flag enum. At least, it claims to: I don't have access to a case-sensitive filesystem to check whether it's lying or not. TJG

On Thu, Apr 11, 2013 at 11:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Hmm, I think I'm tending towards the latter right now.
You might also want to look at what Mercurial does. As a cross-platform filesystem-oriented tool, it has some interesting issues in the department of casefolding. Cheers, Dirkjan

On 11Apr2013 14:11, Guido van Rossum <guido@python.org> wrote: | Some of my Dropbox colleagues just drew my attention to the occurrence | of case folding in pathlib.py. Basically, case folding as an approach | to comparing pathnames is fatally flawed. The issues include: | | - most OSes these days allow the mounting of both case-sensitive and | case-insensitive filesystems simultaneously | | - the case-folding algorithm on some filesystems is burned into the | disk when the disk is formatted | | - case folding requires domain knowledge, e.g. turkish dotless I | | - normalization is a mess, even on OSX, where it's better defined than elsewhere Yes, but what's the use case? Specificly, _why_ are you comparing pathnames? To my mind case folding is just one mode of filename conflict. Surely there are others (forbidden characters in some domains, like colons; names significant only to a certain number of characters; an so forth). Thus: what specific problem are you case-folding to address? On a personal basis I would normally address this kind of thing with stat(), avoiding any app knowledge about pathname rules: does this path exist, or are these paths referencing the same file? But of course that doesn't solve the wider issue with Dropbox, where the rules differ per platform and where work can take place disparately on separate hosts. Imagining Dropbox, I'd guess there's a file tree in the backing store. What is its policy? Does it allow multiple files differing only by case? I can imagine that would be bad when the tree is presented on a case insensitive platform (eg Windows, default MacOSX). Taking the view that DropBox should avoid that situation (in what are doubtless several forms), does Dropbox pre-emptively prevent making files with specific names based on what is already in the store, or resolve them to the same object (hard link locally, or simply and less confusingly and more portably, diverting opens to the existing name like a CI filesystem would)? 
What about offline? That suggests that the forbidden modes should be known to the Dropbox app too. Is this the use case for comparing filenames instead of just doing a stat() to the local filesystem or to the remote backing store (via a virtual stat, as it were)?

What does Dropbox do if the local app is disabled and a user runs riot in the Dropbox directory, making conflicting names: allowed by the local FS but conflicting in the backing store or on other hosts? What does Dropbox do if a user makes conflicting files independently on different hosts, and then syncs?

I just feel you've got a name conflict issue to resolve (and how that's done is partly just policy), and pathlib, which offers some facilities related to that kind of thing, but a mismatch between what you actually need to do and what pathlib offers. Fixing your problem isn't necessarily a bugfix for pathlib. I think we need to know the wider context.

Cheers,
--
Cameron Simpson <cs@zip.com.au>
I had a *bad* day. I had to subvert my principles and kowtow to an idiot. Television makes these daily sacrifices possible. It deadens the inner core of my being. - Martin Donovan, _Trust_
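The stat()-based check described in this message can be sketched roughly as follows. This is only an illustration, assuming both paths live on a filesystem the local machine can reach; `refers_to_same_file` and `demo` are hypothetical names, not part of pathlib or of anything proposed in the thread.

```python
import os
import tempfile


def refers_to_same_file(path_a, path_b):
    """Return True only if both paths exist and name the same file.

    The answer comes from the filesystem (via os.stat), not from any
    rule about how names are spelled.
    """
    try:
        stat_a = os.stat(path_a)
        stat_b = os.stat(path_b)
    except OSError:
        # At least one path is missing or unreadable, so there is
        # nothing to compare.
        return False
    return os.path.samestat(stat_a, stat_b)


def demo():
    """Compare a few spellings inside a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "readme")
        other = os.path.join(tmp, "notes")
        for path in (original, other):
            with open(path, "w") as handle:
                handle.write("hello\n")
        respelled = os.path.join(tmp, ".", "readme")
        missing = os.path.join(tmp, "no-such-file")
        return (
            refers_to_same_file(original, respelled),
            refers_to_same_file(original, other),
            refers_to_same_file(original, missing),
        )
```

Such a check only works where the files actually exist, which is exactly the limitation raised in the reply that follows.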

On Thu, Apr 11, 2013 at 4:09 PM, Cameron Simpson <cs@zip.com.au> wrote:
Um, this isn't about Dropbox. This is a warning against the inclusion of any behavior depending on case folding in pathlib, based on experience with case folding at Dropbox (both in the client and in the server).
Of course.
> Thus: what specific problem are you case-folding to address?
Why Dropbox is folding case really doesn't matter. But we have to deal with it because users expect unreasonable things, such as having two files, "readme" and "README", on a Linux box, and then syncing both files to a box running Windows or OSX. (There are many other edge cases, most not involving Linux at all.) We can't always use os.stat() because some of this logic runs on a box where the files don't exist (e.g. the server, or the Linux box in the above example).
You seem to be completely misunderstanding me. I am not asking for help solving our problem. I am giving advice to avoid baking the wrong solution to this class of problems into a new stdlib module.
You got the basic idea, but we can't just refuse to sync files that might be problematic on some other box. Suppose someone is using Dropbox just as a backup service for their Linux box. They shouldn't have to worry about case clashes not being backed up.
We have lots of different solutions based on the specific situations.
Again, please don't try to solve our problem for us.
My reasoning is as follows. If pathlib supports functionality for checking whether two paths spelled differently point to the same file, users are going to rely on that functionality. But if the implementation is based on knowing how and when to case fold, it will definitely have bugs. So I am proposing to either remove that functionality, or to implement it by consulting the filesystem. Which of course has its own set of issues, for example if the file doesn't exist yet, but there are ways to deal with that too. -- --Guido van Rossum (python.org/~guido)
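One way to "consult the filesystem" instead of assuming a case rule per operating system is to probe the directory in question. The sketch below is an assumption about how that might look, not the approach Guido or pathlib settled on; `directory_is_case_sensitive` is a hypothetical helper, and it needs write access to the directory being probed.

```python
import os
import tempfile


def directory_is_case_sensitive(directory):
    """Probe one directory rather than guessing from the platform.

    A temporary file with upper-case letters in its name is created,
    and the filesystem is asked whether the lower-cased spelling of
    that name also exists. If it does, lookups in this directory
    ignore case.
    """
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        folded_name = os.path.basename(probe.name).lower()
        folded_path = os.path.join(directory, folded_name)
        return not os.path.exists(folded_path)
```

The result describes only the probed directory at that moment: a different mount point on the same machine may answer differently, which is the first issue in the original list.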

On 11Apr2013 16:23, Guido van Rossum <guido@python.org> wrote:
| On Thu, Apr 11, 2013 at 4:09 PM, Cameron Simpson <cs@zip.com.au> wrote:
| > On 11Apr2013 14:11, Guido van Rossum <guido@python.org> wrote:
| > | Some of my Dropbox colleagues just drew my attention to the occurrence
| > | of case folding in pathlib.py. Basically, case folding as an approach
| > | to comparing pathnames is fatally flawed. [...]
| >
| > Yes, but what's the use case? Specifically, _why_ are you comparing pathnames?
|
| Um, this isn't about Dropbox. This is a warning against the inclusion
| of any behavior depending on case folding in pathlib, based on
| experience with case folding at Dropbox (both in the client and in the
| server).

Ah. That wasn't so apparent to me. I took you to have tripped over a specific problem that pathlib appeared to be mis-solving.

I've always viewed path normalisation and its ilk as hazard prone and very context dependent, so I tend not to do it if I can help it.

| You seem to be completely misunderstanding me. I am not asking for
| help solving our problem. I am giving advice to avoid baking the wrong
| solution to this class of problems into a new stdlib module.

Ok, fine.

[...snip lots of stuff now not relevant...]

| My reasoning is as follows. If pathlib supports functionality for
| checking whether two paths spelled differently point to the same file,
| users are going to rely on that functionality. But if the
| implementation is based on knowing how and when to case fold, it will
| definitely have bugs. So I am proposing to either remove that
| functionality, or to implement it by consulting the filesystem. Which
| of course has its own set of issues, for example if the file doesn't
| exist yet, but there are ways to deal with that too.

Personally I'd be for removing it, or making the doco quite blunt about the many possible shortcomings of guessing whether two paths are the same thing without being able to stat() them.

I'll back out now.
Cheers, -- Cameron Simpson <cs@zip.com.au> Having been erased, The document you're seeking Must now be retyped. - Haiku Error Messages http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html
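To make the shortcomings of string-only guessing concrete, here is a small sketch using only Python's built-in string methods and `unicodedata`. The three comparison helpers are hypothetical names chosen for illustration; none of them is claimed to match the rules of any particular filesystem, which is the point.

```python
import unicodedata


def lower_equal(a, b):
    """Compare with str.lower(), the usual naive approach."""
    return a.lower() == b.lower()


def casefold_equal(a, b):
    """Compare with str.casefold(), Unicode's locale-independent folding."""
    return a.casefold() == b.casefold()


def normalized_casefold_equal(a, b):
    """Normalize to NFC first, then casefold."""
    def fold(text):
        return unicodedata.normalize("NFC", text).casefold()
    return fold(a) == fold(b)


# The three helpers disagree on ordinary-looking names.
sharp_s = ("Straße", "STRASSE")
print(lower_equal(*sharp_s), casefold_equal(*sharp_s))

# The same visible name, precomposed versus decomposed.
accents = ("caf\u00e9", "cafe\u0301")
print(casefold_equal(*accents), normalized_casefold_equal(*accents))

# Python's lower() is locale-independent, so it cannot apply the
# Turkish rule that maps "I" to dotless "\u0131".
print("I".lower() == "i", "I".lower() == "\u0131")
```

Each helper gives a defensible answer, and a given filesystem may agree with any of them or with none, so a comparison built on any one of these will be wrong somewhere.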
participants (13)
- Antoine Pitrou
- Cameron Simpson
- Christian Heimes
- Devin Jeanpierre
- Dirkjan Ochtman
- Guido van Rossum
- Larry Hastings
- Oleg Broytman
- Paul Moore
- Ralf Schmitt
- Robert Collins
- Ronald Oussoren
- Tim Golden