[Python-Dev] casefolding in pathlib (PEP 428)

Antoine Pitrou solipsis at pitrou.net
Fri Apr 12 10:39:54 CEST 2013


On Thu, 11 Apr 2013 15:42:00 -0700,
Guido van Rossum <guido at python.org> wrote:
> On Thu, Apr 11, 2013 at 2:27 PM, Antoine Pitrou <solipsis at pitrou.net>
> wrote:
> > On Thu, 11 Apr 2013 14:11:21 -0700
> > Guido van Rossum <guido at python.org> wrote:
> >> Hey Antoine,
> >>
> >> Some of my Dropbox colleagues just drew my attention to the
> >> occurrence of case folding in pathlib.py. Basically, case folding
> >> as an approach to comparing pathnames is fatally flawed. The
> >> issues include:
> >>
> >> - most OSes these days allow the mounting of both case-sensitive
> >> and case-insensitive filesystems simultaneously
> >>
> >> - the case-folding algorithm on some filesystems is burned into the
> >> disk when the disk is formatted
> >
> > The problem is that:
> > - if you always make the comparison case-sensitive, you'll get false
> >   negatives
> > - if you make the comparison case-insensitive under Windows, you'll
> > get false positives
> >
> > My assumption was that, globally, the number of false positives in
> > case (2) is much less than the number of false negatives in case
> > (1).
> >
> > On the other hand, one could argue that all comparisons should be
> > case-sensitive *and* the proper way to test for "identical" paths
> > is to access the filesystem. Which makes me think, perhaps concrete
> > paths should get a "samefile" method as in os.path.samefile().
> >
> > Hmm, I think I'm tending towards the latter right now.
> 
> Python on OSX has been using (1) for a decade now without major
> problems.
> 
> Perhaps it would be best if the code never called lower() or upper()
> (not even indirectly via os.path.normcase()). Then any case-folding
> and path-normalization bugs are the responsibility of the application,
> and we won't have to worry about how to fix the stdlib without
> breaking backwards compatibility if we ever figure out how to fix this
> (which I somehow doubt we ever will anyway :-).
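
For reference, the "access the filesystem" approach quoted above can be
sketched with the existing os.path.samefile(). This is only an
illustration: the same_file() helper and the temporary file names are
made up for the example, and it is not the method proposed for pathlib.

```python
import os
import tempfile


def same_file(a, b):
    """Return True if both paths refer to the same existing file.

    Hypothetical helper: it asks the filesystem instead of comparing
    the path strings, so no case folding is involved.
    """
    try:
        return os.path.samefile(a, b)
    except OSError:
        # At least one of the paths does not exist or cannot be stat()ed.
        return False


with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "setup.py")
    with open(first, "w") as f:
        f.write("# placeholder\n")

    alias = os.path.join(d, ".", "setup.py")  # different spelling, same file
    other = os.path.join(d, "SETUP.PY")       # same file only if the
                                              # filesystem ignores case

    spelled_differently = first != alias
    alias_is_same = same_file(first, alias)
    case_variant_is_same = same_file(first, other)
```

The last result depends on the filesystem holding the temporary
directory, which is exactly the information a string comparison cannot
have.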

Ok, I've taken a look at the code. Right now lower() is used for two
purposes:

1. comparisons (__eq__ and __ne__)
2. globbing and matching

While (1) could be dropped, for (2) I think we want glob("*.py") to find
"SETUP.PY" under Windows. Anything else will probably be surprising to
users of that platform.
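
A minimal sketch of what case-insensitive matching means here, assuming
it is done by lowering both the name and the pattern before a
case-sensitive fnmatch. The match_name() helper is hypothetical and is
not the pathlib code itself; the last two lines use PurePath.match() as
it exists in the pathlib module shipped with the standard library,
which may differ from the draft discussed in this thread.

```python
import fnmatch
from pathlib import PurePosixPath, PureWindowsPath


def match_name(name, pattern, case_sensitive):
    """Match a single file name against a glob-style pattern.

    Hypothetical helper: when case_sensitive is false, both sides are
    lowered first, which is the kind of lower() call discussed above.
    """
    if not case_sensitive:
        name = name.lower()
        pattern = pattern.lower()
    # fnmatchcase() never applies platform-specific case normalization.
    return fnmatch.fnmatchcase(name, pattern)


windows_style = match_name("SETUP.PY", "*.py", case_sensitive=False)
posix_style = match_name("SETUP.PY", "*.py", case_sensitive=True)

# Pure paths do not touch the filesystem, so this runs on any platform.
pure_windows = PureWindowsPath("SETUP.PY").match("*.py")
pure_posix = PurePosixPath("SETUP.PY").match("*.py")
```

With Windows semantics the pattern matches; with POSIX semantics it
does not.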

> - On Linux, paths are really bytes; on Windows (at least NTFS), they
> are really (16-bit) Unicode; on Mac, they are UTF-8 in a specific
> normal form (except on some external filesystems).

pathlib simply relies on Python 3's sane handling of Unicode paths
(thanks to PEP 383). Bytes paths are never used internally.
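
To illustrate what PEP 383 provides: undecodable bytes in a file name
are mapped to lone surrogates by the "surrogateescape" error handler and
can be converted back to the original bytes without loss. The sketch
below hard-codes UTF-8 and an example file name for clarity; the real
filesystem encoding depends on the platform and locale.

```python
# A file name in Latin-1 bytes, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# The invalid byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")

# Without the handler the string cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_encoding_fails = False
except UnicodeEncodeError:
    strict_encoding_fails = True
```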

> - On Windows, short names are still supported, making the number of
> ways to spell the path for any given file even larger.

They are still supported but I doubt they are still relied on (long
filenames appeared in Windows 95!). I think in common situations we can
ignore their existence. Specialized tools like Mercurial may have to
know that they exist, in order to manage potential collisions (but
Mercurial isn't really the target audience for pathlib, and I don't
think they would be interested in such an abstraction).

Regards

Antoine.