[Python-Dev] casefolding in pathlib (PEP 428)

Guido van Rossum guido at python.org
Fri Apr 12 00:42:00 CEST 2013


On Thu, Apr 11, 2013 at 2:27 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 11 Apr 2013 14:11:21 -0700
> Guido van Rossum <guido at python.org> wrote:
>> Hey Antoine,
>>
>> Some of my Dropbox colleagues just drew my attention to the occurrence
>> of case folding in pathlib.py. Basically, case folding as an approach
>> to comparing pathnames is fatally flawed. The issues include:
>>
>> - most OSes these days allow the mounting of both case-sensitive and
>> case-insensitive filesystems simultaneously
>>
>> - the case-folding algorithm on some filesystems is burned into the
>> disk when the disk is formatted
>
> The problem is that:
> - if you always make the comparison case-sensitive, you'll get false
>   negatives
> - if you make the comparison case-insensitive under Windows, you'll get
>   false positives
>
> My assumption was that, globally, the number of false positives in case
> (2) is much less than the number of false negatives in case (1).
>
> On the other hand, one could argue that all comparisons should be
> case-sensitive *and* the proper way to test for "identical" paths is to
> access the filesystem. Which makes me think, perhaps concrete paths
> should get a "samefile" method as in os.path.samefile().
>
> Hmm, I think I'm tending towards the latter right now.

Python on OS X has been using (1), i.e. always comparing case-sensitively,
for a decade now without major problems.

Perhaps it would be best if the code never called lower() or upper()
(not even indirectly via os.path.normcase()). Then any case-folding
and path-normalization bugs are the responsibility of the application,
and we won't have to worry about changing the stdlib without
breaking backwards compatibility if we ever figure out how to do this
right (which I somehow doubt we ever will anyway :-).
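
A minimal sketch of the two approaches discussed above, assuming the
comparison is done by the application rather than by pathlib. The helper
names paths_equal() and same_file() are hypothetical; only
os.path.samefile() is an existing stdlib call. The lexical check never
folds case, and the filesystem check is used when the question is really
"do these two spellings name the same file?".

```python
import os
import tempfile


def paths_equal(a, b):
    """Purely lexical, case-sensitive comparison: no lower(), no normcase()."""
    return a == b


def same_file(a, b):
    """Ask the filesystem whether two spellings name the same file."""
    try:
        return os.path.samefile(a, b)
    except OSError:
        # At least one path does not exist, so the filesystem cannot answer.
        return False


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "README")
    with open(target, "w") as f:
        f.write("x")

    # A different spelling of the same file (extra "." component).
    alias = os.path.join(d, ".", "README")

    lexical = paths_equal(target, alias)   # strings differ
    physical = same_file(target, alias)    # same underlying file
```

The lexical comparison can report a false negative here, but it never
reports a false positive, and it behaves identically on every platform.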

Some other issues to be mindful of:

- On Linux, paths are really bytes; on Windows (at least NTFS), they
are really (16-bit) Unicode; on Mac, they are UTF-8 in a specific
normal form (except on some external filesystems).

- On Windows, short names are still supported, making the number of
ways to spell the path for any given file even larger.
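
To illustrate the normal-form point, here is a small sketch using the
stdlib unicodedata module. The file name is made up for the example, and
which normal form a given filesystem actually stores is not assumed
here; the sketch only shows that two spellings a user would consider
identical differ both as text and as bytes.

```python
import unicodedata

# The same visible name in two Unicode normal forms.
name_nfc = unicodedata.normalize("NFC", "caf\u00e9")  # 'e-acute' as one code point
name_nfd = unicodedata.normalize("NFD", "caf\u00e9")  # 'e' plus a combining accent

# As text, the two spellings are different strings.
same_text = (name_nfc == name_nfd)

# As UTF-8 bytes (what a bytes-based filesystem would see) they differ too.
bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")

# Normalizing both sides to one form makes them compare equal again, but
# choosing that form is a policy decision that belongs to the application.
same_after_normalizing = (unicodedata.normalize("NFC", name_nfd) == name_nfc)
```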

--
--Guido van Rossum (python.org/~guido)

