[Python-Dev] casefolding in pathlib (PEP 428)

Ronald Oussoren ronaldoussoren at mac.com
Fri Apr 12 17:29:58 CEST 2013


On 12 Apr, 2013, at 16:59, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Fri, 12 Apr 2013 14:43:42 +0200,
> Ronald Oussoren <ronaldoussoren at mac.com> wrote:
>> 
>> On 12 Apr, 2013, at 10:39, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>>> 
>>>> 
>>>> Perhaps it would be best if the code never called lower() or
>>>> upper() (not even indirectly via os.path.normcase()). Then any
>>>> case-folding and path-normalization bugs are the responsibility of
>>>> the application, and we won't have to worry about how to fix the
>>>> stdlib without breaking backwards compatibility if we ever figure
>>>> out how to fix this (which I somehow doubt we ever will anyway :-).
>>> 
>>> Ok, I've taken a look at the code. Right now lower() is used for two
>>> purposes:
>>> 
>>> 1. comparisons (__eq__ and __ne__)
>>> 2. globbing and matching
>>> 
>>> While (1) could be dropped, for (2) I think we want glob("*.py") to
>>> find "SETUP.PY" under Windows. Anything else will probably be
>>> surprising to users of that platform.
>> 
>> Globbing necessarily accesses the filesystem and could in theory do
>> the right thing, except for the minor detail of there not being an
>> easy way to determine whether the names in a particular folder are
>> compared case-sensitively or not.
> 
> It's also much less efficient, since you have to stat() every potential
> match. e.g. when encountering "SETUP.PY", you would have to stat() (or,
> rather, lstat()) both "setup.py" and "SETUP.PY" to check if they have
> the same st_ino.

I found a way to determine whether names in a directory are stored
case-sensitively, at least on OS X. That way you'd only need one call for the
directory, or, for glob.glob, one call per path element that contains
wildcard characters.

That API is definitely platform-specific.
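
The platform-specific API is not named here. As a portable stand-in, and only
a sketch, a directory can be probed by creating a temporary file with a
mixed-case name and checking whether the lower-cased name resolves to it. The
function name is_case_sensitive is hypothetical, and the probe assumes write
access to the directory:

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Heuristic probe (not the OS X API mentioned above).

    Creates a temporary file whose name contains upper-case letters and
    checks whether the lower-cased name also exists. Requires write
    access to the directory.
    """
    with tempfile.NamedTemporaryFile(prefix="CaseProbe", dir=directory) as probe:
        base = os.path.basename(probe.name)
        folded = os.path.join(directory, base.lower())
        # If the lower-cased name resolves, lookups here ignore case.
        return not os.path.exists(folded)


if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```

This costs one file creation per directory rather than one stat() per
candidate name, which is the trade-off discussed above.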

> 
>> At least for OSX the kernel will normalize names for you, at least
>> for HFS+, and therefore two names that don't compare equal with '=='
>> can refer to the same file (for example the NFKD and NFKC forms of
>> Löwe). 
> 
> I don't think differently normalized filenames are as common on OS X as
> differently cased filenames are on Windows, right?

The problem is more that HFS+ stores names with decomposed characters,
which basically means that accents are stored separately from their base
characters. In most input an accented character is a single code point,
and hence a naive comparison like this could fail to match:

    import os

    name = input()
    for fn in os.listdir('.'):
        # Naive: lower-cases both sides but ignores Unicode normalization
        if fn.lower() == name.lower():
            print("Found {} in the current directory".format(name))
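
A minimal sketch of a comparison that survives this, assuming it is enough to
bring both names to NFD before and after case folding. HFS+ uses its own
variant of decomposition, so this only approximates what the filesystem does.
The helper name same_name is hypothetical:

```python
import unicodedata


def same_name(a, b):
    """Compare two names after NFD normalization and case folding."""
    def key(s):
        decomposed = unicodedata.normalize("NFD", s)
        # Case folding can change the normalization form, so normalize again.
        return unicodedata.normalize("NFD", decomposed.casefold())
    return key(a) == key(b)


composed = "L\u00f6we"      # o-umlaut as a single code point
decomposed = "Lo\u0308we"   # 'o' followed by a combining diaeresis

print(composed.lower() == decomposed.lower())  # False
print(same_name(composed, decomposed))         # True
```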

Ronald

> 
> Regards
> 
> Antoine.
> 
> 


