[Python-Dev] casefolding in pathlib (PEP 428)

Guido van Rossum guido at python.org
Fri Apr 12 01:23:47 CEST 2013


On Thu, Apr 11, 2013 at 4:09 PM, Cameron Simpson <cs at zip.com.au> wrote:
> On 11Apr2013 14:11, Guido van Rossum <guido at python.org> wrote:
> | Some of my Dropbox colleagues just drew my attention to the occurrence
> | of case folding in pathlib.py. Basically, case folding as an approach
> | to comparing pathnames is fatally flawed. The issues include:
> |
> | - most OSes these days allow the mounting of both case-sensitive and
> | case-insensitive filesystems simultaneously
> |
> | - the case-folding algorithm on some filesystems is burned into the
> | disk when the disk is formatted
> |
> | - case folding requires domain knowledge, e.g. Turkish dotless I
> |
> | - normalization is a mess, even on OSX, where it's better defined than elsewhere
>
> Yes, but what's the use case? Specifically, _why_ are you comparing pathnames?

Um, this isn't about Dropbox. This is a warning against the inclusion
of any behavior depending on case folding in pathlib, based on
experience with case folding at Dropbox (both in the client and in the
server).
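
To make the issues quoted above concrete, here is a minimal Python sketch.
The helper naive_same_name() is hypothetical (it is not part of pathlib or
of any Dropbox code); it only shows how a comparison built on str.casefold()
disagrees with what real filesystems do.

```python
import unicodedata


def naive_same_name(a, b):
    """Hypothetical helper: compare two file names by case folding only."""
    return a.casefold() == b.casefold()


# The easy case behaves as people expect.
print(naive_same_name("readme", "README"))

# Turkish: U+0130 (capital I with dot above) folds to "i" followed by a
# combining dot, so it does not match a plain "i"; U+0131 (dotless i)
# does not match "I" either, although Turkish users treat them as a pair.
print(naive_same_name("\u0130", "i"))
print(naive_same_name("I", "\u0131"))

# Full case folding maps sharp s to "ss", so these two names compare equal
# here even on filesystems that would store them as two distinct files.
print(naive_same_name("stra\u00dfe", "STRASSE"))

# Normalization: the same visible name in composed and decomposed form
# differs unless it is normalized first, and filesystems differ on
# whether (and how) they normalize.
composed, decomposed = "\u00e9", "e\u0301"
print(naive_same_name(composed, decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

None of these answers depends on which filesystem the names live on, which
is exactly the problem: the right answer does.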

> To my mind case folding is just one mode of filename conflict.
> Surely there are others (forbidden characters in some domains, like
> colons; names significant only to a certain number of characters;
> and so forth).

Of course.

> Thus: what specific problem are you case-folding to address?

Why Dropbox is folding case really doesn't matter. But we have to deal
with it because users expect unreasonable things, such as having two
files, "readme" and "README", on a Linux box, and then syncing both
files to a box running Windows or OSX. (There are many other edge
cases, most not involving Linux at all.) We can't always use os.stat()
because some of this logic runs on a box where the files don't exist
(e.g. the server, or the Linux box in the above example).

> On a personal basis I would normally address this kind of thing
> with stat(), avoiding any app knowledge about pathname rules: does
> this path exist, or are these paths referencing the same file? But
> of course that doesn't solve the wider issue with Dropbox, where
> the rules differ per platform and where work can take place disparately
> on separate hosts.

You seem to be completely misunderstanding me. I am not asking for
help solving our problem. I am giving advice to avoid baking the wrong
solution to this class of problems into a new stdlib module.

> Imagining Dropbox, I'd guess there's a file tree in the backing store.
> What is its policy? Does it allow multiple files differing only by case?
> I can imagine that would be bad when the tree is presented on a case
> insensitive platform (eg Windows, default MacOSX).

You got the basic idea, but we can't just refuse to sync files that
might be problematic on some other box. Suppose someone is using
Dropbox just as a backup service for their Linux box. They shouldn't
have to worry about case clashes not being backed up.

> Taking the view that DropBox should avoid that situation (in what
> are doubtless several forms), does Dropbox pre-emptively prevent
> making files with specific names based on what is already in the
> store, or resolve them to the same object (hard link locally, or
> simply and less confusingly and more portably, diverting opens to
> the existing name like a CI filesystem would)?

We have lots of different solutions based on the specific situations.

> What about offline? That suggests that the forbidden modes should
> be known to the Dropbox app too. Is this the use case for comparing
> filenames instead of just doing a stat() to the local filesystem
> or to the remote backing store (via a virtual stat, as it were)?

Again, please don't try to solve our problem for us.

> What does Dropbox do if the local app is disabled and a user runs
> riot in the Dropbox directory, making conflicting names: allowed
> by the local FS but conflicting in the backing store or on other
> hosts?
>
> What does Dropbox do if a user makes conflicting files independently
> on different hosts, and then syncs?
>
> I just feel you've got a name conflict issue to resolve (and how
> that's done is partly just policy), and pathlib which offers some
> facilities related to that kind of thing. But there is a mismatch between
> what you actually need to do and what pathlib offers.
>
> Fixing your problem isn't necessarily a bugfix for pathlib.
>
> I think we need to know the wider context.

My reasoning is as follows. If pathlib supports functionality for
checking whether two paths spelled differently point to the same file,
users are going to rely on that functionality. But if the
implementation is based on knowing how and when to case fold, it will
definitely have bugs. So I am proposing to either remove that
functionality, or to implement it by consulting the filesystem. Which
of course has its own set of issues, for example if the file doesn't
exist yet, but there are ways to deal with that too.
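
As a rough sketch of what "consulting the filesystem" could look like,
assuming both paths are reachable from the machine doing the check: the
helper same_file() below is a hypothetical name, not a pathlib API. It
relies on os.path.samefile(), which compares what the OS reports for each
path rather than guessing from the spelling.

```python
import os


def same_file(path_a, path_b):
    """Hypothetical helper: ask the filesystem whether two paths are one file.

    Returns True or False when the OS can answer, and None when it cannot
    (for example because one of the paths does not exist yet).
    """
    try:
        return os.path.samefile(path_a, path_b)
    except OSError:
        # Covers missing files and permission errors; the caller has to
        # decide what "unknown" means in its own situation.
        return None
```

Whether same_file("readme", "README") returns True then depends on the
volume the files actually live on, not on a folding rule baked into the
library. The None result is the "file doesn't exist yet" case, which still
needs a policy of its own.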

--
--Guido van Rossum (python.org/~guido)

