Re: [Python-ideas] Small enhancement to os.path.splitext

When would you know in advance that you want 2 parts? Compare file.tar.gz to backup.2010.04.20.tar. I'd think the usual case would be: split; if the extension is gz or zip or ..., split again.
--- Bruce (via android)
On Apr 20, 2010 8:36 AM, "Tarek Ziadé" <ziade.tarek@gmail.com> wrote:
Hello
Currently, os.path.splitext splits a string at the last dot and gives you the piece after it:
    os.path.splitext('file.tar.gz')
    ('file.tar', '.gz')
In some cases, what we really want is the last two parts when splitting on the dots (as in my example). What about providing an extra argument to be able to grab more than one dot?
    os.path.splitext('file.tar.gz', numext=2)
    ('file', '.tar.gz')
If numext > number of dots, it will just split after the first dot:
    os.path.splitext('file.tar', numext=2)
    ('file', '.tar')
What do you think?
Regards Tarek -- Tarek Ziadé | http://ziade.org
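The numext proposal can be approximated today by calling os.path.splitext repeatedly. The following is only a sketch: splitext_n is a hypothetical helper name, not an existing or proposed stdlib function, and it peels extensions from the right, which is one possible reading of the proposal.

```python
import os.path


def splitext_n(path, numext=1):
    """Split off up to `numext` trailing extensions (hypothetical helper)."""
    root, ext = path, ""
    for _ in range(numext):
        root, last = os.path.splitext(root)
        if not last:
            # No further extension to peel off; stop early.
            break
        ext = last + ext
    return root, ext


print(splitext_n("file.tar.gz", numext=2))
print(splitext_n("file.tar", numext=2))
print(splitext_n("backup.2010.04.20.tar", numext=2))
```

The last call returns ('backup.2010.04', '.20.tar'), which is the kind of result Bruce's objection points at: a fixed count cannot tell a compound extension from dots that belong to the name.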

On Tue, Apr 20, 2010 at 5:48 PM, Bruce Leban <bruce@leapyear.org> wrote:
Right, that's from the user's point of view, and that's why I was thinking about numext in the first place: to be able to do what you describe (a loop with a split position that moves). My understanding now is that we would need to iterate from the longest match to the shortest match until we find a pattern that works, so numext should be applied from left to right. Or maybe simply drop that idea and just use path.endswith(extension)... :) In that case the only subtle case is when the filename starts with '.'. So what about a new API: os.path.hasext(path, extensions)? It would return True if the path matches one of the extensions provided in the extensions sequence (using .endswith, but ignoring the first dot if the filename starts with a dot). Tarek -- Tarek Ziadé | http://ziade.org
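A minimal sketch of the hasext idea as described above, assuming plain case-sensitive str.endswith matching on the base name. The function is not part of os.path; the handling of a single string argument is an added convenience, not something the proposal specifies.

```python
import os.path


def hasext(path, extensions):
    """Return True if the file name ends with one of `extensions` (sketch)."""
    if isinstance(extensions, str):
        # Assumption: accept a single extension as well as a sequence.
        extensions = (extensions,)
    name = os.path.basename(path)
    # Ignore a leading dot so that a dotfile such as '.gz' has no extension.
    stem = name[1:] if name.startswith(".") else name
    return stem.endswith(tuple(extensions))


print(hasext("/tmp/file.tar.gz", [".tar.gz", ".zip"]))
print(hasext(".gz", [".gz"]))
```

Note that this compares case-sensitively, so 'photo.JPG' does not match '.jpg'.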

On 20 April 2010 17:10, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
That should take into account filesystem case sensitivity if it's to meet user expectations. Which is a whole can of worms that you probably don't want to open - certainly not for the standard library without expecting a lot of work! The current stdlib functions - all of which simply split up the filename - avoid the problem, leaving it to the application to address case sensitivity issues. (Of course, if you want to start down this route, I'd suggest you start with a function which, given a directory name, determines whether that directory treats file names within it as case sensitive or not. I'm not even sure if this is possible to implement in any sane way, but it would be the key building block for any other path matching code, such as your os.path.hasext proposal). Paul.

On Tue, Apr 20, 2010 at 6:25 PM, Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure I follow the issue. Do you mean that in the same filesystem, each directory can treat case sensitivity differently? I wasn't aware of that. How would this affect the extension, btw? I can imagine that the path and extensions could be normalized before the matching job, but I don't see any other issue. Do you have an example?
-- Tarek Ziadé | http://ziade.org

On Tue, Apr 20, 2010 at 12:47 PM, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I am not sure to follow the issue. Do you mean that in the same filesystem, each directory can treat case sensitivity differently ? I wasn't aware of that.
I've never seen a filesystem that doesn't treat case-sensitivity consistently, regardless of path. But a single hierarchy composed of multiple filesystems may have different behaviors in different directories, because they're mounted from different filesystems.
I suspect a substantial part of the problem is really that Python doesn't expose an API to normalize a path based on its actual location in the hierarchy; the operating system's conventions are used, but there is nothing that deals with a path on a mounted filesystem that has non-default behavior when compared to the traditional OS-centric expectations. For a specific example, consider mounting a case-insensitive HFS filesystem from Unix (well, most current Unixes): If the HFS is mounted at /MyHfsMount (on an ext3 filesystem), and contains a hierarchy /Folder/SomeFile.txt, the file would have an absolute path of /MyHfsMount/Folder/SomeFile.txt But the normalized path should be: /MyHfsMount/folder/somefile.txt (Note that each path segment is normalized according to the filesystem it's on, rather than the running OS.) -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On Tue, 20 Apr 2010 18:47:50 +0200 Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I don't think so. On the other hand, with modern file systems, if I need a directory that is case-insensitive on a file system that's case-sensitive, I'll just create a case-insensitive file system at that point. Doesn't even require root privs these days.
This gets back to user expectations - if foo.c and foo.C are different files, then foo.c doesn't have a .C extension. If they aren't different files, then it does. Which means you have to know whether files are being treated in a case insensitive manner or not before you can normalize the names. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 20 April 2010 17:47, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I am not sure to follow the issue. Do you mean that in the same filesystem, each directory can treat case sensitivity differently ? I wasn't aware of that.
Apologies, I was unclear (I was trying to avoid the term "filesystem" which isn't an obvious concept on Windows). Putting it more simply, you need to be able to determine the case sensitivity of the filesystem to determine if a file "has extension .xxx".
As Mike Meyer explained, whether file foo.C "has extension .c" depends (as far as the user is concerned) on whether the filesystem it's on is case sensitive. Many unix utility ports on Windows don't consider this, and they are quite annoying when the problem hits you (which admittedly isn't that often). Having an os.path.has_extension function in Python which doesn't consider case sensitivity would be an attractive nuisance, encouraging people to write subtly broken code like this. I'm -1 on such a function. Having said that, if you can write a function which detects filesystem case sensitivity (and write a "correct" has_extension function based on it), you'd be my hero (I believe it's a pretty difficult issue, if not impossible in the general case). Paul.

On Tue, Apr 20, 2010 at 11:42 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Ok well, I doubt I'll be your hero, this is way over my head ;) The only simple way that comes to my mind is to write a test file to try it out, but that's a hack. But being able to detect it sounds like something we should definitely have in os/os.path.
-- Tarek Ziadé | http://ziade.org
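The "write a test file" hack mentioned above could look roughly like the sketch below. is_case_sensitive is a hypothetical name; the probe needs write access to the directory and only reports how that one directory behaves at the time of the call.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Best-effort probe: create a temp file, then look it up in swapped case."""
    # The prefix contains letters, so the swapped-case name always differs.
    with tempfile.NamedTemporaryFile(prefix="CaseProbe_", dir=directory) as probe:
        head, tail = os.path.split(probe.name)
        swapped = os.path.join(head, tail.swapcase())
        # If the swapped-case name resolves, names are matched case-insensitively.
        return not os.path.exists(swapped)


print(is_case_sensitive(tempfile.gettempdir()))
```

As the thread notes, this is a hack: it cannot answer the question for a file system that is not mounted yet or a directory the program cannot write to.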

On 20 April 2010 22:55, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
Ok well, I doubt I'll be your hero, this is way over my head ;)
:-)
Actually, I just did a bit of digging - on Windows, GetVolumeInformation or GetVolumeInformationByHandleW look like they'd do the job. Does anyone know how to find out if a filesystem is case sensitive on Linux/MacOS/Unix...? Paul.

Paul Moore wrote:
I don't think that's really true. It's common to find, e.g., files ending with .jpg, .JPG, or other variations on a case-sensitive filesystem that got there by being copied from a case-insensitive one, or simply created by people using tools that don't care about the case of the extension. Seems to me the best thing to do is always compare extensions case-insensitively unless you have a specific reason to do otherwise. So I would recommend that any proposed hasextension() function should be case-insensitive by default. -- Greg

On Tue, Apr 20, 2010 at 8:18 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I've certainly seen cases where the case was relevant in order to determine how a file was handled. There have been compilers that interpreted .c as C and .C as C++. (Not that better options weren't available, but that was part of the default interpretation.) This was GCC, so... not particularly obscure. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On Wed, 21 Apr 2010 12:18:41 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
That's probably true for the vast majority of extensions, but I chose .C vs. .c for a reason - gcc thinks they are different:
    bhuda% cmp bunny.c bunny.C
    bhuda% file bunny.?
    bunny.C: ASCII C program text
    bunny.c: ASCII C program text
    bhuda% gcc bunny.c
    bhuda% g++ bunny.c
    bunny.c: In function 'int main(int, char**)':
    bunny.c:25: error: 'fork' was not declared in this scope
    bunny.c:26: error: 'wait' was not declared in this scope
    bhuda% gcc bunny.C
    bunny.C: In function 'int main(int, char**)':
    bunny.C:25: error: 'fork' was not declared in this scope
    bunny.C:26: error: 'wait' was not declared in this scope
    bhuda%
Reasonable, but if you have a way to make it case-sensitive, you're back where we started from: needing to figure out whether the files in question care about case. Given the number of options here - including that the file may not exist yet, and ditto for the file system it's going to be on - possibly the solution is to leave off that option, and document that this case has to be dealt with by the caller. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 21/04/10 12:47, Mike Meyer wrote:
It's not whether the *files* care about case, it's whether the *application* cares about case. For example, an application that deals with image files should probably recognise .jpg, .JPG, .Jpg, etc. as all meaning the same thing. An application that deals with source files, on the other hand, will probably want to distinguish between .c and .C. So I think it's likely that in any particular case, the caller of hasextension() will know whether he wants case-sensitivity or not. -- Greg
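One way to let the caller decide, as suggested above, is an explicit flag. This is only a sketch: the hasextension name comes from the thread, but the case_sensitive parameter and its default are assumptions made for illustration.

```python
import os.path


def hasextension(path, ext, case_sensitive=False):
    """Compare the last extension of `path` with `ext` (sketch)."""
    actual = os.path.splitext(path)[1]
    if case_sensitive:
        return actual == ext
    # Default: ignore case, so .jpg, .JPG and .Jpg all match.
    return actual.lower() == ext.lower()


print(hasextension("photo.JPG", ".jpg"))
print(hasextension("main.C", ".c", case_sensitive=True))
```

An image tool would use the default, while a build tool that distinguishes .c from .C would pass case_sensitive=True.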

On 21 April 2010 02:33, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
True, but equally, the natural reaction for any but the most careful of programmers would be to say something like hasextension(fn, '.txt') without thinking. Hence my assertion that it's an attractive nuisance. In practice, it's no worse than splitext(fn)[1] == '.txt', but it gives the impression that it should do a better job (even though, as you say, it can't). Once you get beyond the most basic operations, filename handling is hard. (I haven't even touched on MacOS unicode normalisation issues yet...) Getting it 90% right is easy (99% if you cover case sensitivity) - it's that last 1% that bites. (And yes, the 90%/99% split is, in my view, about accurate). Paul.

Paul Moore wrote:
That's why I think that case-insensitivity should be the default. Most of the time, people don't use case alone to distinguish between different file types. It's only in rare situations such as .c vs .C that the case matters, and in those situations, you know that you're dealing with something special. In any case, I don't believe it has anything to do with the *filesystem* on which the file happens to reside. As long as the filesystem is case-preserving, an application can regard file extensions as being case-sensitive or not as it pleases. -- Greg

On Wed, Apr 21, 2010 at 6:10 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
As long as the filesystem is case-preserving
But that's exactly the catch; you see things like .JPG and .GIF primarily on files that are copied from case-insensitive filesystems, and, less often, from software that hasn't been updated since those were pervasive on the operating system(s) the software was intended for (commonly written by programmers without experience with case-preserving systems). As such legacy fades (however slowly), should we *increase* the amount of code that deals with it, or should we move on? Keep in mind that with the advent of Python 3, we're putting a lot of effort into shedding much more recent "legacy". -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On 21 April 2010 11:58, Fred Drake <fdrake@acm.org> wrote:
As such legacy fades (however slowly), should we *increase* the amount of code that deals with it, or should we move on?
It's not clear to me that it's all "legacy". I had the impression that MacOS used a case insensitive filesystem - is that right? Certainly, I know that MacOS uses Unicode normalisation, so simple string comparison is definitely not correct. (I only use MacOS as an example to avoid the assumption that this is a Windows-only issue - there are also case-insensitive filesystems available for Unix in general, if nothing else SMBFS). Certainly, non-case-preserving systems are a dying breed. But from what I've read, the situation around case sensitivity is far less clear - I've even seen comments from Unix users that claim they would like Unix to move away from case sensitivity. (I'm not qualified to judge the relevance or importance of such claims, I merely note that they exist). Like it or not, treating filenames as uninterpreted byte strings will violate some users' expectations. Library support for dealing with user expectations at a higher level would be good. Unfortunately, from my reading, it seems like at least on Linux, filesystem writers are expected to implement their own path handling (at least, that's how I interpret the FUSE documentation, and I couldn't find any lower level filesystem driver documentation that implied otherwise). So, at the filesystem level, it seems that on Linux all bets are off - I could, in theory, write a filesystem that was case insensitive for consonants, but case sensitive for vowels. And stored filenames in bit-reversed UTF-8. Bleh. Paul.
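To illustrate why simple string comparison is not enough once Unicode normalisation is involved, here is a small sketch using the stdlib unicodedata module. The choice of NFC is an assumption for the example; the thread does not say which form any particular file system uses.

```python
import unicodedata


def same_name(a, b, form="NFC"):
    """Compare two file names after Unicode normalisation (sketch)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "caf\u00e9.txt"     # 'e' with acute accent as a single code point
decomposed = "cafe\u0301.txt"  # 'e' followed by a combining acute accent

print(composed == decomposed)
print(same_name(composed, decomposed))
```

The two names render identically, yet the plain comparison is False and only the normalised comparison is True.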

On Wed, 21 Apr 2010 16:06:01 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
Half right: MacOS has a couple of native file systems, and they can all be either case-sensitive or case-insensitive. Most Mac applications now work correctly on case-sensitive file systems, so you can safely make all your Mac file systems case-sensitive, which wasn't always the case. If you're using unix software - well, it's not all case-insensitive friendly. zfs file systems - used on Solaris and BSD, and available though orphaned for MacOS - also consider case sensitivity optional.
Unicode normalization is an option for zfs file systems. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 04/22/10 01:24, Mike Meyer wrote:
how about:
    def hasextension(filename, ext, smartness=0):
        # when smartness = 0 filename is treated as uninterpreted bytestring
        # when smartness = 10, filename will be:
        # - unicode normalization
        # - case normalization
        # - accent normalization
        # - synonym normalization
        # - language normalization (have filename in French?)
        # - mistyping normalization
        # - extension normalization ('.jpg' matches '.jpg' and '.jpeg')
        # - wrapper normalization (.tgz is normalized to .tar.gz)
        # - filetype normalization (.mpg matches .mov, .avi, .ogg, etc)
        ... magic code follows ...
:-)

On Wed, 21 Apr 2010 13:33:30 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Actually, it's the file system. The *application* case is dealt with by the option of doing a case-insensitive test. If you turn that off, then you have to figure out whether or not the file system is case insensitive to do it right.
Right. I'm talking about the case where the developer knows he wants case sensitivity. In that case, you have to know whether or not the file system is case sensitive to know whether or not "file.c" would get opened if the application tried to open "file.C". <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On Wed, Apr 21, 2010 at 10:19 AM, Mike Meyer <mwm-keyword-python.b4bdba@mired.org> wrote:
Right. I'm talking about the case where the developer knows he wants case sensitivity.
And can therefore call a different library function, or pass an optional flag argument.
uhm ... Why? If it would, there really isn't any way to tell what the "real" name is ... or do you just mean that it should cease to be case-sensitive in exactly those cases where file.c is file.C, as a way of breaking less than it otherwise would? -jJ

Mike Meyer wrote:
What? I thought the use case we were talking about is where you have a filename and you want to make a guess about what kind of data it contains, based on the extension. I don't think I've ever wanted to find out whether a file will get opened if I use the wrong case. I just trust that the filename I'm given is suitable for passing to open(), whatever case it might have. I also can't think why you might be worried about that just for the extension, and not the rest of the pathname. -- Greg

On 22 April 2010 00:09, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I believe that the use case being discussed goes as follows:
1. User supplies name FILE.C to the program
2. Program has different behaviours for .c and .C files
3. Program needs to decide whether to use the .c or .C behaviour
Option 1 is to use .C always on the basis that you believe that the user should expect case sensitive behaviour.
Option 2 is to acknowledge that if the filesystem is case insensitive, the user will expect the same behaviour for both .c and .C files (assume .c is the "obvious" one) and do that.
Note that with option 2, the user isn't necessarily being perverse by expecting .c behaviour. Maybe a legacy program created the file, using the all-uppercase name, and the user didn't (want to, remember to) change it. Maybe the user had caps lock on by accident, and doesn't want to waste time correcting the whole line. The point is that users using a case insensitive filesystem have a reasonable expectation that programs will ignore the case of filenames (even in cases where they don't have to do so). Violating that expectation is bad.
A second use case:
1. User has a directory containing a number of temporary files, all with extension .tmp (but in varying case)
2. User has a "cleanup" program which deletes temporary files
The user will quite reasonably expect the cleanup program to delete all files with extension .tmp, regardless of the case. Essentially, in this simplified example, any behaviour that differs from DEL *.TMP would be considered a bug.
Hope this clarifies things a bit. Paul
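The difference between the two options in the first use case can be sketched as follows. classify_source and its fs_case_sensitive parameter are hypothetical; how the program learns whether the file system is case sensitive is exactly the open question in this thread, so the flag is simply passed in.

```python
import os.path

# The .c / .C mapping follows the gcc behaviour mentioned earlier in the thread.
LANGUAGE_BY_EXT = {".c": "C", ".C": "C++"}


def classify_source(filename, fs_case_sensitive):
    """Pick a language for `filename` (sketch of options 1 and 2)."""
    ext = os.path.splitext(filename)[1]
    if not fs_case_sensitive:
        # Option 2: on a case-insensitive file system, fold to the "obvious" .c
        ext = ext.lower()
    return LANGUAGE_BY_EXT.get(ext, "unknown")


print(classify_source("FILE.C", fs_case_sensitive=True))
print(classify_source("FILE.C", fs_case_sensitive=False))
```

With fs_case_sensitive=True the sketch follows option 1 and treats FILE.C as C++; with False it follows option 2 and treats it as C.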

On Thu, 22 Apr 2010 11:09:53 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
How are these different? Suppose the user typed the filename "foo.c" and I am trying to decide whether it has the ".C" extension. If "foo.C" exists on the disk and the user knows that "foo.c" and "foo.C" are the same file, it's reasonable for the user to expect the application to figure out that this file has the ".C" extension, even though they typed ".c". So whether or not the comparison should be case sensitive depends on whether or not the file system is case sensitive. Likewise, if we're *saving* the file, and the user expects us to automatically add the right extension, then if the file system is case sensitive, ".c" is the wrong extension. If the file system is case insensitive, it's not clear to me what the user might expect. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Mike Meyer wrote:
But you don't know whether the user is expecting this. If he knows that gcc distinguishes between .c and .C, it may be that he is expecting foo to be compiled as a C file rather than C++, and he made a mistake when he named the file foo.C. -- Greg

On Fri, 23 Apr 2010 12:33:10 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Given that the argument for making the test case-insensitive to start with was that the user expects .JPG and .jpg to be the same, I think you just shot down the case for making the test case-insensitive as well. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

Mike Meyer wrote:
I only said it should be case-insensitive as a *default*. The .c/.C thing is a special case, which the application can deal with by explicitly requesting a case-sensitive comparison. Most of the time you'll be dealing with things like .jpg/.JPG, where I don't think any harm could be caused by treating the extensions case-insensitively always. I still don't think you can decide based on the file system behaviour. What if the user creates his files on a case-insensitive system and then copies them to a case-sensitive one or vice versa? The intended meaning of the extensions doesn't suddenly change when that happens. -- Greg

On Tue, Apr 20, 2010 at 5:48 PM, Bruce Leban <bruce@leapyear.org> wrote:
Right, that's from the user point of view, and that's why I was thinking about numext in the first place, to be able to do what you describe (a loop with a split position that moves) my undertsanding now is that we would need to iterate from the longest-match to the shortest-match, until we find a pattern that works, so numext should be done from the left to the right. Or maybe simply drop that idea and just use path.endswith(extension)... :) In that case the only subtle case is when the filename starts with '.', So what about a new API : os.path.hasext(path, extensions) That would return True if the path match on extension provided in the extensions sequence (using .endswith, but ignoring the first dot if the filename starts with a dot) Tarek -- Tarek Ziadé | http://ziade.org

On 20 April 2010 17:10, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
That should take into account filesystem case sensitivity if it's to meet user expectations. Which is a whole can of worms that you probably don't want to open - certainly not for the standard library without expecting a lot of work! The current stdlib functions - all of which simply split up the filename - avoid the problem, leaving it to the application to address case sensitivity issues. (Of course, if you want to start down this route, I'd suggest you start with a function which, given a directory name, determines if that directory's treats file names within it as case sensitive or not. I'm not even sure if this is possible to implement in any sane way, but it would be the key building block for any other path matching code, such as your os.path.hasext proposal). Paul.

On Tue, Apr 20, 2010 at 6:25 PM, Paul Moore <p.f.moore@gmail.com> wrote:
I am not sure to follow the issue. Do you mean that in the same filesystem, each directory can treat case sensivity differently ? I wasn't aware of that. How would this affect the extension btw ? I can imagine that the path+extensions could to be normalized before the matching job, but I don't see other issue, do you have an example ?
Paul.
-- Tarek Ziadé | http://ziade.org

On Tue, Apr 20, 2010 at 12:47 PM, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I am not sure to follow the issue. Do you mean that in the same filesystem, each directory can treat case sensivity differently ? I wasn't aware of that.
I've never seen a filesystem that doesn't treat case-sensitivity consistently, regardless of path. But a single hierarchy composed of multiple filesystems may have different behaviors in different directories, because they're mounted from different filesystems.
I suspect a substantial part of the problem is really that Python doesn't expose an API to normalize a path based on it's actual location in the hierarchy; the operating system is used, but nothing that deals with a path on a mounted filesystem that has non-default behavior when compared to the traditional OS-centric expectations. For a specific example, consider mounting a case-insensitive HFS filesystem from Unix (well, most current Unixes): If the HFS is mounted at /MyHfsMount (on an ext3 filesystem), and contains a hierarchy /Folder/SomeFile.txt, the file would have an absolute path of /MyHfsMount/Folder/SomeFile.txt But the normalized path should be: /MyHfsMount/folder/somefile.txt (Note that each path segment is normalized according to the filesystem it's on, rather than the running OS.) -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On Tue, 20 Apr 2010 18:47:50 +0200 Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I don't think so. On the other hand, with modern file systems, if I need a directory that is case-insensitive on a file system that's case-sensitive, I'll just create a case-sensitive file system at that point. Doesn't even require root privs these days.
This gets back to user expectations - if foo.c and foo.C are different files, then foo.c doesn't have a .C extension. If they aren't different file, then it does. Which means you have to know whether files are being treated in a case insensitive manner or not before you can normalize the names. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 20 April 2010 17:47, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
I am not sure to follow the issue. Do you mean that in the same filesystem, each directory can treat case sensivity differently ? I wasn't aware of that.
Apologies, I was unclear (I was trying to avoid the term "filesystem" which isn't an obvious concept on Windows). Putting it more simply, you need to be able to determine the case sensitivity of the filesystem to determine if a file "has extension .xxx".
As Mike Meyer explained, whether file foo.C "has extension .c" depends (as far as the user is concerned) on whether the filesystem it's on is case sensitive. Many unix utility ports on Windows don't consider this, and they are quite annoying when the problem hits you (which admittedly isn't that often). Having an os.path.has_extension function in Python which doesn't consider case sensitivity would be an attractive nuisance, encouraging people to write subtly broken code like this. I'm -1 on such a function. Having said that, if you can write a function which detects filesystem case sensitivity (and write a "correct" has_extension function based on it), you'd be my hero (I believe it's a pretty difficult issue, if not impossible in the general case). Paul.

On Tue, Apr 20, 2010 at 11:42 PM, Paul Moore <p.f.moore@gmail.com> wrote:
Ok well, I doubt I'll be your hero, this is way over my head ;) The only simple way that comes in my mind is to write a test file to try it out but that's a hack But being able to detect it sounds like something we should definitely have in os/os.path.
Paul.
-- Tarek Ziadé | http://ziade.org

On 20 April 2010 22:55, Tarek Ziadé <ziade.tarek@gmail.com> wrote:
Ok well, I doubt I'll be your hero, this is way over my head ;)
:-)
Actually, I just dis a bit of digging - on Windows, GetVolumeInformation, or GetVolumeInformationByHandleW looks like it'd do the job. Does anyone know how to find out if a filesystem is case sensitive on Linux/MacOS/Unix...? Paul.

Paul Moore wrote:
I don't think that's really true. It's common to find, e.g., files ending with .jpg, .JPG, or other variations on a case-sensitive filesystem that got there by being copied from a case-insensitive one, or simply created by people using tools that don't care about the case of the extension. Seems to me the best thing to do is always compare extensions case-insensitively unless you have a specific reason to do otherwise. So I would recommend that any proposed hasextension() function should be case-insensitive by default. -- Greg

On Tue, Apr 20, 2010 at 8:18 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I've certainly seen cases where the case was relevant in order to determine how a file was handled. There have been compilers that interpreted .c as C and .C as C++. (Not that better options weren't available, but that was part of the default interpretation.) This was GCC, so... not particularly obscure. -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On Wed, 21 Apr 2010 12:18:41 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
That's probably true for the vast majority of extensions, but I chose .C vs. .c for a reason - gcc thinks they are different: bhuda% cmp bunny.c bunny.C bhuda% file bunny.? bunny.C: ASCII C program text bunny.c: ASCII C program text bhuda% gcc bunny.c bhuda% g++ bunny.c bunny.c: In function 'int main(int, char**)': bunny.c:25: error: 'fork' was not declared in this scope bunny.c:26: error: 'wait' was not declared in this scope bhuda% gcc bunny.C bunny.C: In function 'int main(int, char**)': bunny.C:25: error: 'fork' was not declared in this scope bunny.C:26: error: 'wait' was not declared in this scope bhuda%
Reasonable, but if you have a way to make it case-sensitive, you're back where we started from: needing to figure out whether the files in question care about case. Given the number of options here - including that the file may not exist yet, and ditto for the file system it's going to be on - possibly the solution is to leave off that option, and document that this case has to be dealt with by the caller. <mike -- Mike Meyer <mwm@mired.org> http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

On 21/04/10 12:47, Mike Meyer wrote:
It's not whether the *files* care about case, it's whether the *application* cares about case. For example, an application that deals with image files should probably recognise .jpg, .JPG, .Jpg, etc. as all meaning the same thing. An application that deals with source files, on the other hand, will probably want to distinguish between .c and .C. So I think it's likely that in any particular case, the caller of hasextension() will know whether he wants case-sensitivity or not. -- Greg

On 21 April 2010 02:33, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
True, but equally, the natural reaction for any but the most careful of programmers would be to say something like hasextension(fn, '.txt') without thinking. Hence my assertion that it's an attractive nuisance. In practice, it's no worse than splitext(fn) == '.txt', but it gives the impression that it should do a better job (even though, as you say, it can't). Once you get beyond the most basic operations, filename handling is hard. (I haven't even touched on MacOS unicode normalisation issues yet...) Getting it 90% right is easy (99% if you cover case sensitivity) - it's that last 1% that bites. (And yes, the 90%/99% split is, in my view, about accurate). Paul.

Paul Moore wrote:
That's why I think that case-insensitivity should be the default. Most of the time, people don't use case alone to distinguish between different file types. It's only in rare situations such as .c vs .C that the case matters, and in those situations, you know that you're dealing with something special. In any case, I don't believe it has anything to do with the *filesystem* on which the file happens to reside. As long as the filesystem is case-preserving, an application can regard file extensions as being case-sensitive or not as it pleases. -- Greg

On Wed, Apr 21, 2010 at 6:10 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
As long as the filesystem is case-preserving
But that's exactly the catch; you see things like .JPG and .GIF primarily on files that are copied from case-insensitive filesystems, and, less often, from software that hasn't been updated since those were pervasive on the operating system(s) the software was intended for (commonly written by programmers without experience with case-preserving systems). As such legacy fades (however slowly), should we *increase* the amount of code that deals with it, or should we move on? Keep in mind that with the advent of Python 3, we're putting a lot of effort into shedding much more recent "legacy". -Fred -- Fred L. Drake, Jr. <fdrake at gmail.com> "Chaos is the score upon which reality is written." --Henry Miller

On 21 April 2010 11:58, Fred Drake <fdrake@acm.org> wrote:
As such legacy fades (however slowly), should we *increase* the amount of code that deals with it, or should we move on?
It's not clear to me that it's all "legacy". I had the impression that MacOS used a case insensitive filesystem - is that right? Certainly, I know that MacOS uses Unicode normalisation, so simple string comparison is definitely not correct. (I only use MacOS as an example to avoid the assumption that this is a Windows-only issue - there are also case-insensitive filesystems available for Unix in general, if nothing else SMBFS). Certainly, non-case-preserving systems are a dying breed. But from what I've read, the situation around case sensitivity is far less clear - I've even seen comments from Unix users that claim they would like Unix to move away from case sensitivity. (I'm not qualified to judge the relevance or importance of such claims, I merely note that they exist). Like it or not, treating filenames as uninterpreted byte strings will violate some users' expectations. Library support for dealing with user expectations at a higher level would be good. Unfortunately, from my reading, it seems like at least on Linux, filesystem writers are expected to implement their own path handling (at least, that's how I interpret the FUSE documentation, and I couldn't find any lower level filesystem driver documentation that implied otherwise). So, at the filesystem level, it seems that on Linux all bets are off - I could, in theory, write a filesystem that was case insensitive for consonants, but case sensitive for vowels. And stored filenames in bit-reversed UTF-8. Bleh. Paul.
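
For readers unfamiliar with the normalisation issue mentioned above, a small sketch using the standard unicodedata module. The helper name `same_name` and the choice of NFC as the default form are assumptions for illustration; which form a given filesystem uses is not something this snippet can tell you.

```python
import unicodedata

# Two spellings of the same visible file name.
composed = "caf\u00e9.txt"      # "é" as a single code point
decomposed = "cafe\u0301.txt"   # "e" followed by a combining acute accent


def same_name(a, b, form="NFC"):
    """Compare two names after bringing both to the same normalisation form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: plain string comparison
print(same_name(composed, decomposed))  # True: equal once normalised
```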

On Wed, 21 Apr 2010 16:06:01 +0100 Paul Moore <p.f.moore@gmail.com> wrote:
Half right: MacOS has a couple of native file systems, and they can all be either case-sensitive or case-insensitive. Most Mac applications now work correctly on case-sensitive file systems, so you can safely make all your Mac file systems case-sensitive, which wasn't always the case. If you're using unix software - well, it's not all case-insensitive friendly. zfs file systems - used on Solaris and BSD, and available though orphaned for MacOS - also consider case sensitivity optional.
Unicode normalization is an option for zfs file systems. <mike -- Mike Meyer <mwm@mired.org>

On 04/22/10 01:24, Mike Meyer wrote:
how about:

    def hasextension(filename, ext, smartness=0):
        # when smartness = 0 filename is treated as uninterpreted bytestring
        # when smartness = 10, filename will be:
        # - unicode normalization
        # - case normalization
        # - accent normalization
        # - synonym normalization
        # - language normalization (have filename in French?)
        # - mistyping normalization
        # - extension normalization ('.jpg' matches '.jpg' and '.jpeg')
        # - wrapper normalization (.tgz is normalized to .tar.gz)
        # - filetype normalization (.mpg matches .mov, .avi, .ogg, etc)
        ... magic code follows ...

:-)

On Wed, 21 Apr 2010 13:33:30 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Actually, it's the file system. The *application* case is dealt with by the option of doing a case-insensitive test. If you turn that off, then you have to figure out whether or not the file system is case insensitive to do it right.
Right. I'm talking about the case where the developer knows he wants case sensitivity. In that case, you have to know whether or not the file system is case sensitive to know whether or not "file.c" would get opened if the application tried to open "file.C". <mike -- Mike Meyer <mwm@mired.org>
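
One way to find out, sketched under some loud assumptions: the directory already exists, the process may create a file in it, and the answer only holds for that directory at that moment. The function name `is_case_sensitive` is hypothetical. It does not help with the cases raised earlier in the thread, where the file or the filesystem does not exist yet.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Probe a directory by creating a temporary file and looking it up
    under an upper-cased name. Returns True if the lookup fails."""
    # The lowercase prefix guarantees that upper() changes the name.
    with tempfile.NamedTemporaryFile(prefix="casetest_", dir=directory) as probe:
        head, tail = os.path.split(probe.name)
        return not os.path.exists(os.path.join(head, tail.upper()))


print(is_case_sensitive(tempfile.gettempdir()))
```

The printed value depends on the platform and on how the volume was formatted, so no expected output is shown.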

On Wed, Apr 21, 2010 at 10:19 AM, Mike Meyer <mwm-keyword-python.b4bdba@mired.org> wrote:
Right. I'm talking about the case where the developer knows he wants case sensitivity.
And can therefore call a different library function, or pass an optional flag argument.
uhm ... Why? If it would, there really isn't any way to tell what the "real" name is ... or do you just mean that it should cease to be case-sensitive in exactly those cases where file.c is file.C, as a way of breaking less than it otherwise would? -jJ

Mike Meyer wrote:
What? I thought the use case we were talking about is where you have a filename and you want to make a guess about what kind of data it contains, based on the extension. I don't think I've ever wanted to find out whether a file will get opened if I use the wrong case. I just trust that the filename I'm given is suitable for passing to open(), whatever case it might have. I also can't think why you might be worried about that just for the extension, and not the rest of the pathname. -- Greg

On 22 April 2010 00:09, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I believe that the use case being discussed goes as follows:

1. User supplies name FILE.C to the program
2. Program has different behaviours for .c and .C files
3. Program needs to decide whether to use the .c or .C behaviour

Option 1 is to use .C always on the basis that you believe that the user should expect case sensitive behaviour. Option 2 is to acknowledge that if the filesystem is case insensitive, the user will expect the same behaviour for both .c and .C files (assume .c is the "obvious" one) and do that.

Note that with option 2, the user isn't necessarily being perverse by expecting .c behaviour. Maybe a legacy program created the file, using the all-uppercase name, and the user didn't (want to, remember to) change it. Maybe the user had caps lock on by accident, and doesn't want to waste time correcting the whole line. The point is that users using a case insensitive filesystem have a reasonable expectation that programs will ignore the case of filenames (even in cases where they don't have to do so). Violating that expectation is bad.

A second use case:

1. User has a directory containing a number of temporary files, all with extension .tmp (but in varying case)
2. User has a "cleanup" program which deletes temporary files

The user will quite reasonably expect the cleanup program to delete all files with extension .tmp, regardless of the case. Essentially, in this simplified example, any behaviour that differs from DEL *.TMP would be considered a bug.

Hope this clarifies things a bit.

Paul
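
The second use case can be sketched as follows. This is only an illustration of the matching step, assuming the application has decided that case never matters for this extension; the function name `find_temp_files` is hypothetical, and the sketch lists candidates rather than deleting anything.

```python
import os


def find_temp_files(directory, ext=".tmp"):
    """Return the names of regular files in directory whose last extension
    matches ext, ignoring case (roughly what DEL *.TMP would select)."""
    ext = ext.lower()
    return sorted(
        entry.name
        for entry in os.scandir(directory)
        if entry.is_file() and os.path.splitext(entry.name)[1].lower() == ext
    )
```

A caller would then pass each name to os.remove after joining it with the directory. Lower-casing both sides covers ASCII extensions such as .tmp; it makes no attempt at the Unicode normalisation issues mentioned elsewhere in the thread.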

On Thu, 22 Apr 2010 11:09:53 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
How are these different? Suppose the user typed the filename "foo.c" and I am trying to decide if it has the ".C" extension. If "foo.C" exists on the disk and the user knows that "foo.c" and "foo.C" are the same file, it's reasonable for the user to expect the application to figure out that this file has the ".C" extension, even though they typed ".c". So whether or not the comparison should be case sensitive depends on whether or not the file system is case sensitive. Likewise, if we're *saving* the file, and the user expects us to automatically add the right extension, then if the file system is case sensitive, ".c" is the wrong extension. If the file system is case insensitive, it's not clear to me what the user might expect. <mike -- Mike Meyer <mwm@mired.org>

Mike Meyer wrote:
But you don't know whether the user is expecting this. If he knows that gcc distinguishes between .c and .C, it may be that he is expecting foo to be compiled as a C file rather than C++, and he made a mistake when he named the file foo.C. -- Greg

On Fri, 23 Apr 2010 12:33:10 +1200 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Given that the argument for making the test case-insensitive to start with was that the user expects .JPG and .jpg to be the same, I think you just shot down the case for making the test case-insensitive as well. <mike -- Mike Meyer <mwm@mired.org>

Mike Meyer wrote:
I only said it should be case-insensitive as a *default*. The .c/.C thing is a special case, which the application can deal with by explicitly requesting a case-sensitive comparison. Most of the time you'll be dealing with things like .jpg/.JPG, where I don't think any harm could be caused by treating the extensions case-insensitively always. I still don't think you can decide based on the file system behaviour. What if the user creates his files on a case-insensitive system and then copies them to a case-sensitive one or vice versa? The intended meaning of the extensions doesn't suddenly change when that happens. -- Greg

Participants (8):
- Bruce Leban
- Fred Drake
- Greg Ewing
- Jim Jewett
- Lie Ryan
- Mike Meyer
- Paul Moore
- Tarek Ziadé