Object interface to path names

I've recently been developing a tool to track changes to a fairly large structured file system, and in the process got to thinking about how working with path names could be improved. The problems I've had with using just os and os.path have led to three objectives for any new implementation:

1. Cleaner handling of path names, specifically constructing path names without the need for a lot of nested os.path.join() and os.path.split() calls.
2. Allow validation of path names based on predefined rules (although this requirement might be very specific to my use case).
3. Allow caching of file attribute data so that queries do not have to wait for the disk or network to respond (although at the cost of accuracy).

The first can be met with behaviour as follows, basically handling paths as containers of sub-paths:
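A minimal sketch of what container-style behaviour could look like. The class name Path comes from the message; the indexing and iteration semantics shown here are assumptions for illustration, not the poster's actual design.

```python
import os


class Path:
    """Hypothetical path object that behaves like a container of sub-paths."""

    def __init__(self, *parts):
        # Join the parts once, so callers never nest os.path.join() themselves.
        self._path = os.path.join(*parts) if parts else os.curdir

    def __getitem__(self, name):
        # Indexing with a child name yields a new Path one level down.
        return Path(self._path, name)

    def __iter__(self):
        # Iterating over a directory yields its children as Path objects.
        for name in sorted(os.listdir(self._path)):
            yield self[name]

    def __str__(self):
        return self._path


root = Path("project")
config = root["etc"]["app.conf"]
print(config)  # same string as os.path.join("project", "etc", "app.conf")
```

Building a path this way never touches the disk; only iteration calls os.listdir().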
The second can be met by allowing Path to be subclassed and defining factory functions for the creation of sub-paths:
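One way this might be sketched is with a child() factory method that subclasses override. The method name child(), the classes below and the "releases/<version>/<file>" layout are all hypothetical, chosen only to illustrate rule-based validation.

```python
import os


class Path:
    """Hypothetical base class; child() is the factory hook for sub-paths."""

    def __init__(self, path):
        self.path = path

    def child(self, name):
        # Default factory: children have the same type as their parent.
        return type(self)(os.path.join(self.path, name))

    def __getitem__(self, name):
        return self.child(name)


class ReleaseFile(Path):
    """Leaf of the hypothetical 'releases/<version>/<file>' layout."""


class VersionDir(Path):
    def child(self, name):
        # Everything inside a version directory is treated as a release file.
        return ReleaseFile(os.path.join(self.path, name))


class ReleasesDir(Path):
    def child(self, name):
        # Validation rule (an assumption for this sketch): version
        # directories must be named like "1.2" or "10.0.3".
        if not all(part.isdigit() for part in name.split(".")):
            raise ValueError("invalid version directory: %r" % name)
        return VersionDir(os.path.join(self.path, name))
```

With these definitions, ReleasesDir("releases")["1.2"]["notes.txt"] is a ReleaseFile, while ReleasesDir("releases")["beta"] raises ValueError before any disk access.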
The third can be met by allowing all disk calls to be asynchronous:
This could all be implemented in pure Python using os and os.path, and threading for asynchronous calls. I haven't yet thought through a complete specification for Path, but I imagine it would need to contain functions such as exists(), isfile(), isdir(), stat(), walk(), and allow iterator access to children. Does anyone else see a usefulness for this? Regards David

There are existing implementations of this sort of OOP approach, and I think it'd be worth trying to adopt one of them instead of creating a new ad-hoc interface. Here are the two that come to mind: http://twistedmatrix.com/documents/10.1.0/api/twisted.python.filepath.FilePa... https://bitbucket.org/birkenfeld/sphinx/src/cf794ec8a096/tests/path.py I support including such an API (+1), but I don't believe you mentioned my preferred use case: having an object-oriented / polymorphic file API means that one can provide file objects that aren't backed by the filesystem, but look the same. This is cool in that you can do things like treat zip files as directories, or mock out the filesystem for a unit test, without worrying about monkeypatching builtins or using a nonstandard wrapper API, etc. Devin On Tue, Sep 13, 2011 at 11:03 AM, David Townshend <aquavitae69@gmail.com> wrote:

On Tue, Sep 13, 2011 at 17:03, David Townshend <aquavitae69@gmail.com> wrote:
Does anyone else see a usefulness for this?
Yes. IIRC there are a number of implementations on PyPI, and it probably has come up on this mailing list before (search the archives for the outcome!). http://pypi.python.org/pypi/Unipath/0.2.1 http://pypi.python.org/pypi/fpath/0.6 http://pypi.python.org/pypi/forked-path/0.2 http://pypi.python.org/pypi/path.py Cheers, Dirkjan

But none of these seem to allow asynchronous calls, which make a huge difference when dealing with a large structure. Nothing I've found really does what I need, and I'm trying to keep my dependency list short, so I'll end up writing something myself anyway, but my real question was whether this is something that could usefully be included in the stdlib.
Having an object-oriented / polymorphic file API means that one can
Great use case! And using factories to create the objects would make this especially powerful.

On Wed, Sep 14, 2011 at 1:33 AM, David Townshend <aquavitae69@gmail.com> wrote:
The thing is, the "smart path" abstraction level isn't adequate for that task - you need to do more to ensure various primitives (like listing directory contents) are handled properly, along the lines of what PyFileSystem provides (https://code.google.com/p/pyfilesystem/). As far as a smart path object goes, the previous major effort on this front was PEP 355, which focused on the API offered by Jason Orendorff's path module. While the PEP was ultimately rejected due to the "one class to rule them all" nature of that particular interface, it's still an excellent reference on why improving the standard library's filesystem abstraction is an area worth exploring further. FWIW, PyFileSystem is the only package I've seen that I think comes close to getting the abstraction level right (I've never actually needed to use it for anything though). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

That library, pyfilesystem, has some serious firepower. On Sep 14, 2011 1:33 AM, "David Townshend" <aquavitae69@gmail.com> wrote:
But none of these seem to allow asynchronous calls, which make a huge difference when dealing with a large structure. Nothing I've found really does what I need and I'm trying to keep my dependency list short so I'll end up writing something myself anyway, but my real question was whether this is something that could usefully be included in the stdlib.
make this especially powerful.

Thanks for the pointer - PyFilesystem looks perfect for my requirements! I've had a look at PEP 355 and can see the drawbacks to the proposed implementation (especially subclassing str), but I'm not clear on what the problems are with the concept. Would something like PyFilesystem be any more palatable? On Wed, Sep 14, 2011 at 1:22 AM, Matt Joiner <anacrolix@gmail.com> wrote:

On Wed, Sep 14, 2011 at 3:49 PM, David Townshend <aquavitae69@gmail.com> wrote:
There's nothing wrong with the concept of a more object-oriented interface to the filesystem - PEP 355 was rejected due to the specifics of the proposed API rather than the idea in general being unacceptable. However, designing a nice OO filesystem API, getting feedback on it, getting it to a point where it is evolving slowly enough to be suitable for inclusion in the stdlib, then getting it through the gauntlet of python-dev's design and implementation critique for standard library inclusion isn't exactly a task for the faint-hearted :)
Would something like PyFilesystem be any more palatable?
I don't know the PyFilesystem API well enough to really say. However, from my quick review of the docs, it has potential. Its API-oriented approach definitely aligns well with the role of the standard library in other areas (such as the file-like API itself, as well as more formal interfaces like the DB API and the crypto component APIs). I've cc'ed Ryan on a couple of messages in this thread, so hopefully he'll be inclined to chime in. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 14/09/11 03:03, David Townshend wrote:
You're mixing up two completely different concepts here. Caching has nothing to do with asynchronous calls; it's storing the result so that you don't have to wait the *next* time you want the information. Both could be useful, but only for certain applications, and they should both be off by default in any general-purpose implementation. -- Greg

On Thu, Sep 15, 2011 at 5:55 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I'm not sure if I'm mixing up concepts or terminology. My meaning is, for example, if a method Path.files() is used to obtain a list of files in a directory, it would call os.listdir() in another thread and store the result to a cache. At the same time, the current contents of the cache are returned by Path.files(). Most of the time, the cache would be written to after its contents are returned by Path.files(), so the actual value returned would be inaccurate, but would be more accurate on the next call. To me, this means caching the result of an asynchronous call to os.listdir(). I suspect, however, that I'm not using the term "asynchronous" as it normally refers to disk operations, so that's probably the confusion.
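The behaviour described above (return the cached value at once, refresh it in another thread) might be sketched roughly as follows. The class name CachedListing and the wait() helper are hypothetical; only os.listdir() and the threading module are taken from the discussion.

```python
import os
import threading


class CachedListing:
    """Hypothetical helper: returns a possibly stale listing immediately."""

    def __init__(self, path):
        self.path = path
        self._cache = []               # last known directory contents
        self._lock = threading.Lock()
        self._worker = None

    def _refresh(self):
        names = sorted(os.listdir(self.path))  # the slow disk/network call
        with self._lock:
            self._cache = names

    def files(self):
        with self._lock:
            # Copy the current cache before any refresh can overwrite it.
            snapshot = list(self._cache)
            # Start a background refresh unless one is already running.
            if self._worker is None or not self._worker.is_alive():
                self._worker = threading.Thread(target=self._refresh, daemon=True)
                self._worker.start()
        return snapshot

    def wait(self):
        # Block until the pending refresh (if any) has finished.
        worker = self._worker
        if worker is not None:
            worker.join()
```

The first call to files() returns an empty list because nothing has been cached yet; later calls return whatever the most recent completed refresh found, which is the accuracy trade-off mentioned in the original proposal.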

David Townshend wrote:
I suspect, however, that I'm not using the term "asynchronous" as it normally refers to disk operations,
It sounds like you're using a rather application-dependent combination of asynchronous I/O and caching. The way asynchronous I/O is normally used is that you start the operation, go away and do something else, and either come back later to check whether it's finished or arrange some kind of callback when it finishes. Most applications would not be tolerant of inaccurate results. -- Greg
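For comparison, the conventional pattern described here (start the operation, do something else, then collect the result or use a callback) can be sketched with the standard library's concurrent.futures module. The helper name list_async is hypothetical, and a thread pool is only one of several ways to get this behaviour.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_async(executor, path):
    # Start the listing in a worker thread and return a Future at once.
    return executor.submit(os.listdir, path)


with ThreadPoolExecutor(max_workers=2) as executor:
    future = list_async(executor, os.curdir)

    # Option 1: arrange a callback for when the operation finishes.
    future.add_done_callback(lambda f: print("entries:", len(f.result())))

    # ... do other work here while the listing runs ...

    # Option 2: come back later and collect the result, blocking if needed.
    names = future.result()
```

Unlike the stale-cache approach, the caller here never sees an out-of-date value: future.result() returns only once os.listdir() has actually completed.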

Participants (6): David Townshend, Devin Jeanpierre, Dirkjan Ochtman, Greg Ewing, Matt Joiner, Nick Coghlan