Allow manual creation of DirEntry objects

Hi,

I have been using the 'scandir' package (https://github.com/benhoyt/scandir) for a while now to speed up some directory tree processing code. Since Python 3.5 now includes 'os.scandir' in the stdlib (https://www.python.org/dev/peps/pep-0471/), I decided to try to make my code work with the built-in version if available.

The first issue I hit was that the 'DirEntry' class was not actually being exposed (http://bugs.python.org/issue27038). However, in the discussion of that bug I noticed that the constructor for the 'DirEntry' class was deliberately being left undocumented and that there was no clear way to manually create a DirEntry object from a path. I brought up my objections to this decision in the bug tracker and was asked to have the discussion over here on python-ideas.

I have a bunch of functions that operate on DirEntry objects, typically doing some sort of filtering to select the paths I actually want to process. The overwhelming majority of the time these functions are going to be operating on DirEntry objects produced by the scandir function, but there are some cases where the user will be supplying the path themselves (for example, the root of a directory tree to process). In my current code base that uses the scandir package I just wrap these paths in a 'GenericDirEntry' object and then pass them through the filter functions the same as any results coming from the scandir function.

With the decision to not expose any method in the stdlib to manually create a DirEntry object, I am stuck with no good options. The least bad option I guess would be to copy the GenericDirEntry code out of the scandir package into my own code base. This seems rather silly.

I really don't understand the rationale for not giving users a way to create these objects themselves, and I haven't actually seen that explained anywhere. I guess people are unhappy with the overlap between pathlib.Path objects and DirEntry objects and this is a misguided attempt to prod people into using pathlib. I think a better approach is to document the differences between DirEntry and pathlib.Path objects and encourage users to default to using pathlib.Path unless they have good reasons for using DirEntry.

Thanks,
Brendan
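The use case described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the scandir package's GenericDirEntry: the class name `PathEntry` and the filter `is_visible_dir` are hypothetical, and the wrapper provides only the few DirEntry-like attributes that the filter needs.

```python
import os
import tempfile


class PathEntry:
    """Hypothetical stand-in that mimics a small part of the DirEntry interface."""

    def __init__(self, path):
        self.path = path
        self.name = os.path.basename(path)

    def is_dir(self):
        # Unlike entries from os.scandir(), this costs a syscall on every call.
        return os.path.isdir(self.path)


def is_visible_dir(entry):
    """Hypothetical filter: accepts anything with DirEntry-like attributes."""
    return entry.is_dir() and not entry.name.startswith(".")


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "data"))
    os.mkdir(os.path.join(root, ".cache"))

    # Entries produced by os.scandir() and the user-supplied root
    # both go through the same filter function.
    with os.scandir(root) as it:
        selected = sorted(e.name for e in it if is_visible_dir(e))
    root_ok = is_visible_dir(PathEntry(root))
```

The filter relies on duck typing only, which is why a hand-written wrapper works today; a public DirEntry constructor would make the wrapper unnecessary.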

It sounds fine to just submit a patch to add and document the DirEntry constructor. I don't think anyone intended to disallow your use case, it's more likely that nobody thought of it. On Tue, Aug 16, 2016 at 12:35 PM, Brendan Moloney <moloney@ohsu.edu> wrote:
-- --Guido van Rossum (python.org/~guido)

2016-08-16 23:13 GMT+02:00 Guido van Rossum <guido@python.org>:
Currently, the DirEntry constructor expects data which comes from the opendir/readdir functions on UNIX/BSD or the FindFirstFile/FindNextFile functions on Windows. These functions are not exposed in Python, so it's unlikely that you can get the expected values. The DirEntry object was created to avoid syscalls in the common case thanks to the data provided by these functions.

But I guess that Brendan wants to create a DirEntry object which would call os.stat() the first time that an attribute is read and then benefit from the code. You lose the "no syscall" optimization, since at least one syscall is needed.

In this case, I guess that the constructor should be DirEntry(directory, entry_name) where os.path.join(directory, entry_name) is the full path.

An issue is how to document the behaviour of DirEntry. Objects created by os.scandir() would be "optimized", whereas objects created manually would be "less optimized".

DirEntry is designed for os.scandir(); it's very limited compared to pathlib. IMO pathlib would be a better candidate for "cached os.stat results" with a full API to access the file system.

Victor
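A rough sketch of the "less optimized" object described above might look like this. It is a pure-Python illustration, not the real os.DirEntry: the class name `LazyDirEntry` and the `stat_calls` counter are hypothetical, the (directory, entry_name) signature is taken from the suggestion above, and follow_symlinks handling is omitted for brevity.

```python
import os
import stat
import tempfile


class LazyDirEntry:
    """Hypothetical Python-level entry; not the real os.DirEntry."""

    def __init__(self, directory, entry_name):
        self.name = entry_name
        self.path = os.path.join(directory, entry_name)
        self._stat = None    # filled in on first use
        self.stat_calls = 0  # only here to make the caching visible

    def stat(self):
        # The first call pays for one os.stat(); later calls reuse the result.
        if self._stat is None:
            self.stat_calls += 1
            self._stat = os.stat(self.path)
        return self._stat

    def is_dir(self):
        return stat.S_ISDIR(self.stat().st_mode)

    def is_file(self):
        return stat.S_ISREG(self.stat().st_mode)


with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, "notes.txt"), "w") as f:
        f.write("hello")

    entry = LazyDirEntry(directory, "notes.txt")
    calls_before = entry.stat_calls
    results = (entry.is_file(), entry.is_dir(), entry.stat().st_size)
    calls_after = entry.stat_calls
```

Three attribute reads cost a single os.stat() call here, which is the trade-off being discussed: one syscall instead of none, but still fewer than calling os.stat() each time.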

By the way, for all these reasons, I'm not really excited by the Python 3.6 change exposing os.DirEntry (https://bugs.python.org/issue27038).

Victor

2016-08-17 1:11 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:

On Tue, Aug 16, 2016 at 4:14 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
But that's separate from the constructor. We could expose the class with a constructor that always fails (the C code could construct instances through a backdoor). Exposing the type is useful for type annotations, e.g.

    def is_foobar(de: os.DirEntry) -> bool: ...

and for the occasional isinstance() check.

Also, what does the scandir package mentioned by the OP use as the constructor signature?

--
--Guido van Rossum (python.org/~guido)
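As a small illustration of the two uses mentioned above, the sketch below annotates a predicate with os.DirEntry and runs an isinstance() check on entries returned by os.scandir(). It assumes Python 3.6 or later, where os.DirEntry is exposed; the body of `is_foobar` is a made-up rule for demonstration only.

```python
import os
import tempfile


def is_foobar(de: os.DirEntry) -> bool:
    # Hypothetical predicate; the rule itself is only an illustration.
    return de.is_file() and de.name.startswith("foobar")


with tempfile.TemporaryDirectory() as root:
    for name in ("foobar.txt", "other.txt"):
        open(os.path.join(root, name), "w").close()

    with os.scandir(root) as it:
        entries = list(it)

    # Every object yielded by os.scandir() is an os.DirEntry instance.
    all_dir_entries = all(isinstance(e, os.DirEntry) for e in entries)
    matches = sorted(e.name for e in entries if is_foobar(e))
```

Neither use requires a working constructor, which is the distinction being drawn between exposing the type and allowing manual creation.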

2016-08-17 1:50 GMT+02:00 Guido van Rossum <guido@python.org>:
Oh, in fact you cannot create an instance of os.DirEntry, it has no (Python) constructor:

    $ ./python
    Python 3.6.0a4+ (default:e615718a6455+, Aug 17 2016, 00:12:17)
Only os.scandir() can produce such objects. The question is still if it makes sense to allow to create DirEntry objects in Python :-)
Also, what does the scandir package mentioned by the OP use as the constructor signature?
The implementation of os.scandir() comes from the scandir package. It contains the same code, and so has the same behaviour (DirEntry has no constructor). Victor
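The behaviour described above can be checked with a short snippet. This is not the interpreter session from the message; it is a separate sketch that assumes a CPython version (3.6 or later) where os.DirEntry is exposed but still has no public constructor.

```python
import os

# Attempting to instantiate os.DirEntry directly is expected to fail,
# since only os.scandir() can produce such objects.
try:
    os.DirEntry()
except TypeError as exc:
    construction_error = exc
else:
    construction_error = None
```

If a constructor were added in some future release, `construction_error` would stay `None` for valid arguments, so this check also documents the status quo being debated.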

On 17 August 2016 at 09:56, Victor Stinner <victor.stinner@gmail.com> wrote:
I think it does, as it isn't really any different from someone calling the stat() method on a DirEntry instance created by os.scandir(). It also prevents folks attempting things like:

    def slow_constructor(dirname, entryname):
        for entry in os.scandir(dirname):
            if entry.name == entryname:
                entry.stat()
                return entry

Allowing DirEntry construction from Python further gives us a straightforward answer to the "stat caching" question: "just use os.DirEntry instances and call stat() to make the snapshot".

If folks ask why os.DirEntry caches results when pathlib.Path doesn't, we have the answer that cache invalidation is a hard problem, and hence we consider it useful in the lower level interface that is optimised for speed, but problematic in the higher level one that is more focused on cross-platform correctness of filesystem interactions.

I don't know whether it would make sense to allow a pre-existing stat result to be passed to DirEntry, but it does seem like it might be useful for adapting existing stat-based backend APIs to a more user friendly DirEntry based front end API.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
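The "snapshot" behaviour mentioned above can be observed with entries that os.scandir() already produces. The sketch below is an illustration only; the file name and contents are made up, and it shows that DirEntry.stat() keeps returning its cached result after the file changes, while a fresh os.stat() call sees the new size.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "log.txt")
    with open(path, "w") as f:
        f.write("12345")

    with os.scandir(root) as it:
        entry = next(it)

    snapshot_size = entry.stat().st_size  # result is cached on the entry

    with open(path, "a") as f:
        f.write("67890")  # the file grows after the snapshot was taken

    cached_size = entry.stat().st_size  # still the cached value
    fresh_size = os.stat(path).st_size  # a new syscall sees the change
```

This is the cache invalidation trade-off in miniature: the entry is fast and stable, but it can go stale, which is why the caching lives in the lower level interface rather than in pathlib.Path.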

Thanks, opened an issue here: http://bugs.python.org/issue27796

-Brendan

________________________________
From: gvanrossum@gmail.com [gvanrossum@gmail.com] on behalf of Guido van Rossum [guido@python.org]
Sent: Wednesday, August 17, 2016 7:20 AM
To: Nick Coghlan; Brendan Moloney
Cc: Victor Stinner; python-ideas@python.org
Subject: Re: [Python-ideas] Allow manual creation of DirEntry objects

Brendan,

The conclusion is that you should just file a bug asking for a working constructor -- or upload a patch if you want to.

--Guido

participants (6)
- Brendan Moloney
- Brett Cannon
- Guido van Rossum
- Nick Coghlan
- Serhiy Storchaka
- Victor Stinner