
On 11 January 2016 at 18:57, Gregory P. Smith <greg@krypto.org> wrote:
> On Wed, Jan 6, 2016 at 3:05 PM Brendan Moloney <moloney@ohsu.edu> wrote:
>> It's important to keep in mind the main benefit of scandir is you don't have to do ANY stat call in many cases, because the directory listing provides some subset of this info. On Linux you can at least tell if a path is a file or directory. On Windows there is much more info provided by the directory listing. Avoiding subsequent stat calls is also nice, but not nearly as important due to OS level caching.
> +1 - this was one of the two primary motivations behind scandir. Anything trying to reimplement a filesystem tree walker without using scandir is going to have sub-standard performance.
> If we ever offer anything with "find like functionality" related to pathlib, it needs to be based on scandir. Anything else would just be repeating the convenient but untrue limiting assumptions of os.listdir: that the contents of a directory can be loaded into memory, and that we don't mind re-querying the OS for stat information that it already gave us but we threw away as part of reading the directory.
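
To make the point about avoided stat calls concrete, here is a minimal sketch of splitting one directory's entries into subdirectories and files with os.scandir. The helper name split_entries is hypothetical and only for illustration; os.scandir and DirEntry.is_dir are the real standard-library calls.

```python
import os


def split_entries(directory):
    """Return (dirs, files) as sorted lists of names for one directory.

    split_entries is a hypothetical helper name used for illustration.
    """
    dirs, files = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            # is_dir() can usually be answered from the type information
            # returned with the directory listing, so in most cases no
            # separate stat call is made for the entry.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    return sorted(dirs), sorted(files)
```

The os.listdir equivalent would need an os.path.isdir (and therefore a stat) for every name returned.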
This is very much why I feel that we need something in pathlib. I understand the motivation for not caching stat information in path objects. And I don't have a viable design for how a "find-like functionality" API should be implemented in pathlib. But as it stands, I feel as though using pathlib for anything that does bulk filesystem scans is deliberately choosing something that I know won't scale well. So (in my mind) pathlib doesn't fulfil the role of "one obvious way to do things". Which is a shame, because Path.rglob is very often far closer to what I need in my programs than os.walk (even when it's just rootpath.rglob('*')).

In practice, by far the most common need I have[1] for filetree walking is to get back a list of all the names of files starting at a particular directory, with the returned filenames *relative to the given root*. Path.rglob gives absolute pathnames. os.walk gives the absolute directory name and the base filename. Neither is what I want, although obviously in both cases it's pretty trivial to extract the "relative to the root" part from the returned data. But an API that gave that information directly, with scandir-level speed and scalability, in the form of pathlib.Path relative path objects, would be ideal for me[2].

Paul

[1] And yes, I know this means I should just write a utility function for it :-)
[2] The feature creep starts when people want to control things like pruning particular directories such as '.git', or only matching particular glob patterns, or choosing whether or not to include directories in the output, or... Adding *those* features without ending up with a Frankenstein's monster of an API is the challenge :-)
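
For what it's worth, a rough sketch of the kind of utility function described above might look like this. The name iter_relative_files and its prune parameter are hypothetical, not an existing or proposed pathlib API; the sketch only assumes os.scandir and pathlib.Path from the standard library.

```python
import os
from pathlib import Path


def iter_relative_files(root, prune=()):
    """Yield files under root as pathlib.Path objects relative to root.

    Hypothetical helper for illustration. Directory names listed in
    `prune` (for example '.git') are not descended into.
    """
    root = os.fspath(root)
    stack = [""]  # directories still to scan, stored relative to root
    while stack:
        rel_dir = stack.pop()
        with os.scandir(os.path.join(root, rel_dir)) as entries:
            for entry in entries:
                rel = os.path.join(rel_dir, entry.name)
                # Symlinks are not followed, so a symlink to a directory
                # is yielded like a file instead of being scanned.
                if entry.is_dir(follow_symlinks=False):
                    if entry.name not in prune:
                        stack.append(rel)
                else:
                    yield Path(rel)
```

Usage would be along the lines of `for p in iter_relative_files(".", prune={".git"}): print(p)`. Even this small version already carries one of the options mentioned in [2], which is roughly how the feature creep begins.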