(Wow, what a rambling message. I'm not sure which part you hope to see addressed.)

On Tue, Dec 22, 2015 at 1:54 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido@python.org> wrote:

>The UNIX find tool has many, many options.


I think a Pythonicized, stripped-down version of the basic design of fts (http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're going to get. After all, fts was designed to make it as easy as possible to implement find efficiently.

The docs make no attempt at showing the common patterns. The API described looks horribly complex (I guess that's what you get when all that matters is efficient implementation).
 
In my incomplete Python wrapper around fts, the simplest use looks like:

    with fts(root) as f:
        for path in f:
            do_stuff(path)

No two-level iteration, no need to join the root to the paths, no handling dirs and files separately.

The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).
 


Of course for that basic use case, you could just write your own wrapper around os.walk:

    def flatwalk(*args, **kwargs):
        # The outer loop (os.walk) must come first so that root and files are bound.
        return (os.path.join(root, file)
                for root, dirs, files in os.walk(*args, **kwargs) for file in files)

But more complex uses build on fts pretty readably:

    # find "$@" -H -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
    yesterday = (datetime.now() - timedelta(days=1)).timestamp()  # st_mtime is a float timestamp
    with fts(top, stat=True, crossdev=False) as f:
        for path in f:
            if path.is_file and path.stat.st_mtime < yesterday and path.lower().endswith('.pyc'):
                do_stuff(path)

Why does this use a with *and* a for-loop? Is there some terribly important cleanup that needs to happen when the for-loop is aborted?

It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?)

Of course that can all be cleaned up easily enough -- it's a simple matter of API design.
 


When you actually need to go a directory at a time, like the spool directory size example in the stdlib, os.walk is arguably nicer, but fortunately os.walk already exists.

I've never seen that example. But just a few days ago I wrote a little bit of code where the os.walk() API came in handy:

    for root, dirs, files in os.walk(arg):
        print("Scanning %s (%d files):" % (root, len(files)))
        for file in files:
            process(os.path.join(root, file))

(The point is not that we have access to dirs separately, but that we have the directories filtered out of the count of files.)
 
The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler.

Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).
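
To make the duck-typing idea concrete, here is a rough sketch, not a proposed API. The names CachedPath and scan are made up for illustration. It leans on the fact that os.DirEntry.stat() caches its result on the entry, and it delegates everything else to a real pathlib.Path. It assumes a Python new enough that os.scandir() works as a context manager (3.6+).

```python
import os
import pathlib


class CachedPath:
    """Hypothetical Path-like wrapper around an os.DirEntry (illustration only)."""

    def __init__(self, entry):
        self._entry = entry
        self._path = pathlib.Path(entry.path)

    def stat(self):
        # os.DirEntry.stat() caches its result on the entry after the first call.
        return self._entry.stat()

    def is_dir(self):
        # Often answered from the directory listing itself, without a stat call.
        return self._entry.is_dir()

    def is_file(self):
        return self._entry.is_file()

    def __fspath__(self):
        return self._entry.path

    def __str__(self):
        return self._entry.path

    def __getattr__(self, name):
        # Anything not defined above (name, suffix, parent, ...) comes from Path.
        return getattr(self._path, name)


def scan(directory):
    """Yield a CachedPath for each entry directly inside directory."""
    with os.scandir(directory) as entries:
        for entry in entries:
            yield CachedPath(entry)
```

Because the caching lives in the wrapper and not in Path itself, pathlib's own no-caching design is left alone.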
 
Anyway, if you don't want either the efficiency or the simplicity, and just want an iterable of filenames or Paths, you might as well just use the wrapper around the existing os.walk that I wrote above. To make it work with Path objects:


    def flatpathwalk(root, *args, **kwargs):
        # pathlib.Path, not path.Path: there is no stdlib module named "path".
        return map(pathlib.Path, flatwalk(str(root), *args, **kwargs))

And then to use those Path objects:

    matches = (path for path in flatpathwalk(root) if pattern.match(str(path)))

> For the general case it's probably easier to use os.walk(). But there are probably some
> common uses that deserve better direct support in e.g. the glob module. Would just a way
> to recursively search for matches using e.g. "**.txt" be sufficient? If not, can you
> specify what else you'd like? (Just "find-like" is too vague.)
>
> --Guido (mobile)

pathlib already has a glob method, which handles '*/*.py' and even recursive '**/*.py' (and a match method to go with it). If that's sufficient, it's already there. Adding direct support for Path objects in the glob module would just be a second way to do the exact same thing. And honestly, if open, os.walk, etc. aren't going to work with Path objects, why should glob.glob?

Oh, I'd forgotten about pathlib.Path.rglob().
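
For reference, a minimal example of the recursive match that pathlib already offers. The helper name find_python_files is made up for illustration; rglob() and glob() are the real Path methods.

```python
import pathlib


def find_python_files(top):
    """Return every *.py file under top, at any depth, in sorted order."""
    top = pathlib.Path(top)
    # rglob("*.py") behaves like glob("**/*.py").
    return sorted(top.rglob("*.py"))
```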

Maybe the OP also didn't know about it? He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure.
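
By contrast, os.walk() lets the caller prune a subtree before it is ever visited, which is exactly what a regex filter applied to the results cannot do. A small sketch (the function name walk_files_pruned and the skip_dirs parameter are made up for illustration):

```python
import os


def walk_files_pruned(top, skip_dirs=(".git",)):
    """Yield file paths under top without ever descending into skip_dirs."""
    for root, dirs, files in os.walk(top):
        # In top-down mode, editing dirs in place tells os.walk() which
        # subdirectories to skip, so the .git tree is never traversed.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            yield os.path.join(root, name)
```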

He also insisted on staying within the Path framework, which is an indication that maybe what we're really looking for here is the hybrid of walk/scandir/Path that I was trying to allude to above.
 
* Honestly, I think the problem here is that the pathlib module is just not useful. In a new language that used path objects--or, probably, URL objects--everywhere, it would be hard to design something better than pathlib, but as it is, while it's great for making really hairy path manipulation more readable, path manipulation never _gets_ really hairy, and os.path is already very well designed, and the fact that pathlib doesn't know how to interact with anything else in the stdlib or third-party code means that the wrapper stuff that constructs a Path on one end and calls str or bytes on the other end depending on which one you originally had adds as much complexity as you saved. But that's obviously off-topic here.

Seems the OP disagrees with you here -- he really wants to use pathlib (as was clear from his response to a suggestion to use fnmatch).

Truly pushing for adoption of a new abstraction like this takes many years -- pathlib was new (and provisional) in 3.4 so it really hasn't been long enough to give up on it. The OP hasn't!

So, perhaps the pathlib.Path class needs to have some way to take in a DirEntry produced by os.scandir() and a flag to allow it to cache stat() results? Then we could easily write a pathlib.walk() function that's like os.walk() but returning caching Path objects.
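
A very rough sketch of the shape such a function might take. The name pathwalk is hypothetical, and since Path cannot take a DirEntry today, this version simply hands back the os.DirEntry objects themselves (which already cache stat()) as stand-ins for the caching Path objects. It assumes Python 3.6+ for os.scandir() as a context manager.

```python
import os
import pathlib


def pathwalk(top):
    """Hypothetical walk: yield (Path, dir_entries, file_entries) per directory.

    The entries are os.DirEntry objects, standing in for caching Path objects.
    """
    dirs, files = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry)
            else:
                files.append(entry)
    yield pathlib.Path(top), dirs, files
    # As with os.walk(), the caller may prune dirs in place before we recurse.
    for entry in dirs:
        yield from pathwalk(entry.path)
```

Usage would look much like os.walk(), e.g. summing entry.stat().st_size over the file entries of each directory without a second round of stat calls on platforms where scandir() already supplies the data.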

--
--Guido van Rossum (python.org/~guido)