[Python-ideas] find-like functionality in pathlib

Andrew Barnert abarnert at yahoo.com
Tue Dec 22 22:05:57 EST 2015


On Dec 22, 2015, at 16:23, Guido van Rossum <guido at python.org> wrote:
> 
> (Wow, what a rambling message. I'm not sure which part you hope to see addressed.)

I don't know that anything actually needs to be addressed here at all. Struggling to see the real problem that needs to be solved means a bit of guesswork about what's relevant to the solution...

>> On Tue, Dec 22, 2015 at 1:54 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido at python.org> wrote:
>> 
>> >The UNIX find tool has many, many options.
>> 
>> 
>> I think a Pythonicized, stripped-down version of the basic design of fts (http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're going to get. After all, fts was designed to make it as easy as possible to implement find efficiently.
> 
> The docs make no attempt at showing the common patterns. The API described looks horribly complex (I guess that's what you get when all that matters is efficient implementation).

Yes, that's why I gave a few examples, using my stripped-down and Pythonicized wrapper, so you don't have to work it all out from scratch by trying to read the manpage and guess how you'd use it in C. But the point is, that's what something as flexible as find looks like as a function.

> The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).

Yes--as I said below, sometimes you really do want to go a directory at a time, and for that it's hard to beat the API of os.walk. But when it isn't needed, it makes the code look more complicated than it has to be, so a flat iteration can be nicer. And, significantly, that and the need to join paths all over the place are the only things I can imagine people finding worth "solving" about os.walk's API.
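
To make that concrete, here's a minimal sketch of the kind of flat iteration I mean, built on plain os.walk (walk_paths is a name made up for illustration, not an existing API, and top/do_stuff are placeholders as in the earlier example):

    import os

    def walk_paths(top):
        """Yield everything under top as one flat stream of already-joined paths."""
        for root, dirs, files in os.walk(top):
            for name in dirs + files:
                yield os.path.join(root, name)

    for path in walk_paths(top):
        do_stuff(path)

No per-directory bookkeeping, and no joining at the call site.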

>> But more complex uses build on fts pretty readably:
>> 
>>     # find "$@" -H -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
>>     yesterday = datetime.now() - timedelta(days=1)
>>     with fts(top, stat=True, crossdev=False) as f:
>>         for path in f:
>>             if path.is_file and path.stat.st_mtime < yesterday.timestamp() and path.lower().endswith('.pyc'):
>>                 do_stuff(path)
> 
> Why does this use a with *and* a for-loop? Is there some terribly important cleanup that needs to happen when the for-loop is aborted?

Same reason this code uses with and a for loop:

    with open(path) as f:
        for line in f:
            do_stuff(line)

Cleaning up a file handle isn't _terribly_ important, but it's not _unimportant_, and isn't it generally a good habit?

> It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?)

Explaining the details of the API design takes this even farther off-topic, but: my initial design was built around a Path class that (unlike the stdlib's pathlib.Path) is a subclass of str, adding attributes/properties for things that are immediately available and methods for things that aren't. (The names are de-abbreviated versions of the C names.) As for stat, for one thing, people already have code (and mental models) to deal with stat (named)tuples. Plus, if you request a fast walk without stat information (which often goes considerably faster than scandir--I've got a Python tool that actually _beats_ the find invocation it replaced), or the stat on a file fails, I think it's clearer to have "stat" be None than to have 11-18 arbitrary attributes be None while the rest are still there.
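
Purely for illustration, here's roughly the shape of it (FTSPath and its attribute names are hypothetical here, not the actual wrapper):

    import stat as statmod

    class FTSPath(str):
        """A str subclass carrying info the traversal has already gathered."""
        def __new__(cls, path, stat_result=None):
            self = super().__new__(cls, path)
            self.stat = stat_result   # None if stat was skipped or failed
            return self

        @property
        def is_file(self):
            return self.stat is not None and statmod.S_ISREG(self.stat.st_mode)

        @property
        def is_dir(self):
            return self.stat is not None and statmod.S_ISDIR(self.stat.st_mode)

Because it's a str, string methods like lower() and endswith(), and anything else expecting a plain path string, keep working unchanged.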

At any rate, I was planning to take another pass at the design after finishing the Windows and generic implementations, but the project I was working on turned out to need this only for OS X, so I never got to that point.

> Of course that can all be cleaned up easily enough -- it's a simple matter of API design.
>  
>> When you actually need to go a directory at a time, like the spool directory size example in the stdlib, os.walk is arguably nicer, but fortunately os.walk already exists.
> 
> I've never seen that example.

The first example under os.walk in the library docs is identical to the wiki spool example, except the first line points at subpackages of the stdlib email package instead of the top email spool directory, and an extra little bit was added at the end:

    import os
    from os.path import join, getsize

    for root, dirs, files in os.walk('python/Lib/email'):
        print(root, "consumes", end=" ")
        print(sum(getsize(join(root, name)) for name in files), end=" ")
        print("bytes in", len(files), "non-directory files")
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories

So, take that instead. It's a perfectly good example. And while you could write it with a flat iterator in a number of ways, none are going to be as simple as the two-level version (one possible flat rewrite is sketched below).
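
For comparison, here's one possible flat rewrite (a sketch only; all_files is a made-up helper). It needs groupby machinery just to get the per-directory totals back, and it still doesn't handle the CVS pruning or report directories that contain no files at all:

    import os
    from os.path import join, getsize, dirname
    from itertools import groupby

    def all_files(top):
        # os.walk emits one directory at a time, so the flat stream of file
        # paths comes out already grouped by directory, which groupby relies on
        for root, dirs, files in os.walk(top):
            for name in files:
                yield join(root, name)

    for root, paths in groupby(all_files('python/Lib/email'), key=dirname):
        paths = list(paths)
        print(root, "consumes",
              sum(getsize(p) for p in paths),
              "bytes in", len(paths), "non-directory files")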
            
>> The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler.
> 
> Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall() is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).

The question is what code that uses (duck-typed) Path objects expects. I'm pretty sure there was extensive discussion of why Paths should never cache during the PEP 428 discussions, and I vaguely remember both Antoine Pitrou and Nick Coghlan giving good summaries more recently, but I don't remember enough details to say whether a duck-typed Path-like object would be just as bad. But I'm guessing it could have the same problems--if some function takes a Path object, stores it for later, and expects to use it to get live info, handing it something that quacks like a Path but returns snapshot info instead would be pretty insidious.
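
For what it's worth, os.scandir's DirEntry objects already behave like that kind of snapshot, which is a decent illustration of the hazard:

    import os

    entry = next(os.scandir('.'))
    print(entry.name, entry.is_file())   # may be answered from data gathered
                                         # during the directory scan
    print(entry.stat().st_size)          # the stat() result is cached on the
                                         # entry, so asking again later can
                                         # return stale information

Code that stores one of these and expects live answers later gets quietly misled, and a cached Path look-alike would invite exactly the same mistake.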

>> > ... But there are probably some
>> > common uses that deserve better direct support in e.g. the glob module. Would just a way
>> > to recursively search for matches using e.g. "**.txt" be sufficient? If not, can you
>> > specify what else you'd like? (Just "find-like" is too vague.) --Guido (mobile)
>> 
>> pathlib already has a glob method, which handles '*/*.py' and even recursive '**/*.py' (and a match method to go with it). If that's sufficient, it's already there. Adding direct support for Path objects in the glob module would just be a second way to do the exact same thing. And honestly, if open, os.walk, etc. aren't going to work with Path objects, why should glob.glob?
> 
> Oh, I'd forgotten about pathlib.Path.rglob().

Or just Path.glob with ** in the pattern.
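
That is, these two spell the same recursive search:

    from pathlib import Path

    list(Path('python/Lib/email').rglob('*.py'))
    list(Path('python/Lib/email').glob('**/*.py'))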

> Maybe the OP also didn't know about it?

So, did Antoine Pitrou already solve this problem 3 years ago (or Jason Orendorff many years before that), possibly barring a minor docs tweak, or is there still something to consider here?

> He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure.

I agree with everything here. I believe Path.glob can do everything he needs, and what he asked for instead couldn't do any more.

It's dead-easy to imperatively apply a regex to decide whether to prune each dir in walk (or fts). Or to do the same to the joined path or the abspath. Or to use fnmatch instead of regex, or an arbitrary predicate function. Or to reverse the sense to mean only recurse on these instead of skip these. Imagine what a declarative API that allowed all that would look like. Even find doesn't have any of those options (at least not portably), and most people have to read guides to the manpage before they can read the manpage.
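
For instance, something like this (just a sketch; the pattern, top, and do_stuff are placeholders) prunes any directory matching a regex as it walks:

    import os
    import re

    prune = re.compile(r'^\.git$')           # whatever you want to skip
    for root, dirs, files in os.walk(top):
        # editing dirs in place stops os.walk from descending into them
        dirs[:] = [d for d in dirs if not prune.match(d)]
        for name in files:
            do_stuff(os.path.join(root, name))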

At any rate, there's no reason you couldn't add some regex methods to Path and/or special Path handling code to regex to make that imperative code slightly easier, but I don't see how "pattern.match(str(path))" is any worse than "os.scandir(str(path))" or "json.load(open(str(path)))" or any of the zillion other places where you have to convert paths to strings explicitly, or what makes regex more inherently path-related than those things.

