find-like functionality in pathlib
What do you think about implementing functionality similar to the `find` utility in Linux in the pathlib module? I wanted this today: I had a script to write to archive a bunch of files from a folder, and I decided to try writing it in Python rather than in Bash. But I needed something stronger than `Path.glob` in order to select the files. I wanted a regular expression. (In this particular case, I wanted to get a list of all the files excluding the `.git` folder and all files inside of it.) Thanks, Ram.
In a message of Fri, 04 Dec 2015 21:00:47 +0200, Ram Rachum writes:
fnmatch https://docs.python.org/3.6/library/fnmatch.html wasn't sufficient for your needs? Laura
1. That would require going out of the pathlib framework. I can do that, but it's more of a mess because then I need to convert the results back to Path objects.
2. I'm not sure how I would use fnmatch, because I wouldn't want to give it the list of all files recursively, since that would be a long list of files (there are lots of files in the ".git" folder that I want to ignore). I want it to first ignore everything in the ".git" folder completely, without going over all the files, and then include all the other files recursively.
On Fri, Dec 4, 2015 at 9:04 PM, Laura Creighton <lac@openend.se> wrote:
On 04/12/15 19:08, Ram Rachum wrote:
Ram - os.walk() is probably the closest existing thing to what you want here (if it's called with topdown=True - the default - then you can remove the ".git" entry from the list of directories to prevent the walker from descending into that directory completely). I know: this is still stepping out of pathlib. However, it's probably what you want if you want to get something working soon ;) FWIW, this is not unrelated to my recent request for an os.walk() which returns the DirEntry objects - a thread that I am in the process of trying to summarise so that it doesn't drop off the RADAR (though it seems like this whole area is a can of worms ...). E
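The pruning trick described above can be sketched as follows. This is a minimal illustration, not an existing stdlib API: the helper name `iter_files` and its `skip_dirs` parameter are assumptions made for the example. It relies only on the documented behaviour that, with the default `topdown=True`, editing the `dirs` list in place controls which subdirectories `os.walk()` descends into.

```python
import os
from pathlib import Path


def iter_files(top, skip_dirs=(".git",)):
    """Yield a Path for every file under top, never descending into skip_dirs.

    iter_files and skip_dirs are hypothetical names used for illustration.
    """
    for root, dirs, files in os.walk(str(top)):  # topdown=True is the default
        # Editing dirs in place tells os.walk() which subdirectories to skip.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in files:
            # Convert back to Path so the caller stays in the pathlib world.
            yield Path(root, name)
```

Because the ".git" entry is removed before the walker recurses, nothing inside that directory is ever listed.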
Am 04.12.2015 um 20:00 schrieb Ram Rachum:
What do you think about implementing functionality similar to the `find` utility in Linux in the pathlib module? I wanted this today: I had a script to write to archive a bunch of files from a folder, and I decided to try writing it in Python rather than in Bash. But I needed something stronger than `Path.glob` in order to select the files. I wanted a regular expression. (In this particular case, I wanted to get a list of all the files excluding the `.git` folder and all files inside of it.)
Me, too. I miss a find-like method. I have used os.walk() for more than 10 years, but it still feels way too complicated. I asked about a library on softwarerecs some weeks ago: http://softwarerecs.stackexchange.com/questions/26296/python-library-for-tra... -- http://www.thomas-guettler.de/
The UNIX find tool has many, many options. For the general case it's probably easier to use os.walk(). But there are probably some common uses that deserve better direct support in e.g. the glob module. Would just a way to recursively search for matches using e.g. "**.txt" be sufficient? If not, can you specify what else you'd like? (Just "find-like" is too vague.) --Guido (mobile)
On Dec 22, 2015 11:14 AM, "Thomas Güttler" <guettliml@thomas-guettler.de> wrote:
On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido@python.org> wrote:
The UNIX find tool has many, many options.
I think a Pythonicized, stripped-down version of the basic design of fts (http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're going to get. After all, fts was designed to make it as easy as possible to implement find efficiently. In my incomplete Python wrapper around fts, the simplest use looks like:

    with fts(root) as f:
        for path in f:
            do_stuff(path)

No two-level iteration, no need to join the root to the paths, no handling dirs and files separately. Of course for that basic use case, you could just write your own wrapper around os.walk:

    def flatwalk(*args, **kwargs):
        return (os.path.join(root, file)
                for root, dirs, files in os.walk(*args, **kwargs)
                for file in files)

But more complex uses build on fts pretty readably:

    # find "$@" -H -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
    yesterday = (datetime.now() - timedelta(days=1)).timestamp()
    with fts(top, stat=True, crossdev=False) as f:
        for path in f:
            if path.is_file and path.stat.st_mtime < yesterday and path.lower().endswith('.pyc'):
                do_stuff(path)

When you actually need to go a directory at a time, like the spool directory size example in the stdlib, os.walk is arguably nicer, but fortunately os.walk already exists. The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler. Anyway, if you don't want either the efficiency or the simplicity, and just want an iterable of filenames or Paths, you might as well just use the wrapper around the existing os.walk that I wrote above.
To make it work with Path objects:

    def flatpathwalk(root, *args, **kwargs):
        return map(pathlib.Path, flatwalk(str(root), *args, **kwargs))

And then to use those Path objects:

    matches = (path for path in flatpathwalk(root) if pattern.match(str(path)))
pathlib already has a glob method, which handles '*/*.py' and even recursive '**/*.py' (and a match method to go with it). If that's sufficient, it's already there. Adding direct support for Path objects in the glob module would just be a second way to do the exact same thing. And honestly, if open, os.walk, etc. aren't going to work with Path objects, why should glob.glob? * Honestly, I think the problem here is that the pathlib module is just not useful. In a new language that used path objects--or, probably, URL objects--everywhere, it would be hard to design something better than pathlib, but as it is, while it's great for making really hairy path manipulation more readable, path manipulation never _gets_ really hairy, and os.path is already very well designed, and the fact that pathlib doesn't know how to interact with anything else in the stdlib or third-party code means that the wrapper stuff that constructs a Path on one end and calls str or bytes on the other end depending on which one you originally had adds as much complexity as you saved. But that's obviously off-topic here.
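As a rough sketch of the existing glob support mentioned above: `Path.rglob("*.py")` is the recursive form of `Path.glob("**/*.py")`. The helper names `python_sources` and `outside_git` are made up for this example. Note that the ".git" filter runs after the traversal, so the ".git" tree is still walked; this shows what glob already offers, not a way to prune.

```python
from pathlib import Path


def python_sources(root):
    """Return every *.py file under root, at any depth, as sorted Path objects."""
    # rglob("*.py") behaves like glob("**/*.py").
    return sorted(Path(root).rglob("*.py"))


def outside_git(paths):
    """Drop paths that have a '.git' component (filtering happens after the walk)."""
    return [p for p in paths if ".git" not in p.parts]
```

For example, `outside_git(python_sources("."))` lists the Python files of a checkout without the contents of its ".git" directory, at the cost of having visited that directory anyway.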
(Wow, what a rambling message. I'm not sure which part you hope to see addressed.) On Tue, Dec 22, 2015 at 1:54 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
The docs make no attempt at showing the common patterns. The API described looks horribly complex (I guess that's what you get when all that matters is efficient implementation).
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).
Why does this use a with *and* a for-loop? Is there some terribly important cleanup that needs to happen when the for-loop is aborted? It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?) Of course that can all be cleaned up easily enough -- it's a simple matter of API design.
I've never seen that example. But just a few days ago I wrote a little bit of code where the os.walk() API came in handy:

    for root, dirs, files in os.walk(arg):
        print("Scanning %s (%d files):" % (root, len(files)))
        for file in files:
            process(os.path.join(root, file))

(The point is not that we have access to dirs separately, but that we have the directories filtered out of the count of files.)
Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall() is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).
Oh, I'd forgotten about pathlib.Path.rglob(). Maybe the OP also didn't know about it? He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure. He also insisted on staying within the Path framework, which is an indication that maybe what we're really looking for here is the hybrid of walk/scandir/Path that I was trying to allude to above.
Seems the OP disagrees with you here -- he really wants to use pathlib (as was clear from his response to a suggestion to use fnmatch). Truly pushing for adoption of a new abstraction like this takes many years -- pathlib was new (and provisional) in 3.4 so it really hasn't been long enough to give up on it. The OP hasn't! So, perhaps the pathlib.Path class needs to have some way to take in a DirEntry produced by os.scandir() and a flag to allow it to cache stat() results? Then we could easily write a pathlib.walk() function that's like os.walk() but returning caching Path objects. -- --Guido van Rossum (python.org/~guido)
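One way to picture the hybrid of walk, scandir and Path discussed above is the sketch below. Nothing in it is an existing pathlib API: `CachedEntry` and `pathwalk` are hypothetical names, and the design (a snapshot object holding a Path plus the `os.DirEntry` that produced it) is only one possible reading of the idea. It relies on the fact that `os.DirEntry` caches the results of `is_dir()`, `is_file()` and `stat()`.

```python
import os
from pathlib import Path


class CachedEntry:
    """Hypothetical snapshot object: a Path plus the DirEntry that produced it."""

    def __init__(self, entry):
        self.path = Path(entry.path)
        self._entry = entry  # os.DirEntry caches is_dir()/is_file()/stat() results

    def is_dir(self):
        return self._entry.is_dir(follow_symlinks=False)

    def is_file(self):
        return self._entry.is_file(follow_symlinks=False)

    def stat(self):
        # Repeated calls reuse the cached result instead of hitting the disk.
        return self._entry.stat(follow_symlinks=False)


def pathwalk(top, prune=lambda entry: False):
    """Yield a CachedEntry for everything under top, skipping pruned directories."""
    with os.scandir(str(top)) as it:
        entries = list(it)
    for raw in entries:
        entry = CachedEntry(raw)
        if entry.is_dir():
            if prune(entry):
                continue  # never descend into a pruned directory
            yield entry
            yield from pathwalk(raw.path, prune)
        else:
            yield entry
```

The caching lives in the object returned by the walk rather than in `Path` itself, so ordinary `Path` objects stay live; the trade-off is that callers receive snapshot data and must know it.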
On 23/12/15 00:23, Guido van Rossum wrote:
Yes please. I raised this recently in a thread that died (but with no negative responses - see below). I started looking at the various modules to try to bring the whole thing together into a reasonable proposal, but it was just a can of worms (glob, fnmatch, pathlib, os.scandir, os.walk, os.fwalk, fts ...). I'm afraid I don't have the free cycles to try to tackle that, so I ducked out. It would be great if all of that could be somehow brought together into a cohesive filesystem module. On 27/11/15 13:49, Eric Fahlgren wrote:
E.
On Dec 22, 2015, at 16:23, Guido van Rossum <guido@python.org> wrote:
(Wow, what a rambling message. I'm not sure which part you hope to see addressed.)
I don't know that anything actually needs to be addressed here at all. Struggling to see the real problem that needs to be solved means a bit of guesswork at what's relevant to the solution...
Yes, that's why I gave a few examples, using my stripped-down and Pythonicized wrapper, so you don't have to work it all out from scratch by trying to read the manpage and guess how you'd use it in C. But the point is, that's what something as flexible as find looks like as a function.
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).
Yes--as I said below, sometimes you really do want to go a directory at a time, and for that, it's hard to beat the API of os.walk. But when it's unnecessary, it makes the code look more complicated than necessary, so a flat iteration can be nicer. And, significantly, that, and the need to join all over the place, are the only things I can imagine that people would find worth "solving" about os.walk's API.
Same reason this code uses with and a for loop:

    with open(path) as f:
        for line in f:
            do_stuff(line)

Cleaning up a file handle isn't _terribly_ important, but it's not _unimportant_, and isn't it generally a good habit?
It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?)
Explaining the details of the API design takes this even farther off-topic, but: my initial design was based on the same Path class that the stdlib's Path is: a subclass of str that adds attributes/properties for things that are immediately available and methods for things that aren't. (The names are de-abbreviated versions of the C names.) As for stat, for one thing, people already have code (and mental models) to deal with stat (named)tuples. Plus, if you request a fast walk without stat information (which often goes considerably faster than scandir--I've got a Python tool that actually _beats_ the find invocation it replaced), or the stat on a file fails, I think it's clearer to have "stat" be None than to have 11-18 arbitrary attributes be None while the rest are still there. At any rate, I was planning to take another pass at the design after finishing the Windows and generic implementations, but the project I was working on turned out to need this only for OS X, so I never got to that point.
The first example under os.walk in the library docs is identical to the wiki spool example, except the first line points at subpackages of the stdlib email package instead of the top email spool directory, and an extra little bit was added at the end:

    import os
    from os.path import join, getsize

    for root, dirs, files in os.walk('python/Lib/email'):
        print(root, "consumes", end=" ")
        print(sum(getsize(join(root, name)) for name in files), end=" ")
        print("bytes in", len(files), "non-directory files")
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories

So, take that instead. Perfectly good example. And, while you could write that with a flat Iterator in a number of ways, none are going to be as simple as with two levels.
The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler.
Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall() is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).
The question is what code that uses (duck-typed) Path objects expects. I'm pretty sure there was extensive discussion of why Paths should never cache during the PEP 428 discussions, and I vaguely remember both Antoine Pitrou and Nick Coghlan giving good summaries more recently, but I don't remember enough details to say whether a duck-typed Path-like object would be just as bad. But I'm guessing it could have the same problems--if some function takes a Path object, stores it for later, and expects to use it to get live info, handing it something that quacks like a Path but returns snapshot info instead would be pretty insidious.
Or just Path.glob with ** in the pattern.
Maybe the OP also didn't know about it?
So, did Antoine Pitrou already solve this problem 3 years ago (or Jason Orendorff many years before that), possibly barring a minor docs tweak, or is there still something to consider here?
He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure.
I agree with everything here. I believe Path.glob can do everything he needs, and what he asked for instead couldn't do any more. It's dead-easy to imperatively apply a regex to decide whether to prune each dir in walk (or fts). Or to do the same to the joined path or the abspath. Or to use fnmatch instead of regex, or an arbitrary predicate function. Or to reverse the sense to mean only recurse on these instead of skip these. Imagine what a declarative API that allowed all that would look like. Even find doesn't have any of those options (at least not portably), and most people have to read guides to the manpage before they can read the manpage. At any rate, there's no reason you couldn't add some regex methods to Path and/or special Path handling code to regex to make that imperative code slightly easier, but I don't see how "pattern.match(str(path))" is any worse than "os.scandir(str(path))" or "json.load(str(path))" or any of the zillion other places where you have to convert paths to strings explicitly, or what makes regex more inherently path-related than those things.
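The imperative pruning described above might look like the following sketch. The function `find` and the predicates `is_git_dir` and `is_python_file` are hypothetical names chosen for illustration; the only library behaviour relied on is in-place editing of `dirs` under `os.walk()`, plus `re` and `fnmatch`. Any callable can serve as a predicate, which is the flexibility a declarative API would struggle to match.

```python
import fnmatch
import os
import re
from pathlib import Path


def find(top, prune=None, match=None):
    """Yield file Paths under top (hypothetical helper, not a stdlib API).

    prune(path) -> True means "do not descend into this directory";
    match(path) -> True means "yield this file". Both are optional.
    """
    for root, dirs, files in os.walk(str(top)):
        if prune is not None:
            # Pruning happens before os.walk() recurses, so pruned trees cost nothing.
            dirs[:] = [d for d in dirs if not prune(Path(root, d))]
        for name in files:
            path = Path(root, name)
            if match is None or match(path):
                yield path


# Example predicates: a regex on the joined path, or fnmatch on the bare name.
git_dir = re.compile(r"(^|[\\/])\.git$")


def is_git_dir(path):
    return git_dir.search(str(path)) is not None


def is_python_file(path):
    return fnmatch.fnmatch(path.name, "*.py")
```

Reversing the sense ("only recurse into these") is a one-line change to the list comprehension, and the regex still has to be applied to `str(path)`, as noted above.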
On Tue, Dec 22, 2015 at 4:23 PM, Guido van Rossum <guido@python.org> wrote:
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense,
indeed, but not always, so a simple API that allows you to get a flat walk would be nice....
Of course for that basic use case, you could just write your own wrapper around os.walk:
sure, but having to write "little" wrappers for common needs is unfortunate...
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself. I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
Truly pushing for adoption of a new abstraction like this takes many years -- pathlib was new (and provisional) in 3.4 so it really hasn't been long enough to give up on it. The OP hasn't!
it will take many years for sure -- but the standard library could at least adopt it as much as possible. Path.walk would be a nice start :-)
My example: one of our sysadmins wanted a little script to go through an entire drive (Windows), and check if any paths were longer than 256 characters (Windows, remember..) I came up with this:

    def get_all_paths(start_dir='/'):
        for dirpath, dirnames, filenames in os.walk(start_dir):
            for filename in filenames:
                yield os.path.join(dirpath, filename)

    too_long = []
    for p in get_all_paths('/'):
        print("checking:", p)
        if len(p) > 255:
            too_long.append(p)
            print("Path too long!")

way too wordy! I started with pathlib, but that just made it worse. now that I think about it, maybe I could have simply used pathlib.Path.rglob.... However, when I try that, I get a permission error:

    /Users/chris.barker/miniconda2/envs/py3/lib/python3.5/pathlib.py in wrapped(pathobj, *args)
        369     @functools.wraps(strfunc)
        370     def wrapped(pathobj, *args):
    --> 371         return strfunc(str(pathobj), *args)
        372     return staticmethod(wrapped)
        373
    PermissionError: [Errno 13] Permission denied: '/Users/.chris.barker.xahome/caches/opendirectory'

as the error comes inside the rglob() generator, I'm not sure how to tell it to ignore and move on.... os.walk is somehow able to deal with this.
-CHB
--
Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
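On the permission error: by default `os.walk()` ignores the `OSError` raised when a directory cannot be listed, which is why it carries on where the rglob() generator stops. Its documented `onerror` parameter accepts a callback that receives the exception. The sketch below reuses the `get_all_paths` name from the message above; the `errors` parameter is an addition made for this example.

```python
import os


def get_all_paths(start_dir, errors=None):
    """Yield every file path under start_dir; record unreadable directories.

    The errors list is a hypothetical addition for illustration.
    """
    def remember(exc):
        # os.walk() passes the OSError raised while listing a directory.
        if errors is not None:
            errors.append(exc)

    for dirpath, dirnames, filenames in os.walk(start_dir, onerror=remember):
        for filename in filenames:
            yield os.path.join(dirpath, filename)
```

Passing a list as `errors` makes it possible to report the skipped directories after the walk; leaving it as `None` reproduces the default behaviour of silently moving on.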
On Dec 28, 2015, at 11:25, Chris Barker <chris.barker@noaa.gov> wrote:
You're replying to me, not Guido, here... Anyway, if the only thing anyone will ever need is a handful of simple one-liners that even a novice could write, maybe it's reasonable to just add one to the docs to show how to do it, instead of adding them to the stdlib.
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself.
But first you have to solve the problem that paragraph was all about: a general-purpose walk API shouldn't be throwing away all that stat information it wasted time fetching, but the pathlib module is designed around Path objects that are always live, not snapshots. If Path.walk yields something that isn't a Path, what's the point?
I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
I have the same impression as you, but, as Guido says, let's give it time before judging...
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
So far things have gone the opposite direction: open requires strings, but there's a Path.open method; walk requires strings, but people are proposing a Path.walk method; etc. I'm not sure how that's supposed to extend to things like json.load or NamedTemporaryFile.name.
Do you really want it to print out "Path too long!" hundreds of times? If not, this is a lot more concise, and I think readable, with comprehensions:

    walk = os.walk(start_dir)
    files = (os.path.join(root, file) for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(file) > 255)

And now you've got a lazy Iterator over your too-long files. (If you need a list, just use a listcomp instead of a genexpr in the last step.)
way too wordy!
I started with pathlib, but that just made it worse.
If we had a Path.walk, I don't think it could be that much better than the original version, since the only thing Path can help with is making that join a bit shorter--and at the cost of having to convert to str to check len():

    walk = start_path.walk()
    files = (root / file for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(str(file)) > 255)

As a side note, there's no Windows restriction to 255 _characters_, it's to 255 UTF-16 code points, just under 64K UTF-16 code points, or 255 codepage bytes, depending on which API you use. So you really want something like len(file.encode('utf-16')) / 2 > 255. Also, I suspect you want either the bare filename or the abspath, not the path from the start dir (especially since a path rooted at the default '/' is two characters shorter than one rooted at 'C:\', so you're probably going to pass a bunch of files that then cause problems in your scripts).
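A small sketch of the length check in UTF-16 units, under two assumptions: the function names are invented for the example, and the limit of 255 is simply the figure used in this thread rather than an authoritative Windows constant. The "utf-16-le" codec is used because plain "utf-16" prepends a two-byte byte-order mark, which would make every result one unit too large.

```python
def utf16_length(text):
    """Return the number of UTF-16 code units needed to encode text."""
    # "utf-16-le" avoids the byte-order mark that plain "utf-16" prepends;
    # "surrogatepass" lets lone surrogates (possible in Windows file names) through.
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def too_long(paths, limit=255):
    """Lazily yield the paths whose UTF-16 length exceeds limit (assumed value)."""
    return (p for p in paths if utf16_length(str(p)) > limit)
```

Characters outside the Basic Multilingual Plane take two code units each, so a path can exceed the limit in UTF-16 units while `len()` on the Python string still reports a smaller number.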
Not sure how useful this is, but I ended up writing my own "pythonic find" module: https://github.com/moloney/pathmap/blob/master/pathmap.py I was mostly worried about minimizing stat calls, so I used scandir rather than pathlib. The only documentation is the doc strings, but the basic idea is you can have one "matching" rule and any number of ignore/prune rules. The rules can be callables or strings that are treated as regular expressions (I suppose it might be better if the default was to treat strings as glob expressions instead...). So for the original use case that spawned this thread, you would do something like:

    pm = PathMap(prune_rules=[r'/\.git$'])
    for match in pm.matches(['path/to/some/dir']):
        if not match.dir_entry.is_dir():
            print(match.path)

Or if you wanted to do something similar but only print names of python modules it would be something like:

    pm = PathMap(r'.+/(.+)\.py$', prune_rules=[r'/\.git$'])
    for match in pm.matches(['path/to/some/dir']):
        if not match.dir_entry.is_dir():
            print(match.match_info[1])
On Mon, Dec 28, 2015 at 2:43 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
I was intending to reply to the list :-)
well, it's a four liner, yes? but I'm not sure i agree -- the simple things should be simple. even if you can find the couple-liner in the docs, you've still got a lot more overhead than calling a ready-to-go function. and it's not like it'd be a heavy maintenance burden....
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself.
But first you have to solve the problem that paragraph was all about: a general-purpose walk API shouldn't be throwing away all that stat information it wasted time fetching, but the pathlib module is designed around Path objects that are always live, not snapshots. If Path.walk yields something that isn't a Path, what's the point?
OK -- you've gotten out of my technical depth now.....so I'll just shut up. But at the end of the day, if you've got the few-liner in the docs that works, maybe it's OK that it's not optimized.....
I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
I have the same impression as you, but, as Guido says, let's give it time before judging...
time good -- but also maybe some more work to make it easy to use with rest of the stdlib. I will say that one thing that bugs me about the "old style" os.path functions is that I find myself stringing them together, and that gets really ugly fast:

    my_path = os.path.join(os.path.split(something)[0], something_else)

here's where an OO interface is much nicer.
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
So far things have gone the opposite direction: open requires strings, but there's a Path.open method;
This sure feels to me like the wrong way to go -- too OO-heavy: create a Path object, then use it to open a file. which is why we still have the regular old open() that takes strings. I just finished teaching an intro to Python class, using py3 for the first time -- I found myself pointing students to pathlib, but then never using it in any examples, etc. That may be my old habits, but I really think we do have an ugly mix of APIs here.
walk requires strings, but people are proposing a Path.walk method; etc.
well, walk "feels" to me like a path-y operation. whereas open() does not.
I'm not sure how that's supposed to extend to things like json.load or NamedTemporaryFile.name.
exactly -- that's why open() doesn't feel path-y to me. you have all sorts of places where you might want to open a file, and you want to open other things as well. And I like APIs that let you pass in either an open file-like object, OR a path -- so it seems allowing either a Path object or a path-in-a-string would be good. so my "proposal" is to go through the stdlib and add the ability to accept a Path object everywhere a string path is accepted. (hmm -- could you simply wrap str() around the input?)
My example: one of our sysadmins wanted a little script to go through an entire drive (Windows), and check if any paths were longer than 256 characters (Windows, remember..) I came up with this:

    def get_all_paths(start_dir='/'):
        for dirpath, dirnames, filenames in os.walk(start_dir):
            for filename in filenames:
                yield os.path.join(dirpath, filename)

    too_long = []
    for p in get_all_paths('/'):
        print("checking:", p)
        if len(p) > 255:
            too_long.append(p)
            print("Path too long!")
Do you really want it to print out "Path too long!" hundreds of times?
well, not in production, no, but was nice to test -- also, in theory, there shouldn't be many!
If not, this is a lot more concise, and I think readable, with comprehensions:
    walk = os.walk(start_dir)
    files = (os.path.join(root, file) for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(file) > 255)

thanks -- should have thought of that -- though that was to pass off to a sysadmin that doesn't know much python -- harder for him to read??
yup -- probably I'd write it out to a file in the real use case. or stdout.
way too wordy! I started with pathlib, but that just made it worse.
If we had a Path.walk, I don't think it could be that much better than the original version,
sure -- the wordiness comes from the fact that you have to deal with dirs and files separately.
since the only thing Path can help with is making that join a bit shorter--and at the cost of having to convert to str to check len():
maybe another argument for why Path doesn't buy much over string paths...
what I really want here is:

    too_long = (filepath for filepath in Path(root) if len(filepath) > 255)

I know python isn't a shell scripting language but it is a one liner in powershell or bash, or....
As a side note, there's no Windows restriction to 255 _characters_, it's to 255 UTF-16 code points,
IIUC, neither Windows itself nor NTFS has this restriction, but some older utilities do -- really pathetic. And I asked our sysadmin about the unicode issue, and he had no idea.
just under 64K UTF-16 code points,
how is a codepoint different than a character???? I was wondering if it was a bytes restriction or codepoint restriction?
or 255 codepage bytes, depending on which API you use.
this is where it gets ugly -- who knows what API some utility is using??? So you really want something like len(file.encode('utf-16-le')) // 2 > 255. but can't some characters use more than 2 bytes in utf-16? or is that what you're trying to catch here? Also, I suspect you want either the bare filename or the abspath, not the
path from the start dir (especially since a path rooted at the default '/' is two characters shorter than one rooted at 'C:\',
well, the startdir would be C:\ and now I'm confused about whether the "C:\" is part of the 255-something restriction! anyway, WAY OT -- and if this is used it will be mainly to flag potential problems, not really a robust test. Thanks, -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
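The code point versus character question above can be checked directly in Python. This is a minimal sketch, assuming the limit is counted in UTF-16 code units and taking the thread's 255 purely as a placeholder threshold; the helper names `utf16_units` and `find_long_paths` are made up for illustration and are not from the thread.

```python
import os


def utf16_units(text):
    """Return the number of UTF-16 code units needed to encode text."""
    # 'utf-16-le' omits the byte order mark that plain 'utf-16' prepends,
    # so the byte count divided by two is exactly the code unit count.
    # 'surrogatepass' lets odd file names with lone surrogates through.
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def find_long_paths(start_dir, limit=255):
    """Yield file paths under start_dir longer than limit code units."""
    for dirpath, dirnames, filenames in os.walk(start_dir):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if utf16_units(path) > limit:
                yield path
```

A character outside the Basic Multilingual Plane is one code point (so `len()` counts it once) but takes two UTF-16 code units, which is why the two measures can disagree. Whether a given tool counts code units, code points, or bytes is exactly the open question in the message above.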
data:image/s3,"s3://crabby-images/2dd36/2dd36bc2d30d53161737124e2d8ace2b4b4ce052" alt=""
On Dec 28, 2015 2:33 PM, "Chris Barker" <chris.barker@noaa.gov> wrote:
On Tue, Dec 22, 2015 at 4:23 PM, Guido van Rossum <guido@python.org>
wrote:
The path.py .walk* APIs work great w/ fnmatch: https://pythonhosted.org/path.py/api.html#path.Path.walk https://pythonhosted.org/path.py/api.html#path.Path.walkdirs https://pythonhosted.org/path.py/api.html#path.Path.walkfiles
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Dec 28, 2015, at 16:50, Wes Turner <wes.turner@gmail.com> wrote:
The path module has some major differences. First, because it doesn't use scandir or anything else to avoid multiple stat calls, the caching issue doesn't come up. Also, because its Path subclasses str, it doesn't have the same usability issues (you can pass a Path straight to json.loads, for example), although of course that gives it different usability issues (e.g., inherited methods like Path.count are an obvious attractive nuisance). Also, it doesn't handle case sensitivity as automagically. Also, it's definitely the kind of "kitchen sink" design that got PEP 355 rejected (which often makes sense for a third-party lib even when it doesn't for a stdlib module). So, not everything that makes sense for path will also make sense for pathlib. But it's still worth looking at.
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
Following up on this, in theory the right way to walk a tree using pathlib already exists, it's the rglob() method. E.g. all paths under /foo/bar should be found as follows:

    for path in pathlib.Path('/foo/bar').rglob('**/*'):
        print(path)

The PermissionError bug you found is already reported: http://bugs.python.org/issue24120 -- it even has a patch but it's stuck in review. Sadly there's another error: loops introduced by symlinks cause infinite recursion. I filed that here: http://bugs.python.org/issue26012. (The fix should be judicious use of is_symlink(), but the code is a little convoluted.) On Mon, Dec 28, 2015 at 11:25 AM, Chris Barker <chris.barker@noaa.gov> wrote:
-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Mon, Jan 4, 2016 at 9:25 PM, Guido van Rossum <guido@python.org> wrote:
Whoops, I just realized that I combined two ways of doing a recursive glob here. It should be either rglob('*') or plain glob('**/*'). What I wrote produces identical results, but at the cost of a lot of caching. :-) Note that the PEP doesn't mention rglob() -- why do we even have it? It seems rglob(pat) is exactly the same as glob('**/' + pat) (assuming os.sep is '/'). No TOOWTDI here? -- --Guido van Rossum (python.org/~guido)
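The claimed equivalence between rglob(pat) and glob('**/' + pat) is easy to try out. The sketch below builds a throwaway tree and compares the two calls; `build_tree` and `compare_globs` are hypothetical helper names, and comparing as sets deliberately ignores ordering.

```python
import pathlib
import tempfile


def build_tree(root):
    """Create a tiny directory tree under root for the comparison."""
    (root / "a" / "b").mkdir(parents=True)
    (root / "top.txt").write_text("x")
    (root / "a" / "mid.py").write_text("x")
    (root / "a" / "b" / "deep.py").write_text("x")


def compare_globs(root, pattern="*"):
    """Return rglob(pattern) and glob('**/' + pattern) results as sets."""
    via_rglob = set(root.rglob(pattern))
    via_glob = set(root.glob("**/" + pattern))
    return via_rglob, via_glob


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    build_tree(root)
    everything_r, everything_g = compare_globs(root)
    py_r, py_g = compare_globs(root, "*.py")
    py_names = sorted(p.name for p in py_r)
```

On this small tree the two spellings should select the same paths, which matches the observation in the message; it says nothing about their relative speed.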
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
Note that the PEP doesn't mention rglob() -- why do we even have it? It seems rglob(pat) is exactly the same as glob('**/' + pat) (assuming os.sep is '/'). No TOOWTDI here?
Much as I believe in TOOWTDI, I like having rglob(). "**/" is the kind of magic a newbie ( like me :-) ) would have to research and understand. -CHB
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 8:37 AM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
Sure. It's too late to remove it anyway. Is there anything actionable here besides fixing the PermissionError and the behavior under symlink loops? IMO if you want files only or directories only you can just add a filter using e.g. is_dir():

    p = pathlib.Path.cwd()
    real_dirs = [p for p in p.rglob('*') if p.is_dir() and not p.is_symlink()]

-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/d321f/d321fa7003d562bee34e7f927e1ab5de19f84557" alt=""
The main issue is the lack of stat caching. That is why I wrote my own module around scandir which includes the DirEntry objects for each path so that the consumer can also do stuff with the cached stat info (like check if it is a file or directory). Often we won't need to call stat on the path at all, and if we do it will only be once. Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health Science University
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 12:27 PM, Brendan Moloney <moloney@ohsu.edu> wrote:
I wonder if stat() caching shouldn't be made an orthogonal optional feature of Path objects somehow; it keeps coming back as useful in various cases even though we don't want to enable it by default. One problem with stat() caching is that Path objects are considered immutable, and two Path objects referring to the same path are completely interchangeable. For example, {pathlib.Path('/a'), pathlib.Path('/a')} is a set of length 1: {PosixPath('/a')}. But if we had e.g. Path('/a', cache_stat=True), the behavior of two instances of that object might be observably different (if they were instantiated at times when the contents of the filesystem were different). So maybe stat-caching Path instances should be considered unequal, or perhaps unhashable. Or perhaps they should only be considered equal if their stat() values are actually equal (i.e. if the file's stat() info didn't change). So this is a thorny issue that requires some real thought before we commit to an API. We might also want to create Path instances directly from DirEntry objects. (Interestingly, the DirEntry API seems to be a subset of the Path API, except for the .path attribute which is equivalent to the str() of a Path object.) Maybe some of this can be done first as a 3rd party module forked from the original 3rd party pathlib? https://bitbucket.org/pitrou/pathlib/ seems reasonably up to date. -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/2658f/2658f17e607cac9bc627d74487bef4b14b9bfee8" alt=""
Guido van Rossum wrote:
Maybe path.stat() could return a PathWithStat object that inherits from Path and can do everything that a Path can do, but also contains cached stat info and has a suitable set of attributes for accessing it. This would make it clear at what point in time the info is valid for, i.e. the moment you called stat(). It would also provide an obvious way to refresh the info: calling path_with_stat.stat() would give you a new PathWithStat containing updated info. Things like scandir could then return pre-populated PathWithStat objects. -- Greg
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 2:59 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Well, Path.stat() is already defined and returns the same type of object that os.stat() returns, and I don't think we should change that. We could add a new method that does this, but as long as it inherits from Path it wouldn't really address the issue with objects being == to each other but holding different stat info.
I presume you are proposing a new Path.scandir() method -- the existing os.scandir() method already returns DirEntry objects which we really don't want to change at this point. -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Wed, Jan 6, 2016 at 8:11 AM, Random832 <random832@fastmail.com> wrote:
It would have to use a weak dict so if the last reference goes away it discards the cached stats for a given path, otherwise you'd have trouble containing the cache size. And caching Path objects should still not be comparable to non-caching Path objects (which we will need to preserve the semantics that repeatedly calling stat() on a Path object created the default way will always redo the syscall). The main advantage would be that caching Path objects could be compared safely. It could still cause unexpected results. E.g. if you have just traversed some big tree using caching, and saved some results (so hanging on to some paths and hence their stat() results), and then you make some changes and traverse it again to look for something else, you might accidentally be seeing stale (i.e. cached) stat() results. Maybe there's a middle ground, where the user can create a StatCache object and pass it into Path creation and traversal operations. Paths with the same StatCache object (or both None) compare equal if their path components are equal. Paths with different StatCache objects never compare equal (but otherwise are ordered by path as usual -- the StatCache object's identity is only used when the paths are equal). Are you (or anyone still reading this) interested in implementing this idea? -- --Guido van Rossum (python.org/~guido)
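The "middle ground" can be made concrete with a toy model. This is only a sketch of the equality rule described above, not the prototype attached to issue26031: `StatCache` and `CachingPath` are hypothetical classes, the cache is a plain dict rather than a weak dict, and ordering is left out.

```python
import os
import pathlib


class StatCache:
    """Hypothetical cache of os.stat() results, keyed by path string."""

    def __init__(self):
        self._results = {}

    def stat(self, path):
        """Return the cached stat result, calling os.stat() only once."""
        key = os.fspath(path)
        try:
            return self._results[key]
        except KeyError:
            result = self._results[key] = os.stat(key)
            return result

    def clear(self):
        """Forget all results so later lookups hit the filesystem again."""
        self._results.clear()


class CachingPath:
    """Hypothetical wrapper pairing a Path with an optional StatCache."""

    def __init__(self, path, cache=None):
        self.path = pathlib.Path(path)
        self.cache = cache

    def stat(self):
        # Without a cache, behave like a normal Path and redo the syscall.
        if self.cache is None:
            return self.path.stat()
        return self.cache.stat(self.path)

    def __eq__(self, other):
        if not isinstance(other, CachingPath):
            return NotImplemented
        # Equal only when both share the same cache object (or both have
        # none) and the underlying paths are equal.
        return self.cache is other.cache and self.path == other.path

    def __hash__(self):
        return hash((self.path, id(self.cache)))
```

With this model, two wrappers sharing one cache compare equal and see the same (possibly stale) stat result until `clear()` is called, while wrappers with different caches never compare equal. That is the staleness trade-off the message warns about.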
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
I couldn't help myself and coded up a prototype for the StatCache design I sketched. See http://bugs.python.org/issue26031. Feedback welcome! On my Mac it only seems to offer limited benefits though... On Wed, Jan 6, 2016 at 8:48 AM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/d321f/d321fa7003d562bee34e7f927e1ab5de19f84557" alt=""
It's important to keep in mind the main benefit of scandir is you don't have to do ANY stat call in many cases, because the directory listing provides some subset of this info. On Linux you can at least tell if a path is a file or directory. On Windows there is much more info provided by the directory listing. Avoiding subsequent stat calls is also nice, but not nearly as important due to OS level caching. Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health Science University
data:image/s3,"s3://crabby-images/f81c3/f81c349b494ddf4b2afda851969a1bfe75852ddf" alt=""
On Wed, Jan 6, 2016 at 3:05 PM Brendan Moloney <moloney@ohsu.edu> wrote:
+1 - this was one of the two primary motivations behind scandir. Anything trying to reimplement a filesystem tree walker without using scandir is going to have sub-standard performance. If we ever offer anything with "find like functionality" related to pathlib, it *needs* to be based on scandir. Anything else would just be repeating the convenient but untrue limiting assumptions of os.listdir: That the contents of a directory can be loaded into memory and that we don't mind re-querying the OS for stat information that it already gave us but we threw away as part of reading the directory. -gps
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On 11 January 2016 at 18:57, Gregory P. Smith <greg@krypto.org> wrote:
This is very much why I feel that we need something in pathlib. I understand the motivation for not caching stat information in path objects. And I don't have a viable design for how a "find-like functionality" API should be implemented in pathlib. But as it stands, I feel as though using pathlib for anything that does bulk filesystem scans is deliberately choosing something that I know won't scale well. So (in my mind) pathlib doesn't fulfil the role of "one obvious way to do things". Which is a shame, because Path.rglob is very often far closer to what I need in my programs than os.walk (even when it's just rootpath.rglob('*')).

In practice, by far the most common need I have[1] for filetree walking is to want to get back a list of all the names of files starting at a particular directory with the returned filenames *relative to the given root*. Pathlib.rglob gives absolute pathnames. os.walk gives the absolute directory name and the base filename. Neither is what I want, although obviously in both cases it's pretty trivial to extract the "relative to the root" part from the returned data. But an API that gave that information directly, with scandir-level speed and scalability, in the form of pathlib.Path relative path objects, would be ideal for me[2].

Paul

[1] And yes, I know this means I should just write a utility function for it :-)
[2] The feature creep starts when people want to control things like pruning particular directories such as '.git', or only matching particular glob patterns, or choosing whether or not to include directories in the output, or... Adding *those* features without ending up with a Frankenstein's monster of an API is the challenge :-)
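The utility function mentioned in footnote [1] might look roughly like this. It is a sketch under stated assumptions: `iter_relative_files` and its `prune` parameter are invented for illustration, symlinks are skipped rather than followed, and errors such as PermissionError are not handled.

```python
import os
import pathlib


def iter_relative_files(root, prune=(".git",)):
    """Yield regular files under root as Path objects relative to root.

    Directories whose name is in prune are skipped without being read.
    """
    root = os.fspath(root)
    stack = [""]  # relative directories still to scan; "" is root itself
    while stack:
        rel_dir = stack.pop()
        with os.scandir(os.path.join(root, rel_dir)) as entries:
            for entry in entries:
                rel_path = os.path.join(rel_dir, entry.name)
                # is_dir()/is_file() can usually be answered from the
                # directory listing itself, without an extra stat() call.
                if entry.is_dir(follow_symlinks=False):
                    if entry.name not in prune:
                        stack.append(rel_path)
                elif entry.is_file(follow_symlinks=False):
                    yield pathlib.Path(rel_path)
```

For example, `sorted(iter_relative_files('.'))` would list a checkout's files without ever opening `.git`. Adding glob matching or directory output is where the API growth described in footnote [2] would begin.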
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Mon, Jan 11, 2016 at 10:57 AM, Gregory P. Smith <greg@krypto.org> wrote:
And we already have this in the form of pathlib's [r]glob() methods. There's a patch to the glob module in http://bugs.python.org/issue25596 and as soon as that's committed I hope that its author(s) will work on doing a similar patch for pathlib's [r]glob (tracking this in http://bugs.python.org/issue26032). -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/2dd36/2dd36bc2d30d53161737124e2d8ace2b4b4ce052" alt=""
A bit OT, possibly, but this may be a long way around (to a cached *graph* of paths and metadata) with similar use cases: path.py#walk(), NetworkX edge, node dicts https://github.com/westurner/pyleset/blob/249a0837/structp/structp.py def walk_path_into_graph(g, path_, errors='warn'): """ """ This stats and reads limited image format metadata as CSV, TSV, JSON: https://github.com/westurner/image_size/blob/ab46de73/get_image_size.py I suppose because of race conditions this metadata should actually be stored in a filesystem triplestore with extended attributes and also secontext attributes. (... gnome-tracker reads filesystem stat data into RDF, for SPARQL). BSP vertex messaging can probably handle cascading cache invalidation (with supersteps). On Jan 6, 2016 4:44 PM, "Guido van Rossum" <guido@python.org> wrote:
data:image/s3,"s3://crabby-images/2dd36/2dd36bc2d30d53161737124e2d8ace2b4b4ce052" alt=""
The PyFilesystem filesystem abstraction APIs may also have / be in need of a sensible .walk() API http://pyfilesystem.readthedocs.org/en/latest/path.html#module-fs.path http://pyfilesystem.readthedocs.org/en/latest/interface.html walk() Like listdir() but descends in to sub-directories walkdirs() Returns an iterable of paths to sub-directories walkfiles() Returns an iterable of file paths in a directory, and its sub-directories On Jan 7, 2016 3:03 AM, "Wes Turner" <wes.turner@gmail.com> wrote:
data:image/s3,"s3://crabby-images/291c0/291c0867ef7713a6edb609517b347604a575bf5e" alt=""
Don't get me wrong, but both glob('**/*') and rglob('*') sound quite cryptic. Furthermore, globbing always sounds slow to me. Is it fast? And is there some way to leave out the '*' (three special characters for plain ol' everything)? And how can I walk directories only and files only? On 05.01.2016 07:49, Guido van Rossum wrote:
Best, Sven
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
Thanks for following up. it's the rglob() method. E.g. all paths under /foo/bar should be found as follows: for path in pathlib.Path('/foo/bar').rglob('**/*'): print(path) The PermissionError bug you found is already reported: http://bugs.python.org/issue24120 -- it even has a patch but it's stuck in review. Thanks for pinging that -- I had somehow assumed that the PermissionError was intentional. Sadly there's another error: loops introduced by symlinks cause infinite recursion. I filed that here: http://bugs.python.org/issue26012. (The fix should be judicious use of is_symlink(), but the code is a little convoluted.) Thanks, -CHB On Mon, Dec 28, 2015 at 11:25 AM, Chris Barker <chris.barker@noaa.gov> wrote:
-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Mon, Jan 4, 2016 at 9:25 PM, Guido van Rossum <guido@python.org> wrote:
[actually, rglob('*') or glob('**/*')]
I committed this fix.
I committed a fix for this too (turned out to need just one call to is_symlink()). I also added a .path attribute to pathlib.*Path objects, so that p.path == str(p). You can now use the idiom getattr(arg, 'path', arg) to extract the path from a pathlib.Path object, or from an os.DirEntry object, or fall back to a plain string, without using str(arg), which would turn *any* object into a string, which is never what you want to happen by default. These changes will be released in Python 3.4.5, 3.5.2 and 3.6. -- --Guido van Rossum (python.org/~guido)
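The getattr idiom from this message can be tried with os.DirEntry objects and plain strings. A minimal sketch follows; `as_path_string` is a hypothetical helper name. Note as an addition to the thread that Python 3.6 and later provide os.fspath() (PEP 519) for the same job, so the sketch does not rely on pathlib objects having a `.path` attribute.

```python
import os
import tempfile


def as_path_string(arg):
    """Return arg.path when that attribute exists, otherwise arg unchanged."""
    return getattr(arg, "path", arg)


with tempfile.TemporaryDirectory() as tmp:
    expected = os.path.join(tmp, "example.txt")
    open(expected, "w").close()
    with os.scandir(tmp) as entries:
        # os.DirEntry has a .path attribute, so the idiom extracts it.
        from_entry = as_path_string(next(entries))

# A plain string has no .path attribute and passes through unchanged.
from_string = as_path_string("plain/string")
# Unlike str(arg), the idiom does not turn unrelated objects into strings.
not_a_path = as_path_string(42)
```

The last line is the point made in the message: `str(42)` would quietly produce `'42'`, while the getattr idiom leaves the value alone so that a later call can reject it.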
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
The PermissionError bug you found is already reported
I committed this fix. Thanks! I also added a .path attribute to pathlib.*Path objects, so that p.path == str(p). You can now use the idiom getattr(arg, 'path', arg) to extract the path from a pathlib.Path object, or from an os.DirEntry object, or fall back to a plain string, without using str(arg), which would turn *any* object into a string, which is never what you want to happen by default. Very nice -- that opens the door to stdlib and third party modules taking Path objects in addition to strings. Maybe we will see greater adoption of pathlib after all! CHB These changes will be released in Python 3.4.5, 3.5.2 and 3.6. -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/a0158/a0158f39cfa5f57e13e5c95bfdd96446cf59500c" alt=""
Am 04.12.2015 um 20:00 schrieb Ram Rachum:
What do you think about implementing functionality similar to the `find` utility in Linux in the Pathlib module? I wanted this today, I had a script to write to archive a bunch of files from a folder, and I decided to try writing it in Python rather than in Bash. But I needed something stronger than `Path.glob` in order to select the files. I wanted a regular expression. (In this particular case, I wanted to get a list of all the files excluding the `.git` folder and all files inside of it.
Me, too. I miss a find-like method. I have used os.walk() for more than 10 years, but it still feels way too complicated. I asked about a library on softwarerecs some weeks ago: http://softwarerecs.stackexchange.com/questions/26296/python-library-for-tra... -- http://www.thomas-guettler.de/
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
The UNIX find tool has many, many options. For the general case it's probably easier to use os.walk(). But there are probably some common uses that deserve better direct support in e.g. the glob module. Would just a way to recursively search for matches using e.g. "**.txt" be sufficient? If not, can you specify what else you'd like? (Just " find-like" is too vague.) --Guido (mobile) On Dec 22, 2015 11:14 AM, "Thomas Güttler" <guettliml@thomas-guettler.de> wrote:
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Tuesday, December 22, 2015 12:14 PM, Guido van Rossum <guido@python.org> wrote:
The UNIX find tool has many, many options.
I think a Pythonicized, stripped-down version of the basic design of fts (http://man7.org/linux/man-pages/man3/fts.3.html) is as simple as you're going to get. After all, fts was designed to make it as easy as possible to implement find efficiently. In my incomplete Python wrapper around fts, the simplest use looks like:

    with fts(root) as f:
        for path in f:
            do_stuff(path)

No two-level iteration, no need to join the root to the paths, no handling dirs and files separately. Of course for that basic use case, you could just write your own wrapper around os.walk:

    def flatwalk(*args, **kwargs):
        # the os.walk() clause must come first so root and files are bound
        return (os.path.join(root, file)
                for root, dirs, files in os.walk(*args, **kwargs)
                for file in files)

But more complex uses build on fts pretty readably:

    # find "$@" -H -xdev -type f -mtime 1 -iname '*.pyc' -exec do_stuff '{}' \;
    # st_mtime is a float timestamp, so compare against a timestamp
    yesterday = (datetime.now() - timedelta(days=1)).timestamp()
    with fts(top, stat=True, crossdev=False) as f:
        for path in f:
            if path.is_file and path.stat.st_mtime < yesterday and path.lower().endswith('.pyc'):
                do_stuff(path)

When you actually need to go a directory at a time, like the spool directory size example in the stdlib, os.walk is arguably nicer, but fortunately os.walk already exists. The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler. Anyway, if you don't want either the efficiency or the simplicity, and just want an iterable of filenames or Paths, you might as well just use the wrapper around the existing os.walk that I wrote above.
To make it work with Path objects:

    def flatpathwalk(root, *args, **kwargs):
        return map(pathlib.Path, flatwalk(str(root), *args, **kwargs))

And then to use those Path objects:

    matches = (path for path in flatpathwalk(root) if pattern.match(str(path)))
pathlib already has a glob method, which handles '*/*.py' and even recursive '**/*.py' (and a match method to go with it). If that's sufficient, it's already there. Adding direct support for Path objects in the glob module would just be a second way to do the exact same thing. And honestly, if open, os.walk, etc. aren't going to work with Path objects, why should glob.glob? * Honestly, I think the problem here is that the pathlib module is just not useful. In a new language that used path objects--or, probably, URL objects--everywhere, it would be hard to design something better than pathlib, but as it is, while it's great for making really hairy path manipulation more readable, path manipulation never _gets_ really hairy, and os.path is already very well designed, and the fact that pathlib doesn't know how to interact with anything else in the stdlib or third-party code means that the wrapper stuff that constructs a Path on one end and calls str or bytes on the other end depending on which one you originally had adds as much complexity as you saved. But that's obviously off-topic here.
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
(Wow, what a rambling message. I'm not sure which part you hope to see addressed.) On Tue, Dec 22, 2015 at 1:54 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
The docs make no attempt at showing the common patterns. The API described looks horribly complex (I guess that's what you get when all that matters is efficient implementation).
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).
Why does this use a with *and* a for-loop? Is there some terribly important cleanup that needs to happen when the for-loop is aborted? It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?) Of course that can all be cleaned up easily enough -- it's a simple matter of API design.
I've never seen that example. But just a few days ago I wrote a little bit of code where the os.walk() API came in handy:

    for root, dirs, files in os.walk(arg):
        print("Scanning %s (%d files):" % (root, len(files)))
        for file in files:
            process(os.path.join(root, file))

(The point is not that we have access to dirs separately, but that we have the directories filtered out of the count of files.)
Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall() is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).
Oh, I'd forgotten about pathlib.Path.rglob(). Maybe the OP also didn't know about it? He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure. He also insisted on staying within the Path framework, which is an indication that maybe what we're really looking for here is the hybrid of walk/scandir/Path that I was trying to allude to above.
Seems the OP disagrees with you here -- he really wants to use pathlib (as was clear from his response to a suggestion to use fnmatch). Truly pushing for adoption of a new abstraction like this takes many years -- pathlib was new (and provisional) in 3.4 so it really hasn't been long enough to give up on it. The OP hasn't! So, perhaps the pathlib.Path class needs to have some way to take in a DirEntry produced by os.scandir() and a flag to allow it to cache stat() results? Then we could easily write a pathlib.walk() function that's like os.walk() but returning caching Path objects. -- --Guido van Rossum (python.org/~guido)
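A much simpler cousin of the proposed pathlib.walk() can be written today as a thin wrapper over os.walk(). This sketch does none of the stat caching suggested above; `path_walk` and its `prune` parameter are hypothetical names, and the pruning relies on the documented in-place editing of the directory list. (Python 3.12 later added a Path.walk() method, which this sketch does not depend on.)

```python
import os
import pathlib


def path_walk(top, prune=()):
    """Like os.walk(top), but yield each directory as a pathlib.Path.

    Directory names listed in prune are removed in place, so os.walk()
    never descends into them.
    """
    for dirpath, dirnames, filenames in os.walk(os.fspath(top)):
        # In-place edits only affect traversal with topdown=True (default).
        dirnames[:] = [name for name in dirnames if name not in prune]
        yield pathlib.Path(dirpath), dirnames, filenames


def files_outside(top, prune=(".git",)):
    """Return every file under top as a Path, skipping pruned directories."""
    return [directory / name
            for directory, _, filenames in path_walk(top, prune)
            for name in filenames]
```

This covers the original use case from the start of the thread (all files except those under `.git`) while staying in Path objects on the way out. The cost is that each later call such as `is_file()` on the results goes back to the filesystem.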
data:image/s3,"s3://crabby-images/c7e4c/c7e4c8efd2e64a9d78326eb21df4b68e38955c81" alt=""
On 23/12/15 00:23, Guido van Rossum wrote:
Yes please. I raised this recently in a thread that died (but with no negative responses - see below). I started looking at the various modules to try to bring the whole thing together into a reasonable proposal, but it was just a can of worms (glob, fnmatch, pathlib, os.scandir, os.walk, os.fwalk, fts ...). I'm afraid I don't have the free cycles to try to tackle that, so I ducked out. It would be great if all of that could be somehow brought together into a cohesive filesystem module. On 27/11/15 13:49, Eric Fahlgren wrote:
E.
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Dec 22, 2015, at 16:23, Guido van Rossum <guido@python.org> wrote:
(Wow, what a rambling message. I'm not sure which part you hope to see addressed.)
I don't know that anything actually needs to be addressed here at all. Struggling to see the real problem that needs to be solved means a bit of guesswork at what's relevant to the solution...
Yes, that's why I gave a few examples, using my stripped-down and Pythonicized wrapper, so you don't have to work it all out from scratch by trying to read the manpage and guess how you'd use it in C. But the point is, that's what something as flexible as find looks like as a function.
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense, and remarkably often there *is* something where the two-level iteration helps (otherwise I'm sure you'd see lots of code that's trying to recover the directory by parsing the path and remembering the previous path and comparing the two).
Yes--as I said below, sometimes you really do want to go a directory at a time, and for that, it's hard to beat the API of os.walk. But when it's unnecessary, it makes the code look more complicated than necessary, so a flat iteration can be nicer. And, significantly, that, and the need to join all over the place, are the only things I can imagine that people would find worth "solving" about os.walk's API.
Same reason this code uses with and a for loop:

    with open(path) as f:
        for line in f:
            do_stuff(line)

Cleaning up a file handle isn't _terribly_ important, but it's not _unimportant_, and isn't it generally a good habit?
It also shows off the arbitrariness of the fts API -- fts() seems to have a bunch of random keyword args to control a variety of aspects of its behavior and the returned path objects look like they have a rather bizarre API: e.g. why is is_file a property on path, mtime a property on path.stat, and lower() a method on path directly? (And would path also have an endswith() method directly, in case I don't need to lowercase it?)
Explaining the details of the API design takes this even farther off-topic, but: my initial design was based on the same Path class that the stdlib's Path is: a subclass of str that adds attributes/properties for things that are immediately available and methods for things that aren't. (The names are de-abbreviated versions of the C names.) As for stat, for one thing, people already have code (and mental models) to deal with stat (named)tuples. Plus, if you request a fast walk without stat information (which often goes considerably faster than scandir--I've got a Python tool that actually _beats_ the find invocation it replaced), or the stat on a file fails, I think it's clearer to have "stat" be None than to have 11-18 arbitrary attributes be None while the rest are still there. At any rate, I was planning to take another pass at the design after finishing the Windows and generic implementations, but the project I was working on turned out to need this only for OS X, so I never got to that point.
The first example under os.walk in the library docs is identical to the wiki spool example, except the first line points at subpackages of the stdlib email package instead of the top email spool directory, and an extra little bit was added at the end:

    import os
    from os.path import join, getsize

    for root, dirs, files in os.walk('python/Lib/email'):
        print(root, "consumes", end=" ")
        print(sum(getsize(join(root, name)) for name in files), end=" ")
        print("bytes in", len(files), "non-directory files")
        if 'CVS' in dirs:
            dirs.remove('CVS')  # don't visit CVS directories

So, take that instead. Perfectly good example. And, while you could write that with a flat Iterator in a number of ways, none are going to be as simple as with two levels.
The problem isn't designing a nice walk API; it's integrating it with pathlib.* It seems fundamental to the design of pathlib that Path objects never cache anything. But the whole point of using something like fts is to do as few filesystem calls as possible to get the information you need; if it throws away everything it did and forces you to retrieve the same information again (possibly even in a less efficient way), that kind of defeats the purpose. Even besides efficiency, having those properties all nicely organized and ready for you can make the code simpler.
Would it make sense to engage in a little duck typing and have an API that mimicked the API of Path objects but caches the stat() information? This could be built on top of scandir(), which provides some of the information without needing extra syscalls (depending on the platform). But even where a syscall() is still needed, this hypothetical Path-like object could cache the stat() result. If this type of result was only returned by a new hypothetical integration of os.walk() and pathlib, the caching would not be objectionable (it would simply be a limitation of the pathwalk API, rather than of the Path object).
The question is what code that uses (duck-typed) Path objects expects. I'm pretty sure there was extensive discussion of why Paths should never cache during the PEP 428 discussions, and I vaguely remember both Antoine Pitrou and Nick Coghlan giving good summaries more recently, but I don't remember enough details to say whether a duck-typed Path-like object would be just as bad. But I'm guessing it could have the same problems--if some function takes a Path object, stores it for later, and expects to use it to get live info, handing it something that quacks like a Path but returns snapshot info instead would be pretty insidious.
Or just Path.glob with ** in the pattern.
Maybe the OP also didn't know about it?
So, did Antoine Pitrou already solve this problem 3 years ago (or Jason Orendorff many years before that), possibly barring a minor docs tweak, or is there still something to consider here?
He claimed he just wanted to use regular expressions so he could exclude .git directories. To tell the truth, I don't have much sympathy for that: regular expressions are just too full of traps to make a good API for file matching, and it wouldn't even strictly be sufficient to filter the entire directory tree under .git unless you added matching on the entire path -- but then you'd still pay for the cost of traversing the .git tree even if your regex were to exclude it entirely, because the library wouldn't be able to introspect the regex to determine that for sure.
I agree with everything here. I believe Path.glob can do everything he needs, and what he asked for instead couldn't do any more. It's dead-easy to imperatively apply a regex to decide whether to prune each dir in walk (or fts). Or to do the same to the joined path or the abspath. Or to use fnmatch instead of regex, or an arbitrary predicate function. Or to reverse the sense to mean only recurse on these instead of skip these. Imagine what a declarative API that allowed all that would look like. Even find doesn't have any of those options (at least not portably), and most people have to read guides to the manpage before they can read the manpage. At any rate, there's no reason you couldn't add some regex methods to Path and/or special Path handling code to regex to make that imperative code slightly easier, but I don't see how "pattern.match(str(path))" is any worse than "os.scandir(str(path))" or "json.load(str(path))" or any of the zillion other places where you have to convert paths to strings explicitly, or what makes regex more inherently path-related than those things.
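As an illustration of the imperative pruning described above, here is a minimal sketch built on os.walk(). The function name find_files and its prune parameter are made up for this example; the only stdlib behaviour relied on is that os.walk(), in its default top-down mode, does not descend into directories removed from the dirnames list.

```python
import os
import re
from pathlib import Path


def find_files(top, prune=None):
    """Yield a Path for every file under top, skipping pruned directories.

    prune is an optional compiled regex matched against each directory
    name; a matching directory is never entered.
    """
    for root, dirnames, filenames in os.walk(str(top)):
        if prune is not None:
            # Slice assignment edits the list in place, which is what
            # tells os.walk() not to descend into the removed entries.
            dirnames[:] = [d for d in dirnames if not prune.fullmatch(d)]
        for name in filenames:
            yield Path(root, name)
```

For the original use case this would be called as `find_files('.', prune=re.compile(r'\.git'))`. Swapping the regex for fnmatch or an arbitrary predicate only changes the one filtering line.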
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
On Tue, Dec 22, 2015 at 4:23 PM, Guido van Rossum <guido@python.org> wrote:
The two-level iteration forced upon you by os.walk() is indeed often unnecessary -- but handling dirs and files separately usually makes sense,
indeed, but not always, so a simple API that allows you to get a flat walk would be nice....
Of course for that basic use case, you could just write your own wrapper around os.walk:
sure, but having to write "little" wrappers for common needs is unfortunate...
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself. I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
Truly pushing for adoption of a new abstraction like this takes many years -- pathlib was new (and provisional) in 3.4 so it really hasn't been long enough to give up on it. The OP hasn't!
it will take many years for sure -- but the standard library could at least adopt it as much as possible. Path.walk would be a nice start :-)

My example: one of our sysadmins wanted a little script to go through an entire drive (Windows), and check if any paths were longer than 256 characters (Windows, remember..) I came up with this:

    def get_all_paths(start_dir='/'):
        for dirpath, dirnames, filenames in os.walk(start_dir):
            for filename in filenames:
                yield os.path.join(dirpath, filename)

    too_long = []
    for p in get_all_paths('/'):
        print("checking:", p)
        if len(p) > 255:
            too_long.append(p)
            print("Path too long!")

way too wordy! I started with pathlib, but that just made it worse. now that I think about it, maybe I could have simply used pathlib.Path.rglob.... However, when I try that, I get a permission error:

    /Users/chris.barker/miniconda2/envs/py3/lib/python3.5/pathlib.py in wrapped(pathobj, *args)
        369     @functools.wraps(strfunc)
        370     def wrapped(pathobj, *args):
    --> 371         return strfunc(str(pathobj), *args)
        372     return staticmethod(wrapped)
        373
    PermissionError: [Errno 13] Permission denied: '/Users/.chris.barker.xahome/caches/opendirectory'

as the error comes inside the rglob() generator, I'm not sure how to tell it to ignore and move on.... os.walk is somehow able to deal with this.

-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
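On the last point: os.walk() copes with unreadable directories because, by default, it ignores the OSError raised while listing a directory and simply moves on. Its documented onerror parameter lets the caller see those errors instead of losing them. The sketch below assumes a hypothetical helper name, walk_collecting_errors, and just gathers both results.

```python
import os


def walk_collecting_errors(top):
    """Return (files, errors) for the tree rooted at top.

    files is a list of file paths; errors is a list of the OSError
    instances raised while listing directories (e.g. PermissionError).
    """
    errors = []
    files = []
    # os.walk() calls onerror with the OSError instance, then carries on
    # with the rest of the tree instead of aborting.
    for root, dirnames, filenames in os.walk(top, onerror=errors.append):
        files.extend(os.path.join(root, name) for name in filenames)
    return files, errors
```

Each collected error carries a filename attribute naming the directory that could not be listed, which is usually enough for a report at the end of the scan.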
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Dec 28, 2015, at 11:25, Chris Barker <chris.barker@noaa.gov> wrote:
You're replying to me, not Guido, here... Anyway, if the only thing anyone will ever need is a handful of simple one-liners that even a novice could write, maybe it's reasonable to just add one to the docs to show how to do it, instead of adding them to the stdlib.
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself.
But first you have to solve the problem that paragraph was all about: a general-purpose walk API shouldn't be throwing away all that stat information it wasted time fetching, but the pathlib module is designed around Path objects that are always live, not snapshots. If Path.walk yields something that isn't a Path, what's the point?
I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
I have the same impression as you, but, as Guido says, let's give it time before judging...
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
So far things have gone the opposite direction: open requires strings, but there's a Path.open method; walk requires strings, but people are proposing a Path.walk method; etc. I'm not sure how that's supposed to extend to things like json.load or NamedTemporaryFile.name.
Do you really want it to print out "Path too long!" hundreds of times? If not, this is a lot more concise, and I think readable, with comprehensions:

    walk = os.walk(start_dir)
    files = (os.path.join(root, file) for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(file) > 255)

And now you've got a lazy Iterator over your too-long files. (If you need a list, just use a listcomp instead of a genexpr in the last step.)
way too wordy!
I started with pathlib, but that just made it worse.
If we had a Path.walk, I don't think it could be that much better than the original version, since the only thing Path can help with is making that join a bit shorter--and at the cost of having to convert to str to check len():

    walk = start_path.walk()
    files = (root / file for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(str(file)) > 255)

As a side note, there's no Windows restriction to 255 _characters_, it's to 255 UTF-16 code points, just under 64K UTF-16 code points, or 255 codepage bytes, depending on which API you use. So you really want something like len(file.encode('utf-16-le')) / 2 > 255. Also, I suspect you want either the bare filename or the abspath, not the path from the start dir (especially since a path rooted at the default '/' is two characters shorter than one rooted at 'C:\', so you're probably going to pass a bunch of files that then cause problems in your scripts).
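The difference between counting characters and counting UTF-16 units can be shown with a small helper. This is only a sketch: the name utf16_units is invented here, and it says nothing about which limit a given Windows API actually enforces. It uses the 'utf-16-le' codec because plain 'utf-16' prepends a two-byte byte-order mark, which would inflate the count by one.

```python
def utf16_units(text):
    """Return how many 16-bit units text occupies when encoded as UTF-16."""
    # Every UTF-16 unit is two bytes; 'utf-16-le' adds no byte-order mark.
    return len(text.encode('utf-16-le')) // 2


ascii_name = 'report.txt'
emoji_name = '\U0001F600.txt'  # one character outside the Basic Multilingual Plane

ascii_units = utf16_units(ascii_name)
emoji_units = utf16_units(emoji_name)
```

Characters outside the Basic Multilingual Plane take two units each, so emoji_name has five characters according to len() but six UTF-16 units.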
data:image/s3,"s3://crabby-images/d321f/d321fa7003d562bee34e7f927e1ab5de19f84557" alt=""
Not sure how useful this is, but I ended up writing my own "pythonic find" module: https://github.com/moloney/pathmap/blob/master/pathmap.py

I was mostly worried about minimizing stat calls, so I used scandir rather than Pathlib. The only documentation is the doc strings, but the basic idea is you can have one "matching" rule and any number of ignore/prune rules. The rules can be callables or strings that are treated as regular expressions (I suppose it might be better if the default was to treat strings as glob expressions instead...).

So for the original use case that spawned this thread, you would do something like:

    pm = PathMap(prune_rules=['/\.git$'])
    for match in pm.matches(['path/to/some/dir']):
        if not match.dir_entry.is_dir():
            print(match.path)

Or if you wanted to do something similar but only print names of python modules it would be something like:

    pm = PathMap('.+/(.+)\.py$', prune_rules=['/\.git$'])
    for match in pm.matches(['path/to/some/dir']):
        if not match.dir_entry.is_dir():
            print(match.match_info[1])
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
On Mon, Dec 28, 2015 at 2:43 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
I was intending to reply to the list :-)
well, it's a four liner, yes? but I'm not sure i agree -- the simple things should be simple. even if you can find the couple-liner in the docs, you've still got a lot more overhead than calling a ready-to-go function. and it's not like it'd be a heavy maintenance burden....
The problem isn't designing a nice walk API; it's integrating it with pathlib.*
indeed -- I'd really like to see a *walk in pathlib itself.
But first you have to solve the problem that paragraph was all about: a general-purpose walk API shouldn't be throwing away all that stat information it wasted time fetching, but the pathlib module is designed around Path objects that are always live, not snapshots. If Path.walk yields something that isn't a Path, what's the point?
OK -- you've gotten out of my technical depth now.....so I'll just shut up. But at the end of the day, if you've got the few-liner in the docs that works, maybe it's OK that it's not optimized.....
I've been trying to use pathlib whenever I need, well, a path, but then I find I almost immediately need to step out and use an os.path function, and have to string-fy it anyway -- makes me wonder what the point is..
I have the same impression as you, but, as Guido says, let's give it time before judging...
time good -- but also maybe some more work to make it easy to use with the rest of the stdlib. I will say that one thing that bugs me about the "old style" os.path functions is that I find myself stringing them together, and that gets really ugly fast:

    my_path = os.path.join(os.path.split(something)[0], something_else)

here's where an OO interface is much nicer.
And honestly, if open, os.walk, etc. aren't going to work with Path objects,
but they should -- of course they should.....
So far things have gone the opposite direction: open requires strings, but there's a Path.open method;
This sure feels to me like the wrong way to go -- too OO-heavy: create a Path object, then use it to open a file. which is why we still have the regular old open() that takes strings. I just finished teaching an intro to Python class, using py3 for the first time -- I found myself pointing students to pathlib, but then never using it in any examples, etc. That may be my old habits, but I really think we do have an ugly mix of APIs here.
walk requires strings, but people are proposing a Path.walk method; etc.
well, walk "feels" to me like a path-y operation. whereas open() does not.
I'm not sure how that's supposed to extend to things like json.load or NamedTemporaryFile.name.
exactly -- that's why open() doesn't feel path-y to me. you have all sorts of places where you might want to open a file, and you want to open other things as well. And I like APIs that let you pass in either an open file-like object, OR a path -- so it seems allowing either a Path object or a path-in-a-string would be good. so my "proposal" is to go through the stdlib and add the ability to accept a Path object everywhere a string path is accepted. (hmm -- could you simply wrap str() around the input?)
My example: one of our sysadmins wanted a little script to go through an entire drive (Windows), and check if any paths were longer than 256 characters (Windows, remember..) I came up with this:

    def get_all_paths(start_dir='/'):
        for dirpath, dirnames, filenames in os.walk(start_dir):
            for filename in filenames:
                yield os.path.join(dirpath, filename)

    too_long = []
    for p in get_all_paths('/'):
        print("checking:", p)
        if len(p) > 255:
            too_long.append(p)
            print("Path too long!")
Do you really want it to print out "Path too long!" hundreds of times?
well, not in production, no, but was nice to test -- also, in theory, there shouldn't be many!
If not, this is a lot more concise, and I think readable, with comprehensions:
    walk = os.walk(start_dir)
    files = (os.path.join(root, file) for root, dirs, files in walk for file in files)
    too_long = (file for file in files if len(file) > 255)

thanks -- should have thought of that -- though that was to pass off to a sysadmin that doesn't know much python -- harder for him to read??
yup -- probably I'd write it out to a file in the real use case. or stdout. way too wordy! I started with pathlib, but that just made it worse.
If we had a Path.walk, I don't think it could be that much better than the original version,
sure -- the wordiness comes from the fact that you have to deal with dirs and files separately.
since the only thing Path can help with is making that join a bit shorter--and at the cost of having to convert to str to check len():
maybe another argument for why Path doesn't buy much over string paths...
what I really want here is:

    too_long = (filepath for filepath in Path(root) if len(filepath) > 255)

I know python isn't a shell scripting language but it is a one liner in powershell or bash, or....
As a side note, there's no Windows restriction to 255 _characters_, it's to 255 UTF-16 code points,
IIUC, neither Windows itself nor NTFS has this restriction, but some older utilities do -- really pathetic. And I asked our sysadmin about the unicode issue, and he had no idea.
just under 64K UTF-16 code points,
how is a codepoint different than a character???? I was wondering if it was a bytes restriction or codepoint restriction?
or 255 codepage bytes, depending on which API you use.
this is where it gets ugly -- who knows what API some utility is using???
So you really want something like len(file.encode('utf-16-le')) / 2 > 255.
but can't some characters use more than 2 bytes in utf-16? or is that what you're trying to catch here?
Also, I suspect you want either the bare filename or the abspath, not the path from the start dir (especially since a path rooted at the default '/' is two characters shorter than one rooted at 'C:\',
well, the startdir would be C:\ and now I'm confused about whether the "C:\" is part of the 255-something restriction! anyway, WAY OT -- and if this is used it will be mainly to flag potential problems, not really a robust test.

Thanks,
-CHB

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
data:image/s3,"s3://crabby-images/2dd36/2dd36bc2d30d53161737124e2d8ace2b4b4ce052" alt=""
On Dec 28, 2015 2:33 PM, "Chris Barker" <chris.barker@noaa.gov> wrote:
On Tue, Dec 22, 2015 at 4:23 PM, Guido van Rossum <guido@python.org>
wrote:
The path.py .walk* APIs work great w/ fnmatch:
https://pythonhosted.org/path.py/api.html#path.Path.walk
https://pythonhosted.org/path.py/api.html#path.Path.walkdirs
https://pythonhosted.org/path.py/api.html#path.Path.walkfiles
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Dec 28, 2015, at 16:50, Wes Turner <wes.turner@gmail.com> wrote:
The path module has some major differences. First, because it doesn't use scandir or anything else to avoid multiple stat calls, the caching issue doesn't come up. Also, because its Path subclasses str, it doesn't have the same usability issues (you can pass a Path straight to json.loads, for example), although of course that gives it different usability issues (e.g., inherited methods like Path.count are an obvious attractive nuisance). Also, it doesn't handle case sensitivity as automagically. Also, it's definitely the kind of "kitchen sink" design that got PEP 355 rejected (which often makes sense for a third-party lib even when it doesn't for a stdlib module). So, not everything that makes sense for path will also make sense for pathlib. But it's still worth looking at.
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
Following up on this, in theory the right way to walk a tree using pathlib already exists, it's the rglob() method. E.g. all paths under /foo/bar should be found as follows:

    for path in pathlib.Path('/foo/bar').rglob('**/*'):
        print(path)

The PermissionError bug you found is already reported: http://bugs.python.org/issue24120 -- it even has a patch but it's stuck in review. Sadly there's another error: loops introduced by symlinks cause infinite recursion. I filed that here: http://bugs.python.org/issue26012. (The fix should be judicious use of is_symlink(), but the code is a little convoluted.) On Mon, Dec 28, 2015 at 11:25 AM, Chris Barker <chris.barker@noaa.gov> wrote:
-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Mon, Jan 4, 2016 at 9:25 PM, Guido van Rossum <guido@python.org> wrote:
Whoops, I just realized that I combined two ways of doing a recursive glob here. It should be either rglob('*') or plain glob('**/*'). What I wrote produces identical results, but at the cost of a lot of caching. :-) Note that the PEP doesn't mention rglob() -- why do we even have it? It seems rglob(pat) is exactly the same as glob('**/' + pat) (assuming os.sep is '/'). No TOOWTDI here? -- --Guido van Rossum (python.org/~guido)
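The claimed equivalence is easy to check on a throwaway tree. The sketch below builds a small temporary directory (the layout and the helper name recursive_listing_both_ways are arbitrary choices for this example) and compares the two spellings.

```python
import tempfile
from pathlib import Path


def recursive_listing_both_ways(root):
    """Return sorted results of root.rglob('*') and root.glob('**/*')."""
    root = Path(root)
    return sorted(root.rglob('*')), sorted(root.glob('**/*'))


with tempfile.TemporaryDirectory() as tmp:
    # Layout: tmp/a, tmp/a/b, tmp/a/b/f.txt
    (Path(tmp) / 'a' / 'b').mkdir(parents=True)
    (Path(tmp) / 'a' / 'b' / 'f.txt').write_text('x')
    via_rglob, via_glob = recursive_listing_both_ways(tmp)
    same = via_rglob == via_glob
```

Both calls return directories as well as files, so filtering with is_file() or is_dir() is still up to the caller.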
data:image/s3,"s3://crabby-images/a03e9/a03e989385213ae76a15b46e121c382b97db1cc3" alt=""
Note that the PEP doesn't mention rglob() -- why do we even have it? It seems rglob(pat) is exactly the same as glob('**/' + pat) (assuming os.sep is '/'). No TOOWTDI here?
Much as I believe in TOOWTDI, I like having rglob(). "**/" is the kind of magic a newbie ( like me :-) ) would have to research and understand. -CHB
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 8:37 AM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
Sure. It's too late to remove it anyway. Is there anything actionable here besides fixing the PermissionError and the behavior under symlink loops? IMO if you want files only or directories only you can just add a filter using e.g. is_dir():

    p = pathlib.Path.cwd()
    real_dirs = [p for p in p.rglob('*') if p.is_dir() and not p.is_symlink()]

-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/d321f/d321fa7003d562bee34e7f927e1ab5de19f84557" alt=""
The main issue is the lack of stat caching. That is why I wrote my own module around scandir which includes the DirEntry objects for each path so that the consumer can also do stuff with the cached stat info (like check if it is a file or directory). Often we won't need to call stat on the path at all, and if we do it will only be once. Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health Science University ________________________________ From: Python-ideas [python-ideas-bounces+moloney=ohsu.edu@python.org] on behalf of Guido van Rossum [guido@python.org] Sent: Tuesday, January 05, 2016 12:21 PM To: Chris Barker - NOAA Federal Cc: Python-Ideas Subject: Re: [Python-ideas] find-like functionality in pathlib On Tue, Jan 5, 2016 at 8:37 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov<mailto:chris.barker@noaa.gov>> wrote:
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 12:27 PM, Brendan Moloney <moloney@ohsu.edu> wrote:
I wonder if stat() caching shouldn't be made an orthogonal optional feature of Path objects somehow; it keeps coming back as useful in various cases even though we don't want to enable it by default. One problem with stat() caching is that Path objects are considered immutable, and two Path objects referring to the same path are completely interchangeable. For example, {pathlib.Path('/a'), pathlib.Path('/a')} is a set of length 1: {PosixPath('/a')}. But if we had e.g. Path('/a', cache_stat=True), the behavior of two instances of that object might be observably different (if they were instantiated at times when the contents of the filesystem was different). So maybe stat-caching Path instances should be considered unequal, or perhaps unhashable. Or perhaps they should only be considered equal if their stat() values are actually equal (i.e. if the file's stat() info didn't change). So this is a thorny issue that requires some real thought before we commit to an API. We might also want to create Path instances directly from DirEntry objects. (Interestingly, the DirEntry API seems to be a subset of the Path API, except for the .path attribute which is equivalent to the str() of a Path object.) Maybe some of this can be done first as a 3rd party module forked from the original 3rd party pathlib? https://bitbucket.org/pitrou/pathlib/ seems reasonably up to date. -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/2658f/2658f17e607cac9bc627d74487bef4b14b9bfee8" alt=""
Guido van Rossum wrote:
Maybe path.stat() could return a PathWithStat object that inherits from Path and can do everything that a Path can do, but also contains cached stat info and has a suitable set of attributes for accessing it. This would make it clear at what point in time the info is valid for, i.e. the moment you called stat(). It would also provide an obvious way to refresh the info: calling path_with_stat.stat() would give you a new PathWithStat containing updated info. Things like scandir could then return pre-populated PathWithStat objects. -- Greg
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Tue, Jan 5, 2016 at 2:59 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Well, Path.stat() is already defined and returns the same type of object that os.stat() returns, and I don't think we should change that. We could add a new method that does this, but as long as it inherits from Path it wouldn't really address the issue with objects being == to each other but holding different stat info.
I presume you are proposing a new Path.scandir() method -- the existing os.scandir() method already returns DirEntry objects which we really don't want to change at this point. -- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Wed, Jan 6, 2016 at 8:11 AM, Random832 <random832@fastmail.com> wrote:
It would have to use a weak dict so if the last reference goes away it discards the cached stats for a given path, otherwise you'd have trouble containing the cache size. And caching Path objects should still not be comparable to non-caching Path objects (which we will need to preserve the semantics that repeatedly calling stat() on a Path object created the default way will always redo the syscall). The main advantage would be that caching Path objects could be compared safely. It could still cause unexpected results. E.g. if you have just traversed some big tree using caching, and saved some results (so hanging on to some paths and hence their stat() results), and then you make some changes and traverse it again to look for something else, you might accidentally be seeing stale (i.e. cached) stat() results. Maybe there's a middle ground, where the user can create a StatCache object and pass it into Path creation and traversal operations. Paths with the same StatCache object (or both None) compare equal if their path components are equal. Paths with different StatCache objects never compare equal (but otherwise are ordered by path as usual -- the StatCache object's identity is only used when the paths are equal). Are you (or anyone still reading this) interested in implementing this idea? -- --Guido van Rossum (python.org/~guido)
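The equality rule in that middle-ground design can be sketched outside pathlib. Everything below is hypothetical: StatCache and CachingPath are illustrative stand-ins, not the prototype attached to the tracker issue and not a pathlib API, and the code uses os.fspath(), so it assumes Python 3.6 or later. Ordering is left out; only equality, hashing and the stale-snapshot behaviour are shown.

```python
import os
from pathlib import PurePath


class StatCache:
    """Hypothetical cache of os.stat() results, keyed by path string."""

    def __init__(self):
        self._results = {}

    def stat(self, path):
        key = os.fspath(path)
        if key not in self._results:
            self._results[key] = os.stat(key)  # the only real syscall
        return self._results[key]

    def forget(self):
        """Drop every snapshot so the next stat() hits the filesystem."""
        self._results.clear()


class CachingPath:
    """Hypothetical path wrapper that consults an optional StatCache."""

    def __init__(self, path, cache=None):
        self.path = PurePath(path)
        self.cache = cache

    def stat(self):
        if self.cache is None:
            return os.stat(os.fspath(self.path))  # always live
        return self.cache.stat(self.path)         # possibly a snapshot

    def __eq__(self, other):
        if not isinstance(other, CachingPath):
            return NotImplemented
        # Equal only when the path matches AND the cache is the same
        # object (or both are None), as in the design sketched above.
        return self.path == other.path and self.cache is other.cache

    def __hash__(self):
        return hash((self.path, id(self.cache)))
```

With this rule, two wrappers sharing one StatCache collapse to a single set element, while wrappers built on different caches stay distinct even for the same path. The staleness risk described above is still there: until forget() is called, stat() keeps returning the first snapshot.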
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
I couldn't help myself and coded up a prototype for the StatCache design I sketched. See http://bugs.python.org/issue26031. Feedback welcome! On my Mac it only seems to offer limited benefits though... On Wed, Jan 6, 2016 at 8:48 AM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)
data:image/s3,"s3://crabby-images/d321f/d321fa7003d562bee34e7f927e1ab5de19f84557" alt=""
It's important to keep in mind the main benefit of scandir is you don't have to do ANY stat call in many cases, because the directory listing provides some subset of this info. On Linux you can at least tell if a path is a file or directory. On Windows there is much more info provided by the directory listing. Avoiding subsequent stat calls is also nice, but not nearly as important due to OS level caching. Brendan Moloney Research Associate Advanced Imaging Research Center Oregon Health Science University ________________________________ From: Python-ideas [python-ideas-bounces+moloney=ohsu.edu@python.org] on behalf of Guido van Rossum [guido@python.org] Sent: Wednesday, January 06, 2016 2:42 PM To: Random832 Cc: Python-Ideas Subject: Re: [Python-ideas] find-like functionality in pathlib I couldn't help myself and coded up a prototype for the StatCache design I sketched. See http://bugs.python.org/issue26031. Feedback welcome! On my Mac it only seems to offer limited benefits though...
data:image/s3,"s3://crabby-images/f81c3/f81c349b494ddf4b2afda851969a1bfe75852ddf" alt=""
On Wed, Jan 6, 2016 at 3:05 PM Brendan Moloney <moloney@ohsu.edu> wrote:
+1 - this was one of the two primary motivations behind scandir. Anything trying to reimplement a filesystem tree walker without using scandir is going to have sub-standard performance. If we ever offer anything with "find like functionality" related to pathlib, it *needs* to be based on scandir. Anything else would just be repeating the convenient but untrue limiting assumptions of os.listdir: That the contents of a directory can be loaded into memory and that we don't mind re-querying the OS for stat information that it already gave us but we threw away as part of reading the directory. -gps
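A scandir-based walker along those lines might look like the sketch below. The name scan_files and the prune parameter are assumptions made for this example, and using os.scandir() as a context manager requires Python 3.6 or later. The point is that DirEntry.is_dir() can usually be answered from the directory listing itself, and that entries are consumed lazily rather than loaded into a list first.

```python
import os


def scan_files(top, prune=()):
    """Yield an os.DirEntry for every non-directory entry under top.

    Directories whose name appears in prune are never opened.
    """
    stack = [top]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() normally needs no extra stat() call here,
                # because the listing already reports the entry type.
                if entry.is_dir(follow_symlinks=False):
                    if entry.name not in prune:
                        stack.append(entry.path)
                else:
                    yield entry
```

The consumer gets entry.path and entry.name for free and can call entry.stat() only when it really needs sizes or timestamps; DirEntry caches that result, so it costs at most one system call per entry.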
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On 11 January 2016 at 18:57, Gregory P. Smith <greg@krypto.org> wrote:
This is very much why I feel that we need something in pathlib. I understand the motivation for not caching stat information in path objects. And I don't have a viable design for how a "find-like functionality" API should be implemented in pathlib. But as it stands, I feel as though using pathlib for anything that does bulk filesystem scans is deliberately choosing something that I know won't scale well. So (in my mind) pathlib doesn't fulfil the role of "one obvious way to do things". Which is a shame, because Path.rglob is very often far closer to what I need in my programs than os.walk (even when it's just rootpath.rglob('*')).

In practice, by far the most common need I have[1] for filetree walking is to get back a list of all the names of files starting at a particular directory, with the returned filenames *relative to the given root*. Path.rglob gives pathnames that still include the root. os.walk gives the directory name (also including the root) and the base filename. Neither is what I want, although obviously in both cases it's pretty trivial to extract the "relative to the root" part from the returned data. But an API that gave that information directly, with scandir-level speed and scalability, in the form of pathlib.Path relative path objects, would be ideal for me[2].

Paul

[1] And yes, I know this means I should just write a utility function for it :-)

[2] The feature creep starts when people want to control things like pruning particular directories such as '.git', or only matching particular glob patterns, or choosing whether or not to include directories in the output, or... Adding *those* features without ending up with a Frankenstein's monster of an API is the challenge :-)
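The utility function mentioned in footnote [1] might look something like the sketch below. The name `iter_relative_files` is hypothetical. It leans on `os.walk`, which has been implemented on top of `os.scandir` since Python 3.5, and strips the root prefix with `Path.relative_to`.

```python
import os
from pathlib import Path


def iter_relative_files(root):
    """Yield every file under root as a Path relative to root.

    Hypothetical helper, not part of pathlib. Directories themselves
    are not yielded, only the files inside them.
    """
    root = Path(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        # dirpath always starts with root, so relative_to() cannot fail.
        rel_dir = Path(dirpath).relative_to(root)
        for name in filenames:
            yield rel_dir / name
```

With a tree containing `a/b.txt` and `c.txt`, `sorted(iter_relative_files(root))` gives `Path('a/b.txt')` and `Path('c.txt')` regardless of whether `root` itself was given as an absolute or relative path.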
data:image/s3,"s3://crabby-images/3c3b2/3c3b2a6eec514cc32680936fa4e74059574d2631" alt=""
On Mon, Jan 11, 2016 at 10:57 AM, Gregory P. Smith <greg@krypto.org> wrote:
And we already have this in the form of pathlib's [r]glob() methods. There's a patch to the glob module in http://bugs.python.org/issue25596 and as soon as that's committed I hope that its author(s) will work on doing a similar patch for pathlib's [r]glob (tracking this in http://bugs.python.org/issue26032). -- --Guido van Rossum (python.org/~guido)
participants (15)
- Andrew Barnert
- Brendan Moloney
- Chris Barker
- Chris Barker - NOAA Federal
- Erik
- Greg Ewing
- Gregory P. Smith
- Guido van Rossum
- Laura Creighton
- Paul Moore
- Ram Rachum
- Random832
- Sven R. Kunze
- Thomas Güttler
- Wes Turner