[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)

Guido van Rossum guido at python.org
Mon Nov 25 00:22:19 CET 2013


On Sun, Nov 24, 2013 at 3:04 PM, Ben Hoyt <benhoyt at gmail.com> wrote:

> > Right now, pathlib doesn't cache. Guido decided it was safer to start
> > off like that, and perhaps later we can add some optional caching.
> >
> > One reason caching didn't go in is that it's not clear which API is
> > best. Working on plugging scandir() into pathlib would actually help
> > choosing a stat-caching API.
> >
> > (or, rather, lstat-caching...)
> >
> >> The other related thing is that DirEntry only provides .lstat(),
> >> because it's providing stat-like info without following links.
> >
> > Path.is_dir() and friends use stat(), i.e. they inform you about
> > whether a symlink's target is a directory (not the symlink itself).  Of
> > course, if the DirEntry says the path is a symlink, Path.is_dir() could
> > then run stat() to find out about the target.
> >
> > Do you plan to propose scandir() for inclusion in the stdlib?
>
> Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry
> objects" for inclusion into the stdlib, and also speed up os.walk() as
> a result.
>
> However, pathlib's API, with .is_dir(), .lstat(), etc., is so close to
> DirEntry that I'd be much keener to roll the scandir functionality into
> pathlib's iterdir(), as that's already going in the standard library,
> and iterdir() already returns Path objects.
>
> I'm just not sure it's possible or useful without stat caching.
>
> We could do Path.lstat(cached=True), but we'd also really want
> is_dir(cached=True), so that API kinda sucks. Alternatively you could
> have iterdir(cached=True) return PathWithCachedStat style objects --
> probably better, but kinda messy.
>
> For these reasons, I would much prefer stat caching on by default in
> Path -- in my experience, the cached behaviour is desired much much
> more often than the non-cached. I've written directory walkers more
> often than I can count, whereas I've maybe only once written a
> long-running process that needs to re-stat, and if it's clearly
> documented as cached, then it's super easy to call restat(), or create
> a new Path instance to get new stat info.
>
> This would allow iterdir() to take advantage of the huge performance
> improvements you can get when walking directories.
>
> Guido, are you at all open to reconsidering the uncached-by-default in
> light of this?
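
(For readers following the performance argument above: a minimal sketch
of the two ways to pick out subdirectories. It uses os.scandir() in the
form that later shipped in Python 3.5, which was not available when this
thread was written. The function names are hypothetical, chosen only for
illustration.)

```python
import os


def subdirs_with_stat(top):
    """List subdirectory names, paying one stat() call per entry.

    This is roughly what an uncached is_dir() costs on every entry.
    """
    return sorted(
        name for name in os.listdir(top)
        if os.path.isdir(os.path.join(top, name))
    )


def subdirs_with_scandir(top):
    """List subdirectory names using DirEntry.is_dir().

    On most platforms the entry type comes back with the directory
    read itself, so no extra system call is needed per entry.
    """
    with os.scandir(top) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())
```

Both functions follow symlinks by default, so they should agree on
which entries count as directories; only the number of system calls
differs.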


I think we should think hard and deep about all the consequences. I was
initially in favor of stat caching, but during offline review of PEP 428
Nick pointed out that there are too many different ways to do stat caching,
and convinced me that it would be wrong to rush it. Now that beta 1 is out
I really don't want to reconsider this -- we really need to stick to the
plan.

The ship has likewise sailed for adding scandir() (whether to os or
pathlib). By all means experiment and get it ready for consideration for
3.5, but I don't want to add it to 3.4.

In general I think there are some tough choices regarding stat caching. You
already brought up stat vs. lstat -- there's also the issue of what to do
if [l]stat fails -- do we cache the exception?
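
To make those choices concrete, here is a rough sketch of one possible
design, not a proposal and not anything in pathlib. CachedStatPath and
restat() are hypothetical names. This version keeps stat() and lstat()
results in separate slots and does cache a failure, which is only one of
several defensible answers.

```python
import os


class CachedStatPath:
    """Hypothetical wrapper that caches stat() and lstat() separately."""

    def __init__(self, path):
        self.path = path
        # Maps the follow_symlinks flag to ("ok", stat_result)
        # or ("err", exception). Assumption: failures are cached too.
        self._cache = {}

    def _stat(self, follow_symlinks):
        if follow_symlinks not in self._cache:
            try:
                result = os.stat(self.path, follow_symlinks=follow_symlinks)
                self._cache[follow_symlinks] = ("ok", result)
            except OSError as exc:
                self._cache[follow_symlinks] = ("err", exc)
        kind, value = self._cache[follow_symlinks]
        if kind == "err":
            raise value
        return value

    def stat(self):
        """Cached stat of the symlink's target."""
        return self._stat(True)

    def lstat(self):
        """Cached stat of the path itself, without following symlinks."""
        return self._stat(False)

    def restat(self):
        """Drop all cached results so the next call hits the filesystem."""
        self._cache.clear()
```

With this design a path that did not exist at the first lstat() keeps
raising until restat() is called, even if the file has since been
created. Whether that is a feature or a trap is exactly the kind of
question that needs more thought.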

IMO, the current incarnation is for convenience, correctness and
cross-platform semantics -- three C's. The next incarnation can add a
fourth C, caching.

-- 
--Guido van Rossum (python.org/~guido)