[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)

Nick Coghlan ncoghlan at gmail.com
Mon Nov 25 00:18:56 CET 2013


On 25 Nov 2013 09:07, "Ben Hoyt" <benhoyt at gmail.com> wrote:
>
> > Right now, pathlib doesn't cache. Guido decided it was safer to start
> > off like that, and perhaps later we can add some optional caching.
> >
> > One reason caching didn't go in is that it's not clear which API is
> > best. Working on plugging scandir() into pathlib would actually help
> > in choosing a stat-caching API.
> >
> > (or, rather, lstat-caching...)
> >
> >> The other related thing is that DirEntry only provides .lstat(),
> >> because it's providing stat-like info without following links.
> >
> > Path.is_dir() and friends use stat(), i.e. they inform you about
> > whether a symlink's target is a directory (not the symlink itself).  Of
> > course, if the DirEntry says the path is a symlink, Path.is_dir() could
> > then run stat() to find out about the target.
> >
> > Do you plan to propose scandir() for inclusion in the stdlib?
>
> Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry
> objects" for inclusion into the stdlib, and also speed up os.walk() as
> a result.
>
> However, pathlib's API, with .is_dir(), .lstat(), etc., is so close to
> DirEntry that I'd be much keener to roll up the scandir functionality into
> pathlib's iterdir(), as that's already going in the standard library,
> and iterdir() already returns Path objects.
>
> I'm just not sure it's possible or useful without stat caching.
>
> We could do Path.lstat(cached=True), but we'd also really want
> is_dir(cached=True), so that API kinda sucks. Alternatively you could
> have iterdir(cached=True) return PathWithCachedStat style objects --
> probably better, but kinda messy.
>
> For these reasons, I would much prefer stat caching on by default in
> Path -- in my experience, the cached behaviour is desired much much
> more often than the non-cached. I've written directory walkers more
> often than I can count, whereas I've maybe only once written a
> long-running process that needs to re-stat, and if it's clearly
> documented as cached, then it's super easy to call restat(), or create
> a new Path instance to get new stat info.
>
> This would allow iterdir() to take advantage of the huge performance
> improvements you can get when walking directories.
>
> Guido, are you at all open to reconsidering the uncached-by-default in
> light of this?

No, caching on the object is dangerously unintuitive - it means two Path
objects can compare equal, but give different answers for stat-dependent
queries.
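
To make the hazard concrete, here is a minimal sketch. CachingPath is a
hypothetical wrapper invented for illustration (it is not part of pathlib,
and not a proposed API); it simply remembers the first lstat() result on
the instance. Two instances created at different times compare equal yet
report different sizes for the same file.

```python
import os
import tempfile
from pathlib import Path


class CachingPath:
    """Hypothetical wrapper (not part of pathlib): caches lstat() per object."""

    def __init__(self, path):
        self._path = Path(path)
        self._lstat = None  # filled in on the first lstat() call

    def __eq__(self, other):
        # Equality looks only at the path, never at the cached stat data.
        return isinstance(other, CachingPath) and self._path == other._path

    def __hash__(self):
        return hash(self._path)

    def lstat(self):
        if self._lstat is None:
            self._lstat = self._path.lstat()
        return self._lstat


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.txt")
        with open(target, "w") as f:
            f.write("a")            # file is 1 byte long

        first = CachingPath(target)
        first.lstat()               # caches the 1-byte result

        with open(target, "w") as f:
            f.write("abcdef")       # file is now 6 bytes long

        second = CachingPath(target)
        # Equal objects, different answers to the same stat-dependent query.
        return first == second, first.lstat().st_size, second.lstat().st_size


print(demo())
```

The stale answer from the first object is only discoverable by knowing when
each instance was created, which is exactly the kind of hidden state that
makes per-object caching surprising.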

A global string (or Path) keyed cache (rather than a per-object cache)
would actually be a safer option, since it would ensure distinct path
objects always gave the same answer. That's the approach I will likely
pursue at some point in walkdir.

It's also quite likely the "rich stat object" API will be pursued for 3.5,
which is a much safer approach to stat result caching than trying to embed
it directly in pathlib.Path objects.

That's why we decided to punt on the caching question until 3.5 - it's
better to provide a predictable building block that doesn't provide
caching, and then work out how to provide a sensible caching layer on top
of that, rather than trying to rush a potentially flawed caching design
that leads to inconsistent behaviour.

Cheers,
Nick.

>
> -Ben