[Python-Dev] pathlib and issue 11406 (a directory iterator returning stat-like info)

Ben Hoyt benhoyt at gmail.com
Mon Nov 25 00:04:28 CET 2013


> Right now, pathlib doesn't cache. Guido decided it was safer to start
> off like that, and perhaps later we can add some optional caching.
>
> One reason caching didn't go in is that it's not clear which API is
> best. Working on plugging scandir() into pathlib would actually help
> choosing a stat-caching API.
>
> (or, rather, lstat-caching...)
>
>> The other related thing is that DirEntry only provides .lstat(),
>> because it's providing stat-like info without following links.
>
> Path.is_dir() and friends use stat(), i.e. they inform you about
> whether a symlink's target is a directory (not the symlink itself).  Of
> course, if the DirEntry says the path is a symlink, Path.is_dir() could
> then run stat() to find out about the target.
>
> Do you plan to propose scandir() for inclusion in the stdlib?

Yes, I was hoping to propose adding "os.scandir() -> yields DirEntry
objects" for inclusion into the stdlib, and also speed up os.walk() as
a result.
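
To make that concrete, here is a minimal sketch of the kind of walker this
enables. It is written against the os.scandir() that ships in Python 3.5 and
later, which may differ in detail from what is being proposed here, and
walk_dirs is a made-up name for illustration only. The point is that the
directory/file decision comes from the directory listing itself where the OS
provides it, rather than from a separate stat call per entry.

```python
import os


def walk_dirs(top):
    """Yield (dirpath, dirnames, filenames), like a simplified os.walk().

    Hypothetical helper for illustration; not the actual os.walk() code.
    """
    dirnames, filenames = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # follow_symlinks=False reports on the entry itself (lstat-like),
            # so a symlink to a directory is listed as a file here.
            if entry.is_dir(follow_symlinks=False):
                dirnames.append(entry.name)
            else:
                filenames.append(entry.name)
    yield top, dirnames, filenames
    # Recurse only after the listing for this directory has been closed.
    for name in dirnames:
        yield from walk_dirs(os.path.join(top, name))
```

Usage would look like "for dirpath, dirnames, filenames in walk_dirs('.')",
the same shape as os.walk() but without its topdown/onerror/followlinks
options.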

However, pathlib's API, with .is_dir(), .lstat() etc., is so close to
DirEntry's that I'd be much keener to roll the scandir functionality into
pathlib's iterdir(), as that's already going into the standard library,
and iterdir() already returns Path objects.

I'm just not sure it's possible or useful without stat caching.

We could do Path.lstat(cached=True), but we'd also really want
is_dir(cached=True), so that API kinda sucks. Alternatively you could
have iterdir(cached=True) return PathWithCachedStat style objects --
probably better, but kinda messy.

For these reasons, I would much prefer stat caching on by default in
Path -- in my experience, the cached behaviour is desired much much
more often than the non-cached. I've written directory walkers more
often than I can count, whereas I've maybe only once written a
long-running process that needs to re-stat. And if it's clearly
documented as cached, then it's super easy to call restat(), or create
a new Path instance to get new stat info.
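
As a rough sketch of the semantics I mean (CachedStatPath and its methods
are hypothetical names, not pathlib's API): lstat() hits the filesystem once
and remembers the result, a directory iterator can pre-fill that result from
whatever it already got from the OS, and restat() is the explicit way to
refresh. Note that is_dir() here is lstat-based for simplicity, so it does
not follow symlinks the way Path.is_dir() does.

```python
import os
import stat


class CachedStatPath:
    """Hypothetical path wrapper whose lstat() result is cached by default."""

    def __init__(self, path, lstat_result=None):
        self.path = os.fspath(path)
        # A directory iterator could pass in stat info it already has,
        # so no extra system call is needed later.
        self._lstat = lstat_result

    def lstat(self):
        # Only touch the filesystem on the first call.
        if self._lstat is None:
            self._lstat = os.lstat(self.path)
        return self._lstat

    def is_dir(self):
        # Reuses the cached lstat result; does not follow symlinks.
        return stat.S_ISDIR(self.lstat().st_mode)

    def restat(self):
        # Drop the cached value and fetch fresh info from the filesystem.
        self._lstat = None
        return self.lstat()
```

A long-running process that cares about changes would call p.restat() (or
build a new object) at the point where it needs fresh info; everything else
gets the cached answer for free.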

This would allow iterdir() to take advantage of the huge performance
improvements you can get when walking directories.

Guido, are you at all open to reconsidering the uncached-by-default
decision in light of this?

-Ben

