[Python-Dev] PEP 428: stat caching undesirable?

Nick Coghlan ncoghlan at gmail.com
Thu May 2 00:02:53 CEST 2013

On 2 May 2013 02:22, "Christian Heimes" <christian at python.org> wrote:
> On 01.05.2013 16:39, Guido van Rossum wrote:
> > I've not got the full context, but I would like to make it *very*
> > clear in the API (e.g. through naming of the methods) when you are
> > getting a possibly cached result from stat(), and I would be very
> > concerned if existing APIs were going to get caching behavior. For
> > every use case that benefits from caching there's a complementary use
> > case that caching breaks. Since both use cases are important we must
> > offer both APIs, in a way that makes it clear to even the casual
> > reader of the code what's going on.
> I consider caching of stat calls problematic. Whether a stat() result is
> correct and current has security implications, too. For example, stat()
> is used to prevent TOCTOU race conditions such as [1].
> Caching is useful but I would prefer explicit caching rather than
> implicit and automatic caching of stat() results.
> We can get a greater speed up for walkdir() without resorting to
> caching, too. Some operating systems and file systems report the file
> type in the dirent struct that is returned by readdir(). This reduces
> the number of stat calls to zero.
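
For readers coming to this thread later: Python 3.5 and newer expose the
directory entry information through os.scandir(). The sketch below is only
an illustration of the point above, not code from this thread;
split_entries is a hypothetical helper name. DirEntry.is_dir() answers from
the directory entry when the platform reports the file type and falls back
to a stat call otherwise.

```python
import os


def split_entries(top):
    """Split the entries of `top` into directory names and other names.

    Hypothetical helper for illustration. The `with` form of os.scandir
    needs Python 3.6 or newer.
    """
    dirs, files = [], []
    with os.scandir(top) as entries:
        for entry in entries:
            # follow_symlinks=False: symlinks to directories count as
            # non-directories here, so no extra stat is needed to resolve them.
            if entry.is_dir(follow_symlinks=False):
                dirs.append(entry.name)
            else:
                files.append(entry.name)
    return sorted(dirs), sorted(files)
```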

While I agree exposing dirent in some manner is desirable, note that I'm
not talking about os.walk itself, but the generator pipeline library I
built around it in an attempt to break up monolithic directory walking
loops into reusable components. Once you get out of the innermost
generator, the only state passed through each stage is the path information
(and the directory descriptor if using os.fwalk).
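
To make the shape of such a pipeline concrete, here is a minimal sketch. It
is not walkdir's actual API: exclude_dirs, include_files and file_paths are
names assumed for this illustration. Each stage consumes and yields the same
(dirpath, dirnames, filenames) tuples that os.walk produces, so path
information is the only state that flows between stages.

```python
import fnmatch
import os


def exclude_dirs(walk_iter, *patterns):
    """Stage: prune subdirectories whose names match any pattern."""
    for dirpath, dirnames, filenames in walk_iter:
        # Edit the list in place so a top-down os.walk skips pruned dirs.
        dirnames[:] = [d for d in dirnames
                       if not any(fnmatch.fnmatch(d, p) for p in patterns)]
        yield dirpath, dirnames, filenames


def include_files(walk_iter, *patterns):
    """Stage: keep only file names that match at least one pattern."""
    for dirpath, dirnames, filenames in walk_iter:
        kept = [f for f in filenames
                if any(fnmatch.fnmatch(f, p) for p in patterns)]
        yield dirpath, dirnames, kept


def file_paths(walk_iter):
    """Final stage: flatten the walk tuples into full file paths."""
    for dirpath, _dirnames, filenames in walk_iter:
        for name in filenames:
            yield os.path.join(dirpath, name)
```

A caller would chain the stages, for example
file_paths(include_files(exclude_dirs(os.walk(top), ".git"), "*.py")),
and act on each path in its own outer loop.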

Upgrading walkdir from simple strings to path objects would be relatively
straightforward, but you can't change the API too much before it isn't
similar to os.walk any more.

The security issues only come into play in the outer loop which actually
tries to *do* something with the pipeline output. However, even that case
should involve at most two stat calls: one inside the pipeline (cached per
iteration) and then a more timely one in the outer loop (unless using
os.fwalk rather than os.walk as the base loop already covers it).
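
A rough sketch of that two-stat pattern, with the caching made explicit in
the method names. Entry, cached_stat, fresh_stat, regular_files and
larger_than are all hypothetical names for this illustration and are not
part of PEP 428 or walkdir. Pipeline stages share one cached result per
entry, while the outer loop asks for a fresh one before acting.

```python
import os
import stat


class Entry:
    """Hypothetical per-iteration record: a path plus one cached stat."""

    def __init__(self, path):
        self.path = path
        self._cached = None

    def cached_stat(self):
        # First call hits the filesystem; later calls reuse the result.
        if self._cached is None:
            self._cached = os.stat(self.path)
        return self._cached

    def fresh_stat(self):
        # Always hits the filesystem; use this just before acting.
        return os.stat(self.path)


def regular_files(paths):
    """Stage: wrap each path and keep only regular files."""
    for path in paths:
        entry = Entry(path)
        if stat.S_ISREG(entry.cached_stat().st_mode):
            yield entry


def larger_than(entries, size):
    """Stage: reuse the cached stat from the earlier stage."""
    for entry in entries:
        if entry.cached_stat().st_size > size:
            yield entry
```

The outer loop would then iterate over larger_than(regular_files(paths), 0)
and call entry.fresh_stat() immediately before doing anything
security-sensitive with the file.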


> Christian
> [1]