[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Wed Nov 14 20:00:49 CET 2012

On Thu, Nov 15, 2012 at 7:43 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 14 Nov 2012 13:20:31 -0500
> random832 at fastmail.us wrote:
>> On Wed, Nov 14, 2012, at 5:52, Antoine Pitrou wrote:
>> > This assumes directory entries are sorted by inode number (in a btree,
>> > I imagine). Is this assumption specific to some Linux / Ubuntu
>> > filesystem?
>>
>> I think he was proposing listing the whole directory in advance (which
>> os.walk already does), sorting it, and then looping over it calling
>> stat.
>
> But I don't understand why sorting (by inode? by name?) would make
> stat() calls faster. That's what I'm trying to understand.

Less head seeks.

Consider 1000 files in a directory. They will likely be grouped on the
disk for locality of reference (e.g. same inode group) though mv's and
other operations mean this is only 'likely' not 'a given'.

when you do 1000 stat calls in a row, the kernel is working with a
command depth of 1, so it can't elevator seek optimise the work load.
If it sees linear IO it will trigger readahead, but thats less likely.
What the sort by inode does is mean that the disk head never needs to
seek backwards, so you get a single outward pass over the needed
sectors.

This is a widely spread technique:
http://git.661346.n2.nabble.com/RFC-HACK-refresh-index-lstat-in-inode-order-td7347768.html
is a recent recurrence of the discussion, with sources cited.

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Cloud Services