os.walk restart

Wed Mar 17 23:49:58 EDT 2010

On Mar 17, 3:04 pm, Keir Vaughan-taylor <kei... at gmail.com> wrote:
> I am traversing a large set of directories using
>
> for root, dirs, files in os.walk(basedir):
>     run program
>
> Being a huge directory set the traversal is taking days to do a
> traversal.
> Sometimes it is the case there is a crash because of a programming
> error.
> As each directory is processed the name of the directory is written to
> a file
> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?

I assume it's the operation that you are doing on each file that is
expensive, not the walk itself.

If that's the case, then you might be able to get away with just
leaving some kind of breadcrumbs whenever you've successfully
processed a directory or a file, so you can quickly short-circuit
entire directories or files on the next run, without having to
implement any kind of complicated start-where-I-left-off
algorithm.

The breadcrumbs could be hidden files in the file system, or an
easily indexed list of files that you persist, etc.
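
Here is a rough sketch of the breadcrumb idea, assuming the breadcrumbs
are one directory name per line in a log file (which is close to what
you say you already write). The names walk_with_breadcrumbs, log_path
and process_dir are made up for illustration; process_dir stands in for
whatever your program does per directory.

```python
import os


def load_done(log_path):
    """Return the set of directories recorded as finished by earlier runs."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as log:
        return {line.rstrip("\n") for line in log}


def walk_with_breadcrumbs(basedir, log_path, process_dir):
    """Walk basedir, skipping any directory already listed in log_path.

    process_dir is a hypothetical callback taking (root, files).
    """
    done = load_done(log_path)
    with open(log_path, "a") as log:
        for root, dirs, files in os.walk(basedir):
            if root in done:
                # Already processed; os.walk still descends into its subdirs.
                continue
            process_dir(root, files)
            # Record the breadcrumb only after the work succeeded.
            log.write(root + "\n")
            log.flush()
```

The walk itself is repeated on every restart, but a set lookup per
directory is cheap next to the real work, so a crash only costs you the
directory that was in flight.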

What are you doing that takes so long?

Also, I can understand why the operations on the files themselves
might crash, but can't you catch an exception and keep on chugging?
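
Something along these lines is what I have in mind, as a sketch only.
process_file is a placeholder for your per-file operation, and
walk_and_continue is just a name I picked.

```python
import logging
import os


def walk_and_continue(basedir, process_file):
    """Process every file under basedir, logging failures instead of dying.

    process_file is a hypothetical callback taking a file path.
    Returns the list of paths that raised an exception.
    """
    failed = []
    for root, dirs, files in os.walk(basedir):
        for name in files:
            path = os.path.join(root, name)
            try:
                process_file(path)
            except Exception:
                # Log the traceback and move on to the next file.
                logging.exception("failed on %s", path)
                failed.append(path)
    return failed
```

The returned list gives you a short set of files to look at and retry
once the bug is fixed, instead of redoing the whole traversal.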

Another option, if you do not do some kind of pruning on the fly, is
to write the list of files that you need to process to a file or a
database up front, and persist the index of the last successfully
processed file, so that you can restart as needed from where you left
off.
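
A minimal sketch of that approach, assuming plain text files for both
the work list and the index, and assuming no file name contains a
newline. build_worklist, resume, list_path and index_path are all
hypothetical names, and process_file again stands in for your own
operation.

```python
import os


def build_worklist(basedir, list_path):
    """Write every file path under basedir to list_path, one per line."""
    with open(list_path, "w") as out:
        for root, dirs, files in os.walk(basedir):
            dirs.sort()  # make the traversal order repeatable
            for name in sorted(files):
                out.write(os.path.join(root, name) + "\n")


def resume(list_path, index_path, process_file):
    """Process the listed files, starting after the last one that succeeded."""
    start = 0
    if os.path.exists(index_path):
        with open(index_path) as f:
            start = int(f.read().strip() or 0)
    with open(list_path) as f:
        paths = [line.rstrip("\n") for line in f]
    for i in range(start, len(paths)):
        process_file(paths[i])
        # Store the count of finished files; write to a temp file and
        # rename so a crash cannot leave a half-written index behind.
        tmp_path = index_path + ".tmp"
        with open(tmp_path, "w") as f:
            f.write(str(i + 1))
        os.replace(tmp_path, index_path)
```

You would call build_worklist once, then call resume after every crash.
The list is only as fresh as the moment it was built, so this fits best
when the directory tree is not changing underneath you.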