New file utility for shutil - linktree.py
macquigg at cadence.com
Thu Jan 2 21:13:18 CET 2003
Thanks for your feedback on this proposal. We could modify copy_tree to add a 'link' argument, but that still is not what we need. The resulting 'newtree' would contain a link for every file in the oldtree. The 'linktree' function produces a 'newtree' with only the minimum links necessary to access all the old files. That is, it doesn't descend into directories where there are no new files, but simply relies on the link to the old parent directory to access all the old files on that branch.
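The behavior described above can be sketched roughly as follows. This is not the linktree.py from this thread, only a minimal illustration under some assumptions: the function signature, the `log` parameter, and the use of absolute link targets are all hypothetical choices. The sketch walks the new (patch) tree only, and in each directory that also exists in the old tree it adds a link for every old entry the new tree lacks.

```python
import os

def linktree(oldtree, newtree, log=print):
    """Fill newtree with symlinks to everything it lacks from oldtree.

    Hypothetical sketch: only directories that exist in newtree are
    visited, so an old branch with no new files gets a single link.
    """
    oldtree = os.path.abspath(oldtree)
    newtree = os.path.abspath(newtree)
    # os.walk does not follow symlinks by default, so the links created
    # below are never descended into.
    for newdir, _subdirs, _files in os.walk(newtree):
        rel = os.path.relpath(newdir, newtree)
        olddir = os.path.normpath(os.path.join(oldtree, rel))
        if not os.path.isdir(olddir):
            # Directory exists only in the patch tree: nothing to link.
            continue
        for name in sorted(os.listdir(olddir)):
            link = os.path.join(newdir, name)
            if os.path.lexists(link):
                # Already in the new tree: a patched file, or a
                # directory that the walk will visit later.
                continue
            target = os.path.join(olddir, name)
            # Assumption: absolute targets. The old entry is linked as
            # is, even when it is itself a link.
            os.symlink(target, link)
            log(f"{link} -> {target}")
```

Called as `linktree("IC50_sun4v", "IC50_sun4v_patch014")`, a function like this would leave the patch files untouched and log one line per link added, which is the shape of result described in the rest of this message.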
This doesn't make much difference for small trees, where you could even use shutil.copytree(), but it makes a big difference on a typical commercial distribution tree. For example, our latest 'IC50_sun4v' distribution tree takes 2741MB on disk, and has thousands of files, directories and links. I just received a patch tree 'IC50_sun4v_patch014' with nothing but a few patch files and the directories leading down to where those files are located. The patch tree (14MB) took only 2 minutes to download over my DSL line.
I then ran my Python command:
unix> linktree IC50_sun4v IC50_sun4v_patch014 > logfile
and in three seconds I had a new working tree with the new patches in place, and links to the oldtree for everything else. The new tree is not much larger than the 14MB patch. I also have a 15KB logfile with a line for every link added. A quick review of the logfile shows about 200 new links, and one file left as is. This is more links than normal, because I deliberately chose a patch file deeply buried in a directory with hundreds of other files.
I tried running distutils.dir_util.copy_tree() for a comparison, but it chokes on open links in the oldtree (and we can expect a few of those in any distribution this size). I'm sure that could be fixed, but the fundamental question is, do we want to enhance copy_tree() or provide a new linktree() function? I really dislike the fact that we have so much confusion in these file functions, and they are scattered in so many places (os, os.path, shutil, distutils, ???). I hesitate to add yet another function, but I think in this case 'linktree' is conceptually different from 'copytree', and it will be less confusing for users to have a distinct function.
It will also be less confusing for anyone who has to work with the source code of these functions. Right now, copy_tree() is big, but still easy to follow. Mucking it up with logic to scan the dst directory instead of the src, block options which don't apply, etc., will put it over the top. (I could be missing something here, so anyone with ideas on how to merge these two functions, please speak up.)
I think one reason we don't have a simple set of universal file functions is that there are so many variations that no simple set will satisfy everyone. Someone writes a copytree function using a horrible combination of 'tar' commands, and it doesn't work on other platforms, or even on the same platform if the path is set for a different version of tar. Even with everything set up just right, the tar command doesn't copy the timestamps. So we have endless variations on these commands, some built on 'cpio', some on 'pax', some others on 'gnutar'.
Python can change all this. In fact we are 90% there with what is in the various modules I've looked at. We just need to put the most useful stuff in one place, add a few minor enhancements, and resist the temptation to do everything. Python has the unique advantage that well-written scripts are easy to modify. So if we leave out some option that a user needs, he can easily add it.
On 'linktree', for example, I have chosen to make links to links, rather than following those links and linking to their destinations. Other legitimate choices include making the new link point to an equivalent place in the new tree. Rather than provide options for all those possibilities, I think it is better to choose the one which I think will be most useful, and keep the script so simple that anyone can modify it to their liking.
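To make the first two of those choices concrete, here is a small sketch. The helper name `link_target` and its `follow` flag are hypothetical and not part of linktree.py. It shows only how the target of a new link would differ between linking to the old link itself and linking to its final destination; the third choice (pointing into the new tree) is not covered.

```python
import os

def link_target(old_path, follow=False):
    """Return the path a new link should point to (hypothetical helper)."""
    if follow:
        # Resolve the whole chain of links and point at the final
        # destination.
        return os.path.realpath(old_path)
    # Link-to-link: point at the old entry itself, even when it is a
    # link, which is the choice described above.
    return os.path.abspath(old_path)
```

With `follow=False` the new tree keeps working if the old link is later repointed. With `follow=True` it keeps pointing at whatever the old link resolved to when the new tree was built.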
I'm putting together a 'my_file_utils' module, which will import all the good stuff from the other modules, add 'linktree', and add some options to os.path.exists(). Eventually, I'll submit this as a patch, but I want to play with it a while, and make sure I'm happy with everything. Is there a place to share this code before it is ready for submission as a patch?
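The message does not say which options os.path.exists() would gain. One plausible candidate, given the broken symlinks mentioned in the original request, is telling a dangling link apart from a path that is simply missing, since os.path.exists() returns False for both. A minimal sketch, with the function name `classify` and its return strings chosen here purely for illustration:

```python
import os

def classify(path):
    """Return 'missing', 'present', 'link', or 'broken link' for path."""
    if os.path.islink(path):
        # os.path.exists() follows the link, so it is False when the
        # link's destination does not exist.
        return "link" if os.path.exists(path) else "broken link"
    return "present" if os.path.exists(path) else "missing"
```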
Wish I had more time to work on this. Now, I've got to get back to my "day job".
> -----Original Message-----
> From: Greg Ward [mailto:gward at python.net]
> Sent: Monday, December 30, 2002 7:43 PM
> To: David MacQuigg
> Cc: python-list at python.org; Michael Hudson; Mark Lutz
> Subject: Re: New file utility for shutil - linktree.py
> On 30 December 2002, David MacQuigg said:
> > I need a 'linktree' utility to create a "mock hierarchy" for testing
> > patches to our mongo software distributions (~2GB, 10,000 files,
> > hundreds of dirs and symlinks, some of which are broken). I found
> > some utilities in 'shutil.py', but no good. 'copytree' copies the
> > entire 2GB hierarchy! The mock hierarchy should have just the new
> > patch files, and links for everything else.
> Look closely at distutils.file_util.copy_file() and
> distutils.dir_util.copy_tree(); I think they're pretty close to what you
> want. The only obvious deficiency is that copy_tree() doesn't have a
> 'link' argument like copy_file() does. That should be easy to add --
> submit a patch, do *not* assign it to me, argue for it on
> and hopefully someone will check it in in time for Python 2.3. Or you
> can just steal the code...
> Greg Ward <gward at python.net>
> Laziness, Impatience, Hubris.