program to generate data helpful in finding duplicate large files

Steven D'Aprano steve+comp.lang.python at
Fri Sep 19 07:45:51 CEST 2014

David Alban wrote:

> *#!/usr/bin/python*
> *import argparse*
> *import hashlib*
> *import os*
> *import re*
> *import socket*
> *import sys*

Um, how did you end up with leading and trailing asterisks? That's going to
stop your code from running.

> *from stat import **

"import *" is slightly discouraged. It's not that it's bad, per se; it's
mostly designed for use at the interactive interpreter, and it can lead to
a few annoyances (such as silently shadowing names you have already
defined) if you don't know what you are doing. So be careful of using it
when you don't need to.
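
As a rough sketch of the alternative, you can import the module itself and
use qualified names, so it stays obvious where each name comes from. The
temporary file is only a stand-in so the snippet runs on its own; in your
script you would use your own file_path.

```python
import os
import stat
import tempfile

# Stand-in file so the example is self-contained (an assumption, not
# part of the original script).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    file_path = tmp.name

st = os.stat(file_path)

# Qualified names: stat.S_ISREG rather than a bare S_ISREG pulled in by
# "from stat import *".
is_regular = stat.S_ISREG(st.st_mode)
size = st.st_size

os.remove(file_path)
```

The stat result also exposes st_dev, st_ino and st_nlink as attributes, so
you may not need the constants from the stat module at all.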

> *start_directory = re.sub( '/+$', '', args.start_directory )*

I don't think you need to do that, and you certainly don't need to pull out
the nuclear-powered bulldozer of regular expressions just to crack the
peanut of stripping trailing slashes from a string.

start_directory = args.start_directory.rstrip("/")

ought to do the job.
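
If you want to convince yourself the two spellings agree, here is a quick
check. The helper name strip_trailing_slashes and the sample paths are made
up for illustration.

```python
import re

def strip_trailing_slashes(path):
    # Remove every "/" at the end of the string, and nothing else.
    return path.rstrip("/")

# Compare against the regular expression version on a few sample paths.
for path in ["/var/tmp///", "/var/tmp", "relative/dir/", "/"]:
    assert strip_trailing_slashes(path) == re.sub("/+$", "", path)
```

Note that both versions turn a bare "/" into the empty string, so if the
root directory is a legal starting point you may want to special-case it.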

> *    f = open( file_path, 'r' )*
> *    md5sum = md5_for_file( f )*

You never close the file, which means Python will close it for you, when it
is good and ready. In the case of some Python implementations, that might
not be until the interpreter shuts down, which could mean that you run out
of file handles!

Better is to explicitly close the file:

    f = open(file_path, 'r')
    md5sum = md5_for_file(f)
    f.close()

or if you are using a recent version of Python and don't need to support
Python 2.4 or older:

    with open(file_path, 'r') as f:
        md5sum = md5_for_file(f)

(The "with" block automatically closes the file when you exit the indented
block.)
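
Your md5_for_file function isn't shown, so the following is only a sketch
of how the open, the hashing and the close could live in one place. The
name md5_for_path and the chunk size are my own assumptions. It opens the
file in binary mode so that the raw bytes are hashed, and reads in chunks
so that large files don't need to fit in memory.

```python
import hashlib
import os
import tempfile

def md5_for_path(file_path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at file_path."""
    digest = hashlib.md5()
    # The with block closes the file even if read() raises.
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
digest = md5_for_path(tmp.name)
os.remove(tmp.name)
```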

> *    sep = ascii_nul*

Seems a strange choice of a delimiter.

> *    print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep,
> dev, sep, ino, sep, nlink, sep, size, sep, file_path )*

Arggh, my brain! *wink*

Try this instead:

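    # join() only accepts strings, so convert the integer fields first.
    fields = [thishost, md5sum, dev, ino, nlink, size, file_path]
    s = '\0'.join(str(field) for field in fields)
    print s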
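
Here is a self-contained version you can run to see the round trip. All of
the field values are made up for illustration. The integers have to be
converted with str() because join() refuses anything that isn't a string.

```python
# Made-up sample values standing in for the real per-file data.
thishost = "host1"
md5sum = "d41d8cd98f00b204e9800998ecf8427e"
dev, ino, nlink, size = 2049, 131073, 1, 0
file_path = "/tmp/empty file.txt"

fields = [thishost, md5sum, dev, ino, nlink, size, file_path]

# Convert every field to a string, then join with NUL as the separator.
record = "\0".join(str(field) for field in fields)
print(record)

# Whatever reads the output can recover the fields with split().
parts = record.split("\0")
```

After the split every field is a string, so the reader has to convert dev,
ino, nlink and size back with int() if it needs numbers.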

> *exit( 0 )*

No need to explicitly call sys.exit at the end of your code (and the bare
exit is meant for the interactive interpreter, so don't rely on it in a
script). If you exit by falling off the end of your program, Python uses
an exit code of zero. Normally, you should only call sys.exit to:

- exit with a non-zero code;

- exit early.
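
One common way to arrange that, sketched here with hypothetical names (main
and run are mine, and the usage message is invented), is to have a main
function return a status and call sys.exit only when that status is
non-zero.

```python
import sys

def main(argv):
    """Return 0 on success, or a non-zero status on failure."""
    if len(argv) < 2:
        sys.stderr.write("usage: prog START_DIRECTORY\n")
        return 2  # exit early, with a non-zero code
    # ... the real work would go here ...
    return 0

def run(argv):
    status = main(argv)
    if status != 0:
        sys.exit(status)
    # On success, just fall off the end: Python exits with status 0.
```

At the bottom of the script you would then call run(sys.argv).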

