program to generate data helpful in finding duplicate large files

Chris Angelico rosuav at
Fri Sep 19 15:59:34 CEST 2014

On Fri, Sep 19, 2014 at 11:32 PM, David Alban <extasia at> wrote:
> thanks for the responses.   i'm having quite a good time learning python.

Awesome! But while you're at it, you may want to consider learning
English on the side; capitalization does make your prose more
readable. Also, omitting it makes you look careless - you appear not
to care about your English, so it's logical to expect that you may not
care about your Python either. That may be completely false, but it's
still the impression you're creating.

> On Thu, Sep 18, 2014 at 11:45 AM, Chris Kaynor <ckaynor at>
> wrote:
>> Additionally, you may want to specify binary mode by using open(file_path,
>> 'rb') to ensure platform-independence ('r' uses Universal newlines, which
>> means on Windows, Python will convert "\r\n" to "\n" while reading the
>> file). Additionally, some platforms will treat binary files differently.
> would it be good to use 'rb' all the time?

Only if you're reading binary files. In the program you're doing here,
yes; you want binary mode.
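
To make the difference concrete, here is a minimal sketch, assuming
Python 3 (where text mode applies universal newlines on every
platform, not just Windows). The helper name read_both_ways is
hypothetical, made up for this illustration:

```python
import os
import tempfile


def read_both_ways(data):
    """Write data to a temp file, then read it back in binary and text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Binary mode: the bytes come back exactly as stored on disk.
        with open(path, "rb") as f:
            as_bytes = f.read()
        # Text mode: "\r\n" is translated to "\n" while reading.
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_bytes, as_text


raw, text = read_both_ways(b"one\r\ntwo\r\n")
print(len(raw), len(text))  # the text-mode result is two characters shorter
```

A checksum computed over the translated text would not match the
bytes on disk, which is why a duplicate finder wants 'rb'.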

> if you omit the exit statement in this example, and
> $report_mode is not set, your shell program will give a non-zero return code
> and appear to have terminated with an error.  in shell the last expression
> evaluated determines the return code to the os.

IMO that's a terrible misfeature. If you actually want the return
value to be propagated, you should have to say so - something like:

exit $?

Fortunately, Python isn't like that.
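
A rough way to check this for yourself: the sketch below runs two
tiny scripts in a fresh interpreter and reports their exit statuses.
The helper exit_status is hypothetical, and report_mode is borrowed
from the shell example quoted above purely for illustration.

```python
import subprocess
import sys


def exit_status(source):
    """Run source in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", source]).returncode


# The last expression evaluates to False, yet the script exits with 0.
implicit = exit_status("report_mode = False\nreport_mode and print('report')")

# A non-zero status has to be requested explicitly.
explicit = exit_status("import sys\nsys.exit(3)")

print(implicit, explicit)
```

A script that runs to completion exits with status 0 regardless of
what its last expression evaluated to; you get a non-zero status by
calling sys.exit() with one, or from an uncaught exception.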

> style question:  if there is only one, possibly short statement in a block,
> do folks usually move it up to the line starting the block?
>   if not S_ISREG( mode ) or S_ISLNK( mode ):
>     return
> vs.
>   if not S_ISREG( mode ) or S_ISLNK( mode ): return
> or even:
>   with open( file_path, 'rb' ) as f: md5sum = md5_for_file( file_path )

Only if it's really short AND it makes very good sense that way. Some
people would say "never". I might do it for the if/return, but not for
the with statement. (Though the with block isn't necessary there at
all; md5_for_file opens and closes the file itself, so you don't need
to open it redundantly before calling it.)
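
As a sketch of how that might look: the body of md5_for_file below is
an assumption (the original isn't shown in this message), as are the
hypothetical wrapper checksum_if_regular and the use of os.lstat to
obtain mode. The point is only that the function owns the open/close,
so the caller passes a path and nothing else.

```python
import hashlib
import os
import tempfile
from stat import S_ISLNK, S_ISREG


def md5_for_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in binary chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()


def checksum_if_regular(path):
    """Return the MD5 of a regular file, or None for anything else."""
    mode = os.lstat(path).st_mode
    if not S_ISREG(mode) or S_ISLNK(mode):
        return None
    # No "with open(...)" here: md5_for_file opens and closes the file.
    return md5_for_file(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"hello")
    file_sum = checksum_if_regular(sample)
    dir_sum = checksum_if_regular(tmp_dir)

print(file_sum, dir_sum)
```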

