A sets algorithm

Tim Chase python.list at tim.thechases.com
Sun Feb 7 19:20:50 EST 2016


On 2016-02-08 00:05, Paulo da Silva wrote:
> At 22:17 on 07-02-2016, Tim Chase wrote:
>>   all_files = list(generate_MyFile_objects())
>>   interesting = [
>>     (my_file1, my_file2)
>>     for i, my_file1
>>     in enumerate(all_files, 1)
>>     for my_file2
>>     in all_files[i:]
>>     if my_file1 == my_file2
>>     ]
> 
> "my_file1 == my_file2" can be implemented in the MyFile class, taking
> advantage of cached sizes (files of different sizes are different),
> hashes or even content (for small files) or file headers (first n
> bytes). However, this seems to have a problem:
> all_files: a b c d e ...
> If a==b then comparing b with c,d,e is useless.

That depends on what the OP wants to happen when more than two input
files are equal, i.e. a == b == c.  Does one just want "a has
duplicates" (and optionally "and here's one of them"), or does one
want "a == b", "a == c" and "b == c" in the output?
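
To make the two output styles concrete, here is a rough sketch.  It
assumes equality is transitive, so each item only needs comparing
against one representative per group (which also avoids the redundant
b-vs-c,d,e comparisons).  The names group_equal and same_content, and
the (name, content) tuples, are made up for illustration and stand in
for real MyFile objects.

```python
import operator
from itertools import combinations


def group_equal(items, equal=operator.eq):
    """Collect items into groups of mutually equal members.

    Each item is compared against the first member of every existing
    group only, which is valid as long as `equal` is transitive.
    """
    groups = []
    for item in items:
        for group in groups:
            if equal(item, group[0]):
                group.append(item)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([item])
    return groups


# Hypothetical stand-ins for files: (name, content) tuples.
files = [("a", b"x"), ("b", b"x"), ("c", b"x"), ("d", b"y")]


def same_content(f, g):
    return f[1] == g[1]


# Keep only the groups that actually contain duplicates.
groups = [g for g in group_equal(files, same_content) if len(g) > 1]

# Style 1: "a has duplicates, and here they are".
summary = [(g[0][0], [f[0] for f in g[1:]]) for g in groups]

# Style 2: every equal pair, "a == b", "a == c", "b == c".
pairs = [(f[0], h[0]) for g in groups for f, h in combinations(g, 2)]

print(summary)
print(pairs)
```

Either style falls out of the same grouping step, so the choice only
affects how the groups get reported.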

> Another solution I thought of could be defining some methods (I
> still don't know which ones) in MyFile so that I could use set
> intersection. Would this one be a faster solution?

Adding __hash__ would allow for the set operations, but would
require (as ChrisA points out) knowing how to create a hash function
that encompasses the information you want to compare.
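
For what it's worth, here is one way that might look.  It is only a
sketch under assumptions: MyFile's real attributes aren't known, so
the path/size/digest members and find_duplicates below are invented,
and it treats equal SHA-256 digests as equal content.  Note that in
Python 3, defining __eq__ without __hash__ makes instances unhashable,
and the two must agree (equal objects need equal hashes).  Hashing on
size alone satisfies that and keeps the hash cheap; the digest is only
read when two files of the same size are actually compared.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


class MyFile:
    """Hypothetical stand-in for the MyFile class from the thread."""

    def __init__(self, path):
        self.path = path
        self.size = os.path.getsize(path)
        self._digest = None  # computed lazily, then cached

    @property
    def digest(self):
        if self._digest is None:
            h = hashlib.sha256()
            with open(self.path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            self._digest = h.hexdigest()
        return self._digest

    def __eq__(self, other):
        if not isinstance(other, MyFile):
            return NotImplemented
        # Cheap size check first; the digest is read only on a size match.
        return self.size == other.size and self.digest == other.digest

    def __hash__(self):
        # Equal files always have equal sizes, so this agrees with __eq__.
        return hash(self.size)


def find_duplicates(files):
    """Group paths by file equality, keeping groups with 2+ members."""
    groups = defaultdict(list)
    for f in files:
        groups[f].append(f.path)  # dict lookup uses __hash__, then __eq__
    return [paths for paths in groups.values() if len(paths) > 1]


# Demo with throwaway files: a and b match, c has the same size but
# different content, d has a different size.
with tempfile.TemporaryDirectory() as tmp:
    contents = {"a": b"spam", "b": b"spam", "c": b"eggs", "d": b"spam!"}
    files = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as fh:
            fh.write(data)
        files.append(MyFile(path))
    duplicates = [
        [os.path.basename(p) for p in group]
        for group in find_duplicates(files)
    ]

print(duplicates)
```

With both methods defined, set(files_a) & set(files_b) would also
work for the intersection idea, with the same caveat about what the
hash and equality actually cover.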

-tkc



