A sets algorithm

Paulo da Silva p_s_d_a_s_i_l_v_a_ns at netcabo.pt
Sun Feb 7 19:05:16 EST 2016


At 22:17 on 07-02-2016, Tim Chase wrote:
> On 2016-02-07 21:46, Paulo da Silva wrote:
...

> 
> If the MyFile objects can be unique but compare for equality
> (e.g. two files on the file-system that have the same SHA1 hash, but
> you want to know the file-names), you'd have to do a paired search
> which would have worse performance and would need to iterate over the
> data multiple times:
> 
>   all_files = list(generate_MyFile_objects())
>   interesting = [
>     (my_file1, my_file2)
>     for i, my_file1
>     in enumerate(all_files, 1)
>     for my_file2
>     in all_files[i:]
>     if my_file1 == my_file2
>     ]
> 
"my_file1 == my_file2" can be implemented in the MyFile class, taking
advantage of cached sizes (files of different sizes are different),
hashes, or even content (for small files) or file headers (the first
n bytes).
However, this seems to have a problem:
all_files: a b c d e ...
If a==b, then comparing b with c, d, e is useless, because a has already
been compared with each of them.

Maybe using several steps with dicts: sizes first, then hashes for files
of the same size, etc.
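
The several-steps idea could look roughly like the sketch below. It is
only an illustration under assumptions: the names sha1_of, group_by and
find_duplicates are made up here, and it works on plain file paths
rather than on MyFile objects, whose definition was not shown in the
thread. Each step buckets files in a dict and throws away buckets with a
single member, so the expensive hashing only runs on files that already
share a size.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Bucket paths by key(path); keep only buckets with 2+ members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(paths):
    """Step 1: bucket by size. Step 2: hash only files sharing a size."""
    duplicates = []
    for same_size in group_by(paths, os.path.getsize):
        duplicates.extend(group_by(same_size, sha1_of))
    return duplicates
```

Every file is looked at once per step, so nothing like the "a==b, then
b against c, d, e" repetition happens. A further step (for example a
byte-by-byte comparison inside each final group) could be added the same
way if hash collisions are a concern.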

Another solution I thought of could be defining some methods (I still
don't know which ones) in MyFile so that I could use set intersection.
Would that be a faster solution?
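
For reference, the methods that Python sets rely on are __eq__ and
__hash__. The sketch below is a guess at how that might look, not the
real MyFile class: the constructor taking a name and in-memory content
is an assumption made to keep the example self-contained, and equality
is defined as "same SHA1 digest".

```python
import hashlib


class MyFile:
    """Hypothetical stand-in: two files are equal if their SHA1 matches."""

    def __init__(self, name, content):
        self.name = name
        self.digest = hashlib.sha1(content).hexdigest()

    def __eq__(self, other):
        if not isinstance(other, MyFile):
            return NotImplemented
        return self.digest == other.digest

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.digest)

    def __repr__(self):
        return "MyFile(%r)" % self.name


left = {MyFile("a.txt", b"spam"), MyFile("b.txt", b"eggs")}
right = {MyFile("c.txt", b"spam"), MyFile("d.txt", b"ham")}

# Files whose content appears in both collections.
common = left & right
```

Set lookups are hash-based, so the intersection avoids the pairwise
loop. The catch is that a set keeps only one of several equal objects,
so the intersection says which content is shared but not the names of
all the files that have it. If the names matter, a dict from digest to
a list of files is probably the better fit.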

Thanks


