deduping

Mon Jun 21 09:27:03 EDT 2010

dirknbr wrote:

> Hi
> 
> I have 2 files (done and outf), and I want to choose unique elements
> from the 2nd column in outf which are not in done. This code works but
> is not efficient, can you think of a quicker way? The a=1 is just a
> redundant task obviously, I put it this way around because I think
> 'in' is quicker than 'not in' - is that true?
> 
> done_={}
> for line in done:
>     done_[line.strip()]=0
> 
> print len(done_)
> 
> universe={}
> for line in outf:
>     if line.split(',')[1].strip() in universe.keys():
>         a=1
>     else:
>         if line.split(',')[1].strip() in done_.keys():
>             a=1
>         else:
>             universe[line.split(',')[1].strip()]=0

Instead of

if key in some_dict.keys():
    #...

which, in Python 2, copies the dictionary's keys into a new list and then 
performs an O(N) scan of that list, you should use

if key in some_dict:
    #...

which doesn't build a list and finds the key with a hash lookup, in constant 
time on average.
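
Applied to the whole task, that could look something like the sketch below. 
It assumes done and outf are iterables of text lines (open files, for 
instance) and swaps the dicts with dummy 0 values for sets, which use the same 
kind of hash lookup. The function name unique_not_done is made up for 
illustration.

```python
def unique_not_done(done, outf):
    """Return the distinct 2nd-column values of outf that are not in done."""
    # A set gives the same hash-based membership test as a dict,
    # without needing a dummy value for every key.
    done_ = set(line.strip() for line in done)

    universe = set()
    for line in outf:
        # Split each line once instead of up to three times.
        key = line.split(',')[1].strip()
        if key not in done_:
            # Adding a key that is already present is a no-op,
            # so no separate check against universe is needed.
            universe.add(key)
    return universe
```

As for 'in' versus 'not in': both do the same lookup, and 'not in' just 
negates the result, so the a=1 placeholder branches can go.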

Peter