Slow comparison between two lists

Peter Otten __peter__ at web.de
Thu Oct 23 14:24:25 CEST 2008


Jani Tiainen wrote:

> I have rather simple 'Address' object that contains streetname,
> number, my own status and x,y coordinates for it. I have two lists
> both containing approximately 30000 addresses.
> 
> I've defined __eq__ method in my class like this:
> 
>     def __eq__(self, other):
>         return self.xcoord == other.xcoord and \
>             self.ycoord == other.ycoord and \
>             self.streetname == other.streetname and \
>             self.streetno == other.streetno
> 
> But it turns out to be very, very slow.
> 
> Then I setup two lists:
> 
> list_external = getexternal()
> list_internal = getinternal()
> 
> Now I need get all all addresses from 'list_external' that are not in
> 'list_internal', and mark them as "new".
> 
> I did it like this:
> 
> for addr in list_external:
>     if addr not in list_internal:
>         addr.status = 1 # New address
> 
> But in my case running that loop takes about 10 minutes. What I am
> doing wrong?

Even if list_external and list_internal contain the same items you need
about len(list_external)*(len(list_internal)/2), or 45 million comparisons.
You can bring that down to 30000*some_small_factor if you make your
addresses hashable and put the internal ones into a dict or set

def __hash__(self):
   return hash((self.xcoord, self.yccord, self.streetname, self.streetno))
def __eq__(self, other):
   # as above

Then

set_internal = set(list_internal)
for addr in list_external:
    if addr not in set_internal:
        addr.status = 1

Note that the attributes relevant for hash and equality must not change
during this process.

Peter



More information about the Python-list mailing list