Comparing a lot of data efficiently

Hans Nowak hnowak at cuci.nl
Mon Aug 27 13:17:44 CEST 2001


>===== Original Message From Marcus Stojek <stojek at part-gmbh.de> =====
>
>Okay, here is what I do. I have two files (ASCII), each of them
>containing a huge amount (let's say 100.000) triangles. The format is:
>
>Number of triangle,number of first node, number of second node , number
>of third node \n
>
>node are corners of the triangle here. Each node has three coordinates in
>space, but we don't need them here.
>
>The two files are describing exactly the same group of triangles, but the
>triangles are numbered differently in the two files. So the triangle
>containing the nodes (12,49583,331) has the number 5 in one file and the
>number 23094 in the other one. As two triangles can share two nodes I
>have to compare all three nodes for all triangles.
>
>I have two dictonaries:
>
>trione={number triangel:1.node,2.node,3.node}
>(key and values are integer)
>tritwo=as trione
>
>def common(list1,list2):
>    hit=0
>    for i in list1:
>        for j in list2:
>            if i==j:
>                hit +=1
>    return hit

This implies that the nodes can appear in varying orders, e.g. (1, 2, 3) is 
the same node as (3, 2, 1) etc. I don't know how much control you have over 
the *creation* of the ASCII files, but if you can somehow make sure the orders 
of the nodes in both files are the same, there's room for considerable 
optimization:

def common(list1,list2):
    hit=0
    if list1 == list2:
        hit = 3 # faking the original value
    return hit

If you cannot change the files, I'm not sure if it's feasible to give them a 
fixed order when reading the files into the dictionary, e.g. by sorting the 
node part. I haven't tested this, so this may just move the delay to a 
different place without resulting in any overall performance increase.

Aside from this, I haven't found many possibilities for optimization. I made a 
test set and tried moving the tritwo.keys() out of the outer loop, and using 
lists instead of dicts... this did not result in better performance. :(  That 
does not mean nothing can be done though; maybe someone else on this list has 
a clever idea.

HTH,

--Hans Nowak





More information about the Python-list mailing list