Performance of list.index - how to speed up a silly algorithm?

Fri Apr 30 06:10:07 EDT 2010

>> Maybe this should be implemented in C. But I believe that the 
>> algorithm itself must be wrong (regardless of the language). I really 
>> think that I'm doing something wrong. Looks like my algorithm's 
>> processing time is not linear to the number of rows. Not even 
>> log(n)*n. There should be a more effective way to do this. But how?
>>
>> I had the idea of sorting the rows by a given dimension. Then it 
>> would be obvious to speed up the indexing part - for that dimension. 
>> PROBABLY sorting all rows would be faster than calling list.index() 
>> for each row. But... it does not seem very efficient either. What if 
>> I have 1 million rows and 10 dimensions? Do I sort 1 million rows on 
>> the disk 10 times? Some of you might have ran into the same problem 
>> before, and can tell me which is the most efficient way to do this.
>>
> The .index method does a linear search, checking on average 1/2 of the 
> items in the list. That's why it's so slow.
>
> In order to avoid that you could build a dict of each value in
> dimension_values[col_idx] and its index in a single pass so that it
> becomes a quick lookup.
Changed to dicts and hashed lookups. Now the processing time is O(n), 
and went up to 8000 rows/sec.

Probably I'll never want to process more than 1M rows. That will take 
about 125 seconds. Fair enough.

Thank you very much!

   Laszlo