time consuming loops over lists

John Machin sjmachin at lexicon.net
Tue Jun 7 18:25:28 EDT 2005


Diez B. Roggisch wrote:
> querypk at gmail.com wrote:
> 
>> X-No-Archive: yes
>> Can some one help me improve this block of code...this jus converts the
>> list of data into tokens based on the range it falls into...but it
>> takes a long time.Can someone tell me what can i change to improve
>> it...
>>
> 
> 
>>                if data[i] in xrange(rngs[j],rngs[j+1]):
>>
> 
> That's a bummer: You create a list and then search linearily in in - 
> where all you want to know is
> 
> if rngs[j] <= data[i] < rngs[j+1]
> 
> Attached is a script that does contain  your old and my enhanced version 
> - and shows that the results are equal. Running only your version takes 
> ~35s, where mine uses ~1s!!!
> 
> Another optimization im too lazy now would be to do sort of a "tree 
> search" of data[i] in rngs - as the ranges are ordered, you could find 
> the proper one in log_2(len(rngs)) instead of len(rngs)/2.
> 
> Additional improvements can be achieved when data is sorted - but that 
> depends on your application and actually sizes of data.
> 
> Diez
> 
[snip]
Make these changes/additions:
=================================================
# from Numeric import *
# +1 for Overkill-of-the-Year
def zeros(n):
     return n * [0.0]

def Tkz3(tk, data, no_of_bins):
      # indent 5 ???????
      # change other funcs to have no_of_bins as an arg
      tkns = []
      dmax = max(data)+1
      dmin = min(data)
      rng = int(ceil(abs((dmax - dmin)/(no_of_bins*1.0))))
      for datum in data:
           j = (datum - dmin) // rng
           tkns.append( str(tk)+str(j) )
      return tkns

for func in (Tkz, Tkz2, Tkz3):
     t0 = time.time()
     result = func('A', data, nbins)
     t1 = time.time()
     results.append(result)
     print "function %s took %.2f seconds to produce %d tokens" % 
(func.__name__, t1 - t0, len(result))

allsame = results [0] == results[1] == results[2]
print "\nIdentical results:", allsame
=======================================================
and get this:

C:\junk>ranges.py
C:\junk\ranges.py:46: DeprecationWarning: integer argument expected, got 
float
   if data[i] in xrange(rngs[j], rngs[j+1]):
function Tkz took 12.13 seconds to produce 12292 tokens
function Tkz2 took 0.08 seconds to produce 12292 tokens
function Tkz3 took 0.02 seconds to produce 12292 tokens

Identical results: True

==================================================
using Python 2.4.1 (win32) on a 3.2Ghz Intel P4
==================================================

Notes: (1) The OP's code as well as being deliciously O(N**2) depends on 
the data beint *integers* -- otherwise it goes rapidly pear-shaped; see 
the deprecation warning above, and try running it with data = [0.00001* 
x for x in range(20, 12312)]. Consequently the use of math.* is a /mild/ 
overkill.

(2) The OP's code produces bins of equal nominal width but depending in 
the range, the population of the last bin may be much less than the 
population of the other bins. Another way of doing it would be to 
produce bin widths so that the populations were more evenly spread. In 
any case determining the appropriate bin could still be done by formula 
rather than by searching.

Cheers,
John



More information about the Python-list mailing list