finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Mon Jul 6 09:59:44 EDT 2009


Peter Otten wrote:
> Scott David Daniels wrote:
> 
>> Scott David Daniels wrote:
> 
>>      t = timeit.Timer('sum(part[:-1]==part[1:])',
>>                       'from __main__ import part')
> 
> What happens if you calculate the sum in numpy? Try
> 
> t = timeit.Timer('(part[:-1]==part[1:]).sum()',
>                  'from __main__ import part')

Good idea, I hadn't thought of summing the numpy bools with numpy itself.
     (part[:-1]==part[1:]).sum()
is only a slight improvement over
     len(part[part[:-1]==part[1:]])
when there are few matches, but it is more than three times
as fast when there are many (reflecting the work of
allocating and copying the matched elements).

 >>> import numpy
 >>> import timeit
 >>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
 >>> data = original.flatten()
 >>> data.sort()
 >>> t = timeit.Timer('sum(part[:-1]==part[1:])',
                      'from __main__ import part')
 >>> u = timeit.Timer('len(part[part[:-1]==part[1:]])',
                      'from __main__ import part')
 >>> v = timeit.Timer('(part[:-1]==part[1:]).sum()',
                      'from __main__ import part')

 >>> part = data[::100]
 >>> (part[:-1]==part[1:]).sum()
9390
 >>> t.repeat(3, 10)
[0.56368281443587875, 0.55615057220961717, 0.55465764503594528]
 >>> u.repeat(3, 1000)
[0.89576580263690175, 0.89276374511291579, 0.8937328626963108]
 >>> v.repeat(3, 1000)
[0.24798598704592223, 0.24715431709898894, 0.24498979618920202]
 >>>
 >>> part = original.flatten()[::100]
 >>> (part[:-1]==part[1:]).sum()
27
 >>> t.repeat(3, 10)
[0.57576898739921489, 0.56410158274297828, 0.56988248506445416]
 >>> u.repeat(3, 1000)
[0.27312186325366383, 0.27315007913011868, 0.27214492344683094]
 >>> v.repeat(3, 1000)
[0.28410342655297427, 0.28374053126867693, 0.28318990262732768]
 >>>

Net result: go back to the former definition of candidates (a count,
not the actual entries), but calculate that count as matches.sum()
rather than len(part[matches]).
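
To make the comparison concrete, here is a small stand-alone sketch of
the three counting expressions.  The names and array size are mine, and
I spell the fancy-indexing version part[:-1][matches] so the boolean
mask is the same length as what it indexes (newer numpy insists on that):

    import numpy as np

    # Sketch (not from the original post): roughly the setup from the
    # session above -- a sorted flat array, sampled every 100th element.
    data = np.sort(np.random.normal(0, 100, 1000 * 1000).astype(int))
    part = data[::100]

    matches = part[:-1] == part[1:]      # True where adjacent samples are equal

    n_builtin = sum(matches)             # builtin sum: adds the bools one at a time
    n_indexed = len(part[:-1][matches])  # fancy indexing: copies the matched values
    n_reduced = matches.sum()            # numpy reduction: counts in C, no extra copy

    assert n_builtin == n_indexed == n_reduced

The first adds numpy scalars one by one in the interpreter, the second
allocates a whole new array only to take its length, and the third
counts the True values in compiled code; that is where the timing
differences above come from.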

Now the latest version of this (compressed) code:
 >     ...
 >     sampled = data[::stride]
 >     matches = sampled[:-1] == sampled[1:]
 >     candidates = sum(matches) # count identified matches
 >     while candidates > N * 10: # 10 -- heuristic
 >     stride *= 2 # heuristic increase
 >         sampled = data[::stride]
 >         matches = sampled[:-1] == sampled[1:]
 >         candidates = sum(matches)
 >     while candidates < N * 3: # heuristic slop for long runs
 >         stride //= 2 # heuristic decrease
 >         sampled = data[::stride]
 >         matches = sampled[:-1] == sampled[1:]
 >         candidates = sum(matches)
 >     former = None
 >     past = 0
 >     for value in sampled[matches]:
 >         ...
is:
       ...
       sampled = data[::stride]
       matches = sampled[:-1] == sampled[1:]
       candidates = matches.sum() # count identified matches
       while candidates > N * 10: # 10 -- heuristic
           stride *= 2 # heuristic increase
           sampled = data[::stride]
           matches = sampled[:-1] == sampled[1:]
           candidates = matches.sum()
       while candidates < N * 3: # heuristic slop for long runs
           stride //= 2 # heuristic decrease
           sampled = data[::stride]
           matches = sampled[:-1] == sampled[1:]
           candidates = matches.sum()
       former = None
       past = 0
       for value in sampled[matches]:
           ...
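
For reference, here is a rough, self-contained sketch of that fragment
wrapped up as a function.  The name sampled_duplicates, the default
stride, the "stride > 1" guard on the shrinking loop, and the
sampled[:-1][matches] spelling (so the mask is not one element shorter
than the array it indexes, which newer numpy rejects) are my additions,
not part of the original code:

    import numpy as np

    def sampled_duplicates(data, N, stride=1024):
        # Sketch (not from the original post): grow or shrink the
        # sampling stride until the number of adjacent duplicates in
        # the strided sample is comfortably above N, the number of
        # most-common values wanted.  data must already be sorted.
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = matches.sum()                 # count identified matches
        while candidates > N * 10:                 # 10 -- heuristic
            stride *= 2                            # heuristic increase
            sampled = data[::stride]
            matches = sampled[:-1] == sampled[1:]
            candidates = matches.sum()
        while candidates < N * 3 and stride > 1:   # heuristic slop for long runs
            stride //= 2                           # heuristic decrease
            sampled = data[::stride]
            matches = sampled[:-1] == sampled[1:]
            candidates = matches.sum()
        return sampled[:-1][matches]               # values seen at least twice in the sample

The elided former/past loop then walks those candidate values to pick
out the most common elements.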

Now I think I can let this problem go, especially since it was
mclovin's problem in the first place.

--Scott David Daniels
Scott.Daniels at Acm.Org


