finding most common elements between thousands of multiple arrays.

Scott David Daniels Scott.Daniels at Acm.Org
Mon Jul 6 09:59:44 EDT 2009

```
Peter Otten wrote:
> Scott David Daniels wrote:
>
>> Scott David Daniels wrote:
>
>>      t = timeit.Timer('sum(part[:-1]==part[1:])',
>>                       'from __main__ import part')
>
> What happens if you calculate the sum in numpy? Try
>
> t = timeit.Timer('(part[:-1]==part[1:]).sum()',
>                  'from __main__ import part')

(part[:-1]==part[1:]).sum()
is only a slight improvement over
len(part[part[:-1]==part[1:]])
when there are few matches, but it is almost twice
as fast when there are many, since the boolean-index
form must allocate and copy the matched elements
before counting them.

>>> import numpy
>>> import timeit
>>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
>>> data = original.flatten()
>>> data.sort()
>>> t = timeit.Timer('sum(part[:-1]==part[1:])',
...                  'from __main__ import part')
>>> u = timeit.Timer('len(part[part[:-1]==part[1:]])',
...                  'from __main__ import part')
>>> v = timeit.Timer('(part[:-1]==part[1:]).sum()',
...                  'from __main__ import part')

>>> part = data[::100]
>>> (part[:-1]==part[1:]).sum()
9390
>>> t.repeat(3, 10)
[0.56368281443587875, 0.55615057220961717, 0.55465764503594528]
>>> u.repeat(3, 1000)
[0.89576580263690175, 0.89276374511291579, 0.8937328626963108]
>>> v.repeat(3, 1000)
[0.24798598704592223, 0.24715431709898894, 0.24498979618920202]
>>>
>>> part = original.flatten()[::100]
>>> (part[:-1]==part[1:]).sum()
27
>>> t.repeat(3, 10)
[0.57576898739921489, 0.56410158274297828, 0.56988248506445416]
>>> u.repeat(3, 1000)
[0.27312186325366383, 0.27315007913011868, 0.27214492344683094]
>>> v.repeat(3, 1000)
[0.28410342655297427, 0.28374053126867693, 0.28318990262732768]
>>>

Net result: go back to former definition of candidates (a number,
not the actual entries), but calculate that number as matches.sum(),
not len(part[matches]).

>     ...
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches) # count identified matches
>     while candidates > N * 10: # 10 -- heuristic
>         stride *= 2 # heuristic increase
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     while candidates < N * 3: # heuristic slop for long runs
>         stride //= 2 # heuristic decrease
>         sampled = data[::stride]
>         matches = sampled[:-1] == sampled[1:]
>         candidates = sum(matches)
>     former = None
>     past = 0
>     for value in sampled[matches]:
>         ...
becomes:

    ...
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = matches.sum() # count identified matches
    while candidates > N * 10: # 10 -- heuristic
        stride *= 2 # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = matches.sum()
    while candidates < N * 3: # heuristic slop for long runs
        stride //= 2 # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = matches.sum()
    former = None
    past = 0
    for value in sampled[matches]:
        ...

Now I think I can let this problem go, esp. since it was
mclovin's problem in the first place.

--Scott David Daniels
Scott.Daniels at Acm.Org

```