finding most common elements between thousands of multiple arrays.
Scott David Daniels
Scott.Daniels at Acm.Org
Sun Jul 5 20:30:58 EDT 2009
Scott David Daniels wrote:
> ... Here's a heuristic replacement for my previous frequency code:
> I've tried to mark where you could fudge numbers if the run time
> is at all close.
Boy, I cannot let go.  I did a bit of testing on the cost of
calculating the number of discovered matches, and found that after:
import timeit
import numpy
original = numpy.random.uniform(0, 100, (1000, 1000)).astype(int)  # a million ints in 0..99
data = original.flatten()
data.sort()
part = data[::100]
t = timeit.Timer('sum(part[:-1]==part[1:])',
'from __main__ import part')
v = timeit.Timer('len(part[part[:-1]==part[1:]])',
'from __main__ import part')
I got:
>>> t.repeat(3, 10)
[0.58319842326318394, 0.57617574300638807, 0.57831819407238072]
>>> v.repeat(3, 1000)
[0.93933027801040225, 0.93704535073584339, 0.94096260837613954]
So, per call (note the repeat counts: 10 vs. 1000), len(part[mask]) is
roughly 60X faster!  I checked that both give the same answer:
>>> sum(part[:-1]==part[1:])
9393
>>> len(part[part[:-1]==part[1:]])
9393
That's an awful lot of matches, so I retried with high selectivity:
data = original.flatten() # no sorting, so runs missing
part = data[::100]
>>> t.repeat(3, 10)
[0.58641335700485797, 0.58458854407490435, 0.58872594142576418]
>>> v.repeat(3, 1000)
[0.27352554584422251, 0.27375686015921019, 0.27433291102624935]
about 200X faster this time, and the counts still agree:
>>> len(part[part[:-1]==part[1:]])
39
>>> sum(part[:-1]==part[1:])
39
So my new version of this (compressed) code:
> ...
> sampled = data[::stride]
> matches = sampled[:-1] == sampled[1:]
> candidates = sum(matches) # count identified matches
> while candidates > N * 10: # 10 -- heuristic
> stride *= 2 # # heuristic increase
> sampled = data[::stride]
> matches = sampled[:-1] == sampled[1:]
> candidates = sum(matches)
> while candidates < N * 3: # heuristic slop for long runs
> stride //= 2 # heuristic decrease
> sampled = data[::stride]
> matches = sampled[:-1] == sampled[1:]
> candidates = sum(matches)
> former = None
> past = 0
> for value in sampled[matches]:
> ...
is:
...
sampled = data[::stride]
candidates = sampled[sampled[:-1] == sampled[1:]]
while len(candidates) > N * 10: # 10 -- heuristic
    stride *= 2                  # heuristic increase
sampled = data[::stride]
candidates = sampled[sampled[:-1] == sampled[1:]]
while len(candidates) < N * 3: # heuristic slop for long runs
stride //= 2 # heuristic decrease
sampled = data[::stride]
candidates = sampled[sampled[:-1] == sampled[1:]]
former = None
past = 0
for value in candidates:
...
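For completeness, here is one way that elided loop might continue; this
is a sketch of mine, not the original code, and it assumes data is
sorted and N is the number of most-common values wanted.  searchsorted
gives an exact count for each candidate:

# Sketch only (not the elided original): count each candidate exactly.
counts = []
former = None
for value in candidates:
    if value != former:            # candidates may repeat; count each once
        former = value
        left = numpy.searchsorted(data, value, side='left')
        right = numpy.searchsorted(data, value, side='right')
        counts.append((right - left, value))
counts.sort(reverse=True)
most_common = [value for count, value in counts[:N]]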
Replacing sum(matches) with len(candidates) matters because we try
several strides before settling on one, so the speedup is paid on
every pass through those loops.  It also means we can afford to be
pickier about choosing a stride (try more values, as sketched below),
since each check is now much cheaper.
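For example (a minimal sketch of mine, not from the post; the 3/2
factor and the pick_stride name are just illustrative), a gentler
growth factor than doubling lands closer to the target band:

def pick_stride(data, N, stride=100):
    # Illustrative only: adjust by a factor of 3/2 rather than 2,
    # since recounting candidates is now cheap.
    def candidates_at(s):
        sampled = data[::s]
        return len(sampled[sampled[:-1] == sampled[1:]])
    while candidates_at(stride) > N * 10:
        stride = stride * 3 // 2 + 1       # gentler increase
    while stride > 1 and candidates_at(stride) < N * 3:
        stride = max(1, stride * 2 // 3)   # gentler decrease
    return stride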
Summary: when dealing with numpy (or any bulk <-> individual-value
transition), try several ways that you think are equivalent and
_measure_.  In the OODB work I did we called this "impedance mismatch,"
and it is likely that some boundary transitions are _much_ faster than
others.  The sum case is one of them: the builtin sum gets numpy
booleans back, rather than Python ints, so the conversions don't go
through a fast path.
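One more equivalent spelling worth measuring (the .sum() method call is
my addition, not something timed above): keeping the whole reduction
inside numpy avoids the boundary altogether.

import timeit
import numpy
part = numpy.sort(numpy.random.uniform(0, 100, 10000).astype(int))
for stmt in ('sum(part[:-1] == part[1:])',        # builtin sum crosses the boundary per element
             'len(part[part[:-1] == part[1:]])',  # boolean indexing, then a cheap len
             '(part[:-1] == part[1:]).sum()'):    # reduction stays inside numpy
    timer = timeit.Timer(stmt, 'from __main__ import part')
    print(stmt, min(timer.repeat(3, 100)))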
--Scott David Daniels
Scott.Daniels at Acm.Org