finding most common elements between thousands of multiple arrays.
Scott David Daniels
Scott.Daniels at Acm.Org
Mon Jul 6 09:59:44 EDT 2009
Peter Otten wrote:
> Scott David Daniels wrote:
>
>> Scott David Daniels wrote:
>
>> t = timeit.Timer('sum(part[:-1]==part[1:])',
>> 'from __main__ import part')
>
> What happens if you calculate the sum in numpy? Try
>
> t = timeit.Timer('(part[:-1]==part[1:]).sum()',
> 'from __main__ import part')
Good idea; I hadn't thought of summing numpy bools directly.
(part[:-1]==part[1:]).sum()
is only a slight improvement over
len(part[part[:-1]==part[1:]])
when there are few elements, but it is almost twice
as fast when there are a lot (reflecting the work
of allocating and copying).
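To make the idiom concrete (a small illustration of my own, not from the thread): on a sorted array, comparing each element with its successor flags adjacent duplicates, and summing the boolean mask counts them without materializing a filtered copy. Note that modern NumPy requires the mask and the indexed array to have the same length, hence the `part[:-1][matches]` form rather than the `part[matches]` spelling the 2009-era code used.

```python
import numpy as np

# Illustration (the array contents are my own example): adjacent-equality
# mask on a sorted array, counted two ways.
part = np.array([1, 1, 2, 3, 3, 3, 7])
matches = part[:-1] == part[1:]        # [ True, False, False, True, True, False]
count_by_sum = int(matches.sum())      # reduces in C; no copy of the data
count_by_len = len(part[:-1][matches]) # builds a filtered copy, then counts it
print(count_by_sum, count_by_len)      # -> 3 3
```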
>>> import numpy
>>> import timeit
>>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
>>> data = original.flatten()
>>> data.sort()
>>> t = timeit.Timer('sum(part[:-1]==part[1:])',
...                  'from __main__ import part')
>>> u = timeit.Timer('len(part[part[:-1]==part[1:]])',
...                  'from __main__ import part')
>>> v = timeit.Timer('(part[:-1]==part[1:]).sum()',
...                  'from __main__ import part')
>>> part = data[::100]
>>> (part[:-1]==part[1:]).sum()
9390
>>> t.repeat(3, 10)
[0.56368281443587875, 0.55615057220961717, 0.55465764503594528]
>>> u.repeat(3, 1000)
[0.89576580263690175, 0.89276374511291579, 0.8937328626963108]
>>> v.repeat(3, 1000)
[0.24798598704592223, 0.24715431709898894, 0.24498979618920202]
>>>
>>> part = original.flatten()[::100]
>>> (part[:-1]==part[1:]).sum()
27
>>> t.repeat(3, 10)
[0.57576898739921489, 0.56410158274297828, 0.56988248506445416]
>>> u.repeat(3, 1000)
[0.27312186325366383, 0.27315007913011868, 0.27214492344683094]
>>> v.repeat(3, 1000)
[0.28410342655297427, 0.28374053126867693, 0.28318990262732768]
>>>
Net result: go back to former definition of candidates (a number,
not the actual entries), but calculate that number as matches.sum(),
not len(part[matches]).
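The gap has a simple cause: the builtin `sum` walks the bool array one Python object at a time, while the `ndarray.sum` method reduces in a single C loop. A quick way to see it (my own micro-benchmark sketch; absolute numbers vary by machine):

```python
import timeit
import numpy as np

# My own setup, not the thread's: a sorted array of 10,000 ints.
part = np.sort(np.random.normal(0, 100, 10_000).astype(int))
t_builtin = timeit.timeit('sum(part[:-1] == part[1:])',
                          globals=globals(), number=20)
t_method = timeit.timeit('(part[:-1] == part[1:]).sum()',
                         globals=globals(), number=20)
print(t_builtin > t_method)  # the C reduction should win comfortably
```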
Now the latest version of this (compressed) code:
> ...
> sampled = data[::stride]
> matches = sampled[:-1] == sampled[1:]
> candidates = sum(matches)  # count identified matches
> while candidates > N * 10:  # 10 -- heuristic
>     stride *= 2  # heuristic increase
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches)
> while candidates < N * 3:  # heuristic slop for long runs
>     stride //= 2  # heuristic decrease
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches)
> former = None
> past = 0
> for value in sampled[matches]:
>     ...
is:
...
sampled = data[::stride]
matches = sampled[:-1] == sampled[1:]
candidates = matches.sum()  # count identified matches
while candidates > N * 10:  # 10 -- heuristic
    stride *= 2  # heuristic increase
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = matches.sum()
while candidates < N * 3:  # heuristic slop for long runs
    stride //= 2  # heuristic decrease
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = matches.sum()
former = None
past = 0
for value in sampled[matches]:
    ...
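For anyone wanting to try the stride-tuning loops in isolation, here is a self-contained sketch of mine (the function name, the `stride > 1` guard, and the `[:-1]` indexing for modern NumPy are all my additions; the follow-up scan over the elided `for` loop is not reproduced):

```python
import numpy as np

def sample_candidates(data, N, stride=100):
    """Hypothetical wrapper (name and signature mine) around the
    stride-tuning heuristic above; `data` must already be sorted."""
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = matches.sum()
    while candidates > N * 10:                 # too many: sample coarser
        stride *= 2
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = matches.sum()
    while candidates < N * 3 and stride > 1:   # too few: sample finer
        stride //= 2                           # (guard is my addition)
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = matches.sum()
    # Modern NumPy wants mask and array the same length, hence [:-1].
    return sampled[:-1][matches]

rng = np.random.default_rng(0)
data = np.sort(rng.normal(0, 100, 1_000_000).astype(int))
cands = sample_candidates(data, N=10)
print(len(cands))  # heavily repeated values survive the strided sampling
```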
Now I think I can let this problem go, esp. since it was
mclovin's problem in the first place.
--Scott David Daniels
Scott.Daniels at Acm.Org