finding most common elements between thousands of multiple arrays.
Scott David Daniels
Scott.Daniels at Acm.Org
Sat Jul 4 18:50:23 EDT 2009
mclovin wrote:
> On Jul 4, 12:51 pm, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
>> mclovin wrote:
>>> OK then. I will try some of the strategies here but I guess things
>>> aren't looking too good. I need to run this over a dataset that someone
>>> pickled. I need to run this 480,000 times, so you can see my
>>> frustration. So it doesn't need to be "real time", but it would be nice
>>> if it was done sorting this month.
>>> Is there a "best guess" strategy where it is not 100% accurate but much
>>> faster?
>>
>> Well, I timed a run of a version of mine, and the scan is approx 5X
>> longer than the copy-and-sort. Time arr_of_arr.flatten().sort() to
>> see how quickly the copy and sort happens. So you could try a variant
>> exploiting the following property:
>> If you know the minimum length of a run that will be in the top 25,
>> and you pick stride to be at most half that length, then the value of
>> each of those most-frequent runs must show up at positions n * stride
>> and (n + 1) * stride (for some n). That should drastically reduce the
>> scan cost, as long as stride is reasonably large....
>>
>> If sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000,
>> then there are only 1000 points to investigate.
>> With any distribution other than uniform, that should go _way_ down.
>> So just pull out those points, use bisect to get their frequencies, and
>> feed those results into the heap accumulation.
>>
>> --Scott David Daniels
>
> I don't quite understand what you are saying, but I know this: the
> number of times the most common element appears varies greatly.
> Sometimes it appears over 1000 times, and sometimes it appears less
> than 50. It all depends on the density of the arrays I am analyzing.
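Since the stride trick may still be unclear, here is a tiny illustration
of it (NumPy assumed; the data and names here are just for illustration).
Any value whose run in the sorted data is at least 2 * stride long must
cover two consecutive multiples of stride, so comparing the stride-sampled
view against itself shifted by one finds every such value:

    import numpy as np

    # Made-up sorted data in which the value 7 forms one long run.
    data = np.sort(np.concatenate([np.arange(10000), np.full(500, 7)]))

    stride = 100                  # guess: half the shortest interesting run
    sampled = data[::stride]      # every stride-th element, still sorted
    matches = sampled[:-1] == sampled[1:]   # equal consecutive samples?
    candidates = sampled[:-1][matches]      # values worth a closer look

    print(np.unique(candidates))  # only 7 survives; singletons are skipped

Only the values that survive this cheap filter need the exact bisect
counting, which is where the savings come from.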
Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.
import bisect  # needed for the run-boundary searches below

def frequency(arr_of_arr, N, stride=100):
    '''Produce (freq, value) pairs for data in arr_of_arr (a numpy array).
    Tries to produce > N pairs.  stride is a guess at half
    the length of the shortest run in the top N runs.
    '''
    # If the next two lines are too slow, this whole approach is toast.
    data = arr_of_arr.flatten()  # big allocation
    data.sort()                  # a couple of seconds for 25 million ints
    # stride is a length forcing examination of a run.
    sampled = data[::stride]
    # Note this is a view into data, and is still sorted.
    # We know that any run of length at least 2 * stride in data _must_
    # have consecutive equal entries in sampled.  Compare them "in parallel".
    matches = sampled[:-1] == sampled[1:]
    # matches is True or False for stride-separated values from sampled.
    candidates = sum(matches)  # count identified matches
    # While candidates is huge, keep trying with a larger stride.
    while candidates > N * 10:  # 10 -- heuristic
        stride *= 2             # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    # If we got too few, move stride down (but never to zero):
    while candidates < N * 3 and stride > 1:  # heuristic slop for long runs
        stride //= 2            # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    # Here we have a "nice" list of candidates that is likely
    # to include every run we actually want.  sampled[:-1][matches] is
    # the sorted list of candidate values.  It may have duplicates.
    former = None
    past = 0
    # In the loop here we only use sampled to pick the values we
    # then go find in data.  We avoid checking the same value twice.
    for value in sampled[:-1][matches]:
        if value == former:
            continue        # A long run: multiple matches in sampled
        former = value      # Make sure we only try this one once
        # Find the beginning of the run.
        start = bisect.bisect_left(data, value, past)
        # Find the end of the run (we know it is at least stride long).
        past = bisect.bisect_right(data, value, start + stride)
        yield past - start, value  # produce (frequency, value) pairs
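
One possible way to drive this generator, sketched on the assumption that
arr_of_arr is a NumPy integer array and that the goal is still the
thread's 25 most common values (the data below is invented for
illustration):

    import heapq
    import numpy as np

    # Hypothetical stand-in for the pickled dataset from the thread.
    arr_of_arr = np.random.randint(0, 1000, size=(5000, 100))

    # heapq.nlargest keeps the 25 highest (frequency, value) pairs.
    for count, value in heapq.nlargest(25, frequency(arr_of_arr, 25)):
        print(value, count)

Because the candidate filter is heuristic, this is a "best guess" answer
rather than a guaranteed exact top 25, which matches what was asked for
upthread.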
--Scott David Daniels
Scott.Daniels at Acm.Org