Newbie: finding the key/index of the min/max element
Tim Peters
tim.one at comcast.net
Sun May 5 05:22:04 EDT 2002
[David Eppstein]
> ...
> I'm pretty sure it would usually be better just to sort the list and
> return the middle item, instead of using a linear-time median
> finding algorithm, despite the nonlinear big-O time of a sort.
Many years ago I had to write a linear-time selector in Pascal. It was so
horridly painful I never tried it again. But I'm old enough now to confront
my fears <wink>, so I tried it in Python, and was pleasantly surprised at
how easily it went. Its performance isn't bad, considering it's written in
Python, and how much effort went into speeding Python's C sort. Here are
the seconds each took for finding the median of various sizes of arrays of
random doubles (these are the means of 3 runs at each size):
       n  clever    sort
 -------  ------  ------
       2   0.000   0.000
       3   0.000   0.000
       5   0.000   0.000
       9   0.000   0.000
      17   0.000   0.000
      33   0.000   0.000
      65   0.001   0.000
     129   0.002   0.000
     257   0.002   0.000
     513   0.004   0.001
    1025   0.008   0.002
    2049   0.018   0.004
    4097   0.037   0.009
    8193   0.088   0.020
   16385   0.166   0.047
   32769   0.303   0.106
   65537   0.638   0.236
  131073   1.276   0.528
  262145   2.579   1.152
  524289   5.245   2.523
 1048577  10.674   5.460
 2097153  22.063  11.863
I got bored then. This is using -O with current CVS Python, which is
somewhat zippier than currently released Pythons. For contrast, the Python
version doesn't take much longer to find the median than it takes just to
fill the array with random doubles (random.random() is coded in Python too).
There are obvious but unpleasant ways to speed the Python version (it does a
lot of data copying), which I'll skip.
exorcising-old-demons-ly y'rs - tim
# Find the rank'th-smallest value in a, in worst-case quadratic time.
def short_find(a, rank):
    a.sort()
    return a[rank - 1]
# Find the rank'th-smallest value in a, in worst-case linear time.
def find(a, rank):
    n = len(a)
    assert 1 <= rank <= n
    if n <= 7:
        return short_find(a, rank)
    # Find median of median-of-7's.
    medians = [short_find(a[i : i+7], 4) for i in xrange(0, n-6, 7)]
    median = find(medians, (len(medians) + 1) // 2)
    # Partition around the median:
    #     a[:i]   <= median
    #     a[j+1:] >= median
    i, j = 0, n-1
    while i <= j:
        while a[i] < median:
            i += 1
        while a[j] > median:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    if rank <= i:
        return find(a[:i], rank)
    else:
        return find(a[i:], rank - i)
def tryone(n, tries):
    from random import random
    from time import clock as now
    x = [None] * n
    median_rank = (n+1) // 2
    sum1 = sum2 = 0.0
    for i in range(tries):
        for i in xrange(n):
            x[i] = random()
        y = x[:]
        start = now()
        got1 = find(x, median_rank)
        elapsed = now() - start
        sum1 += elapsed
        start = now()
        got2 = short_find(y, median_rank)
        elapsed = now() - start
        sum2 += elapsed
        if got1 != got2:
            print "OUCH got1 %r got2 %r" % (got1, got2)
    return sum1, sum2
def drive(tries):
    for i in range(23):
        n = 2**i + 1
        fast, slow = tryone(n, tries)
        print "%8d %7.3f %7.3f" % (n, fast/tries, slow/tries)

if __name__ == "__main__":
    drive(3)
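[Editor's note: one way to spell out the "unpleasant" speedup Tim skips (avoiding all that slice copying) is to recurse on index bounds instead of building new lists, keeping the same median-of-median-of-7's pivot. This is a sketch of my own, written for modern Python; the names `find_inplace` and `_median_small` are not from the post.]

```python
# Sketch: selection that avoids copying big slices by recursing on
# index bounds (lo, hi) instead of sublists.  Same worst-case-linear
# median-of-median-of-7's pivot as Tim's `find`; only the tiny 7-element
# groups and the n/7-sized medians list are still copied.

def _median_small(a, lo, hi, rank):
    # rank'th smallest of a[lo:hi]; fine to copy, the range is tiny.
    return sorted(a[lo:hi])[rank - 1]

def find_inplace(a, rank, lo=0, hi=None):
    """Return the rank'th-smallest (1-based) value of a[lo:hi], permuting a."""
    if hi is None:
        hi = len(a)
    while True:
        n = hi - lo
        assert 1 <= rank <= n
        if n <= 7:
            return _median_small(a, lo, hi, rank)
        # Median of median-of-7's, over the current index range only.
        medians = [_median_small(a, i, i + 7, 4)
                   for i in range(lo, hi - 6, 7)]
        median = find_inplace(medians, (len(medians) + 1) // 2)
        # Partition a[lo:hi] around the median, in place.
        i, j = lo, hi - 1
        while i <= j:
            while a[i] < median:
                i += 1
            while a[j] > median:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Loop (instead of recursing) into the side holding the answer.
        if rank <= i - lo:
            hi = i
        else:
            rank -= i - lo
            lo = i
```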
PS: If you're more interested in expected time than worst-case time,
replacing the median-of-median-of-7s business with
median = short_find([a[0], a[-1], a[n//2]], 2)
yields a Python median-finder that's usually significantly faster than the
sort method starting in the range of 512K to 1M elements.
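[Editor's note: spelled out, the PS's substitution gives the routine below -- the same recursion and partition as `find`, but with the cheap median-of-3 pivot, so the linear bound is expected-case only (worst case is quadratic). A sketch for modern Python; the name `find3` is mine.]

```python
# Expected-linear selection: Tim's `find` with the median-of-median-of-7's
# pivot replaced by the PS's median-of-3 (first, last, middle) pivot.

def find3(a, rank):
    """Return the rank'th-smallest (1-based) value in a; a is permuted."""
    n = len(a)
    assert 1 <= rank <= n
    if n <= 7:
        return sorted(a)[rank - 1]
    # Median-of-3 pivot, per the PS.
    median = sorted([a[0], a[-1], a[n // 2]])[1]
    # Partition around the pivot: a[:i] <= median <= a[i:].
    i, j = 0, n - 1
    while i <= j:
        while a[i] < median:
            i += 1
        while a[j] > median:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    if rank <= i:
        return find3(a[:i], rank)
    else:
        return find3(a[i:], rank - i)
```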
PPS: If you do care about worst-case time, the Python version can be made
significantly faster by boosting the median-of-7 gimmick to median-of-k for
larger odd values of k. Fun for the whole family: plot speedups against k
until you're all baffled <wink>.
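[Editor's note: to join the fun, the median-of-k gimmick parameterizes directly -- only the group size and the in-group rank change. A sketch of mine for modern Python; the name `find_k` is not from the post.]

```python
# Worst-case-linear selection with median-of-k pivoting for any odd k >= 3.
# k = 7 reproduces Tim's `find`; larger k trades pivot-finding work for a
# better-balanced partition.

def find_k(a, rank, k=7):
    """Return the rank'th-smallest (1-based) value in a; a is permuted."""
    assert k >= 3 and k % 2 == 1
    n = len(a)
    assert 1 <= rank <= n
    if n <= k:
        return sorted(a)[rank - 1]
    # Median of median-of-k's; mid is the in-group median rank.
    mid = (k + 1) // 2
    medians = [sorted(a[i : i + k])[mid - 1]
               for i in range(0, n - k + 1, k)]
    median = find_k(medians, (len(medians) + 1) // 2, k)
    # Partition around the pivot, exactly as in `find`.
    i, j = 0, n - 1
    while i <= j:
        while a[i] < median:
            i += 1
        while a[j] > median:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    if rank <= i:
        return find_k(a[:i], rank, k)
    else:
        return find_k(a[i:], rank - i, k)
```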