Schwartzian Transform (was Re: Sort)
Alex Martelli
aleaxit at yahoo.com
Wed Dec 6 12:04:07 EST 2000
"Edward C. Jones" <edcjones at erols.com> wrote in message
news:3A2E5F2A.9E650D50 at erols.com...
> Alex Martelli wrote:
>
> > Faster for sorting a large list is a "Schwartzian Transform" approach:
> >
> > auxlist = [ (x[2], x) for x in thelist ]
> > auxlist.sort()
> > thelist = [ x[1] for x in auxlist ]
>
> What is a "Schwartzian transform"?
Exactly what I exemplify here (and you quote the example): to sort a
sequence X,
a. generate an auxiliary sequence Y, exactly as
long as X, each item Y[i] being a sequence of
the sort fields computed from X[i] followed
by X[i] itself
a1. if a stable-sort is needed, be sure to
insert i itself in Y[i], just before X[i]
b. Y.sort()
c. X = [ x[-1] for x in Y ]
where steps b and c could, I guess, also be expressed
in English, but it would be too cumbersome & verbose:-).
There are alternative, similar ideas. For example, you
could have i at the end of Y[i] NOT followed by X[i],
then step c would become
c. X = [ X[x[-1]] for x in Y ]
The technique is named after its inventor Randall
Schwartz, who thought it up in the context of Perl
(but it works even better in Python:-). The key
idea is that, in scripting languages such as Perl
and Python, 'sort' slows down a lot if each of its
O(N log N) comparison involves interpreting script
code, rather than just the fast, built-in "lexical
comparison" that the language provides.
Hence, we'd dearly like to just call
something.sort()
rather than
something.sort(lambda a, b: complex stuff
involving a and b)
but how could an unadorned sort produce the order
we want...?
Answer: if the sortfields that are to be computed
or extracted from the original item X[i] were
*prepended* to it, appropriately. Whence, pass a
of the Transform (pass c just 'undoes' that to get
the clean list right back).
Note that the possibly - complex transformations
yielding the sortkey are performed O(N) times, not
O(N log N) times. Whence the speedup.
Example: perform a stable sort, case-independent,
over the first 20 characters of each line of the
838-lines README.txt file that comes with Python
2.0.
Naive approach...:
def compare(a,b):
keya = a[:20].lower()
keyb = b[:20].lower()
return cmp(keya, keyb)
import time
ldata = open("D:/Python20/README.txt").readlines()
lines = ldata[:]
start1 = time.clock()
lines.sort(compare)
stend1 = time.clock()
# oops, this ISN'T stable, is it...? Pretty
# hard to make Python's sort stable _without_
# a Transform... "here I come to save the day":
tlins = ldata[:]
start2 = time.clock()
aulin = [ ( x[:20].lower(), i)
for x, i in zip(tlins, range(len(tlins))) ]
aulin.sort()
tlins = [ x[-1] for x in aulin]
stend2 = time.clock()
print stend1-start1
print stend2-start2
D:\PySym>python tesor.py
0.220513718779
0.0353952707969
D:\PySym>
Note how, even on this small dataset, the Transform
version is over 6 times faster (AND satisfies the
specs about the sort being stable, a nice plus:-).
An idiom WELL worth keeping in mind...
Alex
More information about the Python-list
mailing list