An optimization tale

Stephan Diehl stephan.diehlNOSPAM at gmx.net
Sat Dec 27 11:36:46 EST 2003


A while ago, I posted a recipe about finding the common prefix of a list
of strings. The recipe itself is quite bad (I have to admit), and I didn't
know at the time that this problem was already solved in the os.path
module. The interesting part can be found in the comments, which turned
into a quest for the most efficient algorithm for this particular problem.
All of this can be found at
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252177

What surprised me most was the inefficiency of the trivial solutions
(and that the right algorithm does indeed make a difference).

If you have read this far, you might be interested in the actual
algorithms. (If you are interested in the one-liner versions, please have
a look at the recipe.) Here they are:

-- os.path.commonprefix --------------------------------------------------

def f1(m):
    "Given a list of pathnames, returns the longest common leading
component"
    if not m: return ''
    prefix = m[0]
    for item in m:
        for i in range(len(prefix)):
            if prefix[:i+1] != item[:i+1]:
                prefix = prefix[:i]
                if i == 0:
                    return ''
                break
    return prefix

---------------------------------------------------------------------------
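
For illustration, here is how f1 behaves on a small input (the example
pathnames below are mine, not from the original post):

--- usage example (illustrative) -------------------------------------------

paths = ['/usr/local/bin', '/usr/local/lib', '/usr/lib']
print f1(paths)            # -> '/usr/l'
print f1([])               # -> '' (empty input)

-----------------------------------------------------------------------------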

The problem with this algorithm is the copying of all those small
strings: every comparison builds two fresh slices. This is easily fixed
by comparing single characters instead:

--- optimized os.path.commonprefix ----------------------------------------

def f2(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    if len(m) == 1: return m[0]
    prefix = m[0]
    for i in xrange(len(prefix)):
        for item in m[1:]:
            # a shorter item ends the common prefix as well
            if i >= len(item) or prefix[i] != item[i]:
                return prefix[:i]
    return prefix

---------------------------------------------------------------------------

Now it gets interesting: it turns out that the above algorithms don't
scale well with the number of strings.

An anonymous submitter suggested the following:

---- by anonymous ---------------------------------------------------------

def f3(seq):
    if not seq: return ""
    seq.sort()                       # note: sorts the caller's list in place
    s1, s2 = seq[0], seq[-1]         # lexicographically smallest and largest
    l = min(len(s1), len(s2))
    if l == 0:
        return ""
    for i in xrange(l):
        if s1[i] != s2[i]:
            return s1[0:i]
    return s1[0:l]

---------------------------------------------------------------------------

It is just not necessary to compare all the strings in the list: it is
enough to sort it and then compare only the first and the last element.
This works because the sort is lexicographic, so any character position
at which the two endpoints agree must also agree for every string in
between. Even though 'sort' is coded in C and is therefore quite fast,
the asymptotic behaviour has changed.
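
A quick way to convince yourself (my example, not from the thread):

--- illustration (my example) -----------------------------------------------

words = ['interstellar', 'interspecies', 'interstate']
print f3(words[:])         # -> 'inters' (pass a copy; f3 sorts in place)
print f1(words)            # -> 'inters' (the naive version agrees)

-----------------------------------------------------------------------------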

Michael Dyck then pointed out that instead of using 'sort', 'min' and
'max' should be used. Tests confirm that this is faster, and in hindsight
the reason is clear: finding the minimum or the maximum is a single O(n)
pass over the list (no sorting involved), so computing both costs about
2n comparisons, while sorting costs O(n log n) comparisons.
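
A sketch of that variant (my reconstruction of the suggestion; the name
f3b is mine):

--- min/max variant (my reconstruction) -------------------------------------

def f3b(seq):
    if not seq: return ""
    s1, s2 = min(seq), max(seq)      # two linear passes; no sort, no mutation
    l = min(len(s1), len(s2))
    for i in xrange(l):
        if s1[i] != s2[i]:
            return s1[:i]
    return s1[:l]

-----------------------------------------------------------------------------

As a side benefit, this version does not reorder the caller's list the
way f3 does.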

You might have noticed that the optimization so far has been over the
number of strings. There is still another dimension to optimize in, and
that is the actual string comparison.
Raymond Hettinger suggests using a binary search:

---------------------------------------------------------------------------
def f4(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m: return ''
    a, b = min(m), max(m)
    lo, hi = 0, min(len(a), len(b))
    # invariant: a[:lo] == b[:lo]; the answer length lies in [lo, hi]
    while lo < hi:
        mid = (lo+hi)//2 + 1
        if a[lo:mid] == b[lo:mid]:
            lo = mid
        else:
            hi = mid - 1
    return a[:hi]
----------------------------------------------------------------------------
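
As a quick sanity check (the test data below is mine), all four
functions should agree:

--- sanity check (illustrative) ---------------------------------------------

data = ['/home/swen', '/home/swan', '/home/swenson']
for f in (f1, f2, f3, f4):
    assert f(data[:]) == '/home/sw'  # pass a copy, since f3 sorts in place

-----------------------------------------------------------------------------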


To give you some idea of the running times (in seconds):

      f1          f3          f4      # of strings

   0.131058    0.021471    0.012050         2
   0.214896    0.041648    0.012963         4
   0.401236    0.020444    0.014707         8
   0.841738    0.026415    0.018589        16
   1.670606    0.039348    0.029020        32
   3.184446    0.065657    0.044247        64
   6.257635    0.123510    0.072568       128


Every function was run 200 times per measurement.
Furthermore, the test set consists of only two distinct strings, so the
binary-search part of Raymond's solution enters only as a constant
factor.
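
Measurements of this kind can be reproduced with the timeit module (new
in Python 2.3). The harness below is my sketch, not the code used for
the numbers above, and the two test strings are assumptions:

--- timing sketch (my harness, not the original measurement code) -----------

import timeit

for n in (2, 4, 8, 16, 32, 64, 128):
    setup = ("from __main__ import f1, f3, f4\n"
             "m = ['Hello, this is a test',\n"
             "     'Hello, this is a taste'] * %d" % (n // 2))
    # time 200 runs of each function on a fresh copy of the list
    row = ['%.6f' % timeit.Timer('%s(m[:])' % name, setup).timeit(200)
           for name in ('f1', 'f3', 'f4')]
    print row, n

-----------------------------------------------------------------------------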

Anyway, the fastest solution is up to 100 times faster than the trivial
one.

Cheers

Stephan




