Comparing lists - somewhat OT, but still ...

Steven D'Aprano steve at REMOVETHIScyber.com.au
Sun Oct 16 19:28:44 EDT 2005


On Sun, 16 Oct 2005 20:28:55 +0200, Christian Stapfer wrote:

> Experiments
> (not just in computer science) are quite
> frequently botched. How do you discover
> botched experiments?

Normally by comparing them to the results of other experiments, and being
unable to reconcile the results. You may have heard the term "the
experiment was/was not replicable".

How do you discover whether your theory is correct? By comparing it to
itself, or by comparing it to experiment?

Of course you need some theoretical background in order to design your
experiment in the first place, otherwise you have no idea what you are
even looking for. And it is easy to mis-design an experiment, as your
student did, so that it is actually measuring X when you think it is
measuring Y. If you are about to argue against a naive view of the
scientific method where the scientist generates data in a mental vacuum,
don't bother, I understand that. 

But fundamentally, calculated Big O values of algorithms on their own are
of virtually zero use in choosing between actual functions. If I tell you
that algorithm X is O(f(N)), what have I told you about it?

Have I told you if it is unacceptably slow for some particular data size? No.

Have I told you how much work it will take to implement it? No.

Have I told you how fast my implementation is? No.

Have I told you that it is faster than some other algorithm Y? No.

The only thing I have told you is that in some rough and ready fashion, if
I increase the size of my data N, the amount of work done will increase
very approximately like f(N).

This disconnect between what Big O *actually* means and how you are
recommending we use it makes a REAL PRACTICAL DIFFERENCE, and not in a good
way.

Here is a real problem I had to solve some time ago. Without using
regular expressions, I needed to find the position of a target
string in a larger string, but there were multiple targets. I
needed to find the first one.

I ended up using something like:

(untested, and from memory)

def findmany(s, *targets):
    # Track (offset, target) of the earliest match seen so far;
    # len(s)+1 is a sentinel bigger than any real offset.
    minoffset = (len(s)+1, "")
    for t in targets:
        p = s.find(t)
        if p != -1 and p < minoffset[0]:
            minoffset = (p, t)
    # Offset of the first-occurring target (len(s)+1 if none found).
    return minoffset[0]
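
Called something like this (my example strings, not the original data), it
returns the offset of whichever target occurs first:

>>> findmany("the quick brown fox", "brown", "quick")
4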

You would look at that, realise that s.find() is O(N) or even O(N**2) -- I
forget which -- and dismiss this as Shlemiel the Painter's algorithm which
is O(N**2) or worse.

http://www.joelonsoftware.com/articles/fog0000000319.html

And you would be right in your theory -- but wrong in your conclusion.
Because then you would do what I did, which is spend hours or days writing
a "more efficient" O(N) searching function in pure Python, which ended up
being 100 times slower searching for a SINGLE target string than my
Shlemiel algorithm was searching for twenty targets. The speed benefit I
got from pushing the character matching from Python to C was so immense
that I couldn't beat it no matter how "efficient" the algorithm was.
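
(For the curious: the replacement was a single left-to-right pass over the
string, checking every target at every position. The code below is a
reconstruction in that spirit, not the original, and the details are mine;
the point is only that the per-position work happens in Python rather
than in C.)

def findmany_scan(s, *targets):
    # One pass over s: at each position, see whether any target starts
    # there. The first position with a match wins.
    for i in range(len(s)):
        for t in targets:
            if s[i:i+len(t)] == t:
                return i
    return -1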

If I had bothered to actually profile my original code, I would have
discovered that for any data I cared about, it worked not just acceptably
fast, but blindingly fast. Why would I care that if I had a terabyte of
data to search for a million targets, it would scale badly? I was
searching text strings of less than a megabyte, for five or six targets.
The Big O analysis I did was completely, 100% correct, and completely,
100% useless. Not just useless in that it didn't help me, but it actually
hindered me, leading me to waste a day's work needlessly looking for a
"better algorithm".



-- 
Steven.



