sjmachin at lexicon.net
Sat Jan 8 21:25:56 EST 2005
> On 4 Jan 2005 14:33:34 -0800, "John Machin" <sjmachin at lexicon.net> wrote:
> >(b) Fast forwarding 30+ years, let's look at the dictionary method,
> >assuming you have enough memory to hold all your data at once:
> >Step 1: read the "left" table; for each row, if english not in
> >mydict, then do mydict[english] = MyObject(). In any case, attach
> >the row's data to mydict[english].
> >Step 2: same for the "right" table.
> >Step 3: for english, obj in mydict.iteritems(): process(english, obj)
> >As your datasets are stored in MS Excel spreadsheets, N < 64K, so
> >whether your solution is O(N) or O(N*log(N)) doesn't matter too
> >much. You are however correct to avoid O(N**2) solutions.
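The dictionary method quoted above can be sketched as below. This is a minimal illustration, not the poster's actual code: the class and function names are mine, and it uses Python 3's dict.items() where the quote (from the Python 2 era) says iteritems().

```python
class MyObject:
    """Holds the data seen for one English phrase in each table."""
    def __init__(self):
        self.left = None
        self.right = None

def dict_join(left_rows, right_rows, process):
    """Join two (english, data) row sequences via a dict: O(N) overall."""
    mydict = {}
    # Step 1: read the "left" table
    for english, data in left_rows:
        if english not in mydict:
            mydict[english] = MyObject()
        mydict[english].left = data
    # Step 2: same for the "right" table
    for english, data in right_rows:
        if english not in mydict:
            mydict[english] = MyObject()
        mydict[english].right = data
    # Step 3: process each phrase exactly once
    return [process(english, obj) for english, obj in mydict.items()]
```

Each row is touched once, so the whole join is linear regardless of how the rows are ordered.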
> Following the advice of two posters here (thanks) I have written two
> versions of the same program, and both of them work, but the
> difference in speed is drastic, about 6 seconds vs 190 seconds
> for about 15000 processed records, taken from 2 lists of phrases.
> I've read "Python Performance Tips" at
> ...but still don't understand why the difference is so big.
> Both versions use local variables, etc. Both have their
> lists initially sorted. Both essentially use a loop with
> conditional for comparison,
> then process the record in the
> same way.
"process the record in the same way"??? That's an interesting use of
> The overhead of the second version is that it also
> uses the cmp() function and two additional integer
> variables - that should not slow the program _so much_.
> I have measured the run of snippet 2 with time checkpoints
> written to a log (write time delta to log every 200 records),
> even made a graph of the time deltas in a spreadsheet, and in fact
> snippet 2, after an initial slowdown, looks exactly linear,
> like that:
> ^ (time)
> | /-----------
> | /
> ---------------> (# of records written)
> So yes, it would scale to big files.
On your empirical evidence, as presented. However, do read on ...
>However, why is it so
> frigging slow?!
Mainly, because you are (unnecessarily) deleting the first item of a
list. This requires copying the remaining items. It is O(N), not O(1).
You are doing this O(N) times, so the overall result is O(N**2). Your
graph has no obvious explanation; after how many cycles does the speed
settle down to the linear rate?
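The cost of deleting the first item can be seen directly. A minimal sketch (function names are mine, not from the original snippets) contrasting the O(N**2) pattern with an O(N) index walk over the same list:

```python
def drain_by_del(items):
    """Consume a list from the front with del items[0].
    Each delete copies all remaining items: O(N) per step, O(N**2) total."""
    items = list(items)   # work on a copy
    out = []
    while items:
        out.append(items[0])
        del items[0]      # the expensive part: shifts every remaining item
    return out

def drain_by_index(items):
    """Consume the same list by advancing an index: O(1) per step, O(N) total."""
    out = []
    i = 0
    while i < len(items):
        out.append(items[i])
        i += 1
    return out
```

Both produce the same output; only the second scales. (collections.deque, whose popleft() is O(1), is another way out if you genuinely need to consume from the front.)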
Secondly, you are calling cmp() up to THREE times when once is enough.
Didn't it occur to you that your last elif needed an else to finish it
off, and the only possible action for the else suite was "assert
cmp_result > 0"?
It would appear after reading your "snippet 2" a couple of times that
you are trying to implement the old 3-tape update method.
It would also appear that you are assuming/hoping that there is never
more than one instance of a phrase in either list.
You need something a little more like the following.
Note that in your snippet 2 it was not very obvious what you want to do
in the case where a phrase is in "new" but not in "old", and vice versa
-- under one circumstance (you haven't met "end of file") you do
nothing, but in the other circumstance you do something but seem to
have not only a polarity problem but also a copy-paste-edit problem. In
the following code I have abstracted the real requirements as comments.
!o = n = 0
!lenold = len(oldl)
!lennew = len(newl)
!while o < lenold and n < lennew:
!    cmp_result = cmp(oldl[o]['English'], newl[n]['English'])
!    if cmp_result == 0:
!        # Eng phrase is in both "new" and "old"
!        cm.writerow(matchpol(oldl[o], newl[n]))
!        o += 1
!        n += 1
!    elif cmp_result < 0:
!        # Eng phrase is in "old", not in "new"
!        o += 1
!    else:
!        assert cmp_result > 0 # :-)
!        # Eng phrase is in "new", not in "old"
!        n += 1
!while o < lenold:
!    # EOF on new, some old remain
!    o += 1
!while n < lennew:
!    # EOF on old, some new remain
!    n += 1
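For the record, cmp() exists only in Python 2; the same 3-tape merge reads just as well with plain comparisons. A self-contained sketch, with the per-case actions abstracted into callbacks -- the callback names and the dict layout are illustrative, not from the original post:

```python
def merge_join(oldl, newl, match, old_only, new_only):
    """Sorted-merge ("3-tape update") over two lists of dicts,
    each sorted by its 'English' key. Calls one callback per record."""
    o = n = 0
    while o < len(oldl) and n < len(newl):
        ko = oldl[o]['English']
        kn = newl[n]['English']
        if ko == kn:            # phrase is in both "old" and "new"
            match(oldl[o], newl[n])
            o += 1
            n += 1
        elif ko < kn:           # phrase is in "old", not in "new"
            old_only(oldl[o])
            o += 1
        else:                   # phrase is in "new", not in "old"
            new_only(newl[n])
            n += 1
    while o < len(oldl):        # EOF on new, some old remain
        old_only(oldl[o])
        o += 1
    while n < len(newl):        # EOF on old, some new remain
        new_only(newl[n])
        n += 1
```

Each index only ever advances, so the merge is O(N) after the initial sorts.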
Some general hints: Try stating your requirements clearly, to yourself
first and then to us. Try to ensure that your code is meeting those
requirements before you bother timing it. Try not to use single-letter
names -- in particular, using l (that's "L".lower(), not 1 i.e.
str(4-3)) is barf-inducing and makes people likely not to want to read
your code.