[Tutor] Looking for duplicates within a list
Steven D'Aprano
steve at pearwood.info
Sat Jun 12 04:38:54 CEST 2010
On Sat, 12 Jun 2010 05:59:45 am Hugo Arts wrote:
[...]
Putting aside all the theoretical arguments, and getting to an actual,
real-life example:
> > So somebody used an algorithm which they KNEW was inefficient and
> > slow, it had been there for years, affecting who knows how many
> > people, until eventually somebody was annoyed sufficiently to do
> > something about it. And the sad thing is that avoiding repeated
> > string concatenation is easy and there was no need for it in the
> > first place.
>
> Estimating the size of your input is sometimes hard. That's a valid
> point. OTOH, that line of code takes maybe 10 seconds to write, and
> after profiling reveals it to be slow (you *are* profiling your code,
> right?) you can easily replace it with something more appropriate. Or
> your successor can, since he'll have no problem figuring out what it
> does.
I beg to differ. The incident with httplib is actual, real-life evidence
that your cavalier attitude towards poor algorithms is bad advice. You
wave your hands and say "Oh well, it doesn't matter, because somebody
will find the slowdown and easily fix it". But in real life, it isn't
that simple, not even in Open Source software let alone proprietary
software where you might not have access to the source and the vendor
might not care. (Has Microsoft yet fixed the ridiculously slow
behaviour of the Trash Can when it has thousands of items in it?)
I was mistaken when I said Chris Withers reported this performance bug
in September. He actually first raised it with the Python-Dev team in
early August:
http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125
http://mail.python.org/pipermail/python-dev/2009-September/thread.html#91584
So this took AT LEAST a month elapsed time, with multiple people
contributing, at least two different bug reports involved (and possibly
five), and who knows how many hours debugging to solve this. Once the
issue was identified, the solution was trivial:
http://svn.python.org/view/python/trunk/Lib/httplib.py?r1=74523&r2=74655
but it took a lot of effort to identify the issue.
As Guido van Rossum himself said, this was an embarrassment. The fix was
entirely trivial, and there was no good reason to write the Shlemiel
algorithm in the first place, especially since the author apparently
knew it was a Shlemiel algorithm. Perhaps he too thought it was "more
elegant" and that somebody else would "easily" deal with the problems
that it caused.
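For anyone who hasn't met Shlemiel before, the pattern is easy to reproduce. Here is a minimal sketch (my own illustration, not the actual httplib code) of quadratic string concatenation next to the standard ''.join fix:

```python
# Shlemiel the painter: each += may copy the entire string built so
# far, so total work grows as O(N**2) in the number of chunks.
def concat_shlemiel(chunks):
    result = ""
    for chunk in chunks:
        result += chunk
    return result

# The standard fix: collect the pieces and join once, O(N) total.
def concat_join(chunks):
    return "".join(chunks)
```

(CPython sometimes optimizes += on strings in place, but that is an implementation detail that depends on realloc behaviour, which is exactly why the httplib bug only bit some platforms. Don't rely on it.)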
The poorly-performing httplib code has been in at least three major
releases of CPython (2.4 to 2.6) and possibly more, but apparently due
to differences in realloc it only affected Windows users. So once
you've run into the bug, your *second* problem is that other people
say "It works for me, it must be your network".
It is not true that "he'll have no problem figuring out what it does".
In code of any complexity, you can run into multiple problems and
wild-goose-chases. This is a good example. Read the thread and see how
many different ideas people had: they blamed the proxy, they blamed a
pre-existing bug in sockets, they blamed buffer sizes. Nobody publicly
blamed "third-party drivers", but I bet you somebody was thinking it.
I think you have a very rosy view of how easy it is to debug performance
issues. Debugging is hard. Debugging performance problems is ten times
harder.
Avoiding bad algorithms isn't premature optimization, it is thinking
ahead. There's a fine line between them, I agree, but with the risks of
hard-to-debug performance degradation caused by O(N**2) algorithms you
should have a *seriously* compelling reason for using one, and not just
because it's "good enough for small N" or "it's a one-liner".
On modern desktop or laptop hardware, I would expect small N to mean
multiple hundreds or even thousands, not a dozen or two. This isn't
1985 when desktop computers couldn't count past 32768 and 1000 items
was a lot. Today, large N is tens of millions. Huge is hundreds of
millions, at which point I start worrying about paging into virtual
memory more than the algorithm, because memory paging is an O(N**2) (or
worse!) algorithm too.
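To put rough numbers on that, here is a sketch using timeit to watch the cost of calling mylist.count() per element (the O(N**2) approach under discussion) as N grows. The absolute figures will vary by machine; the ratio is the point — a 10x jump in N should cost roughly 100x the time:

```python
import timeit

def dups_quadratic(mylist):
    # count() scans the whole list, so calling it once (or twice)
    # per element makes the whole thing O(N**2)
    return [(x, mylist.count(x)) for x in set(mylist)
            if mylist.count(x) > 1]

for n in (100, 1000):
    data = list(range(n)) * 2  # every item duplicated once
    t = timeit.timeit(lambda: dups_quadratic(data), number=5)
    print(n, round(t, 4))
```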
[...]
> That wasn't the point I was trying to make. My point was that speed
> is not always the primary concern, and if it is, you shouldn't be
> writing code in python anyway, since it is always possible to write a
> program in C that performs better.
It is noteworthy that the PyPy people use JIT compilation of Python code
beat the pants off CPython speed, and their audacious aim is to be
*faster* than C, at least for some things. It is possible to write
Python programs that run faster than the equivalent program written in
optimized C.
http://news.ycombinator.com/item?id=1379382
> Alan's code makes a trade-off between performance and readability.
> I'd call it the easiest to read solution so far.
Really? I think its readability is pretty poor, and nowhere near as
clear as an explicit loop. Especially for a beginner who might not even
know what a generator expression is. There's too much stuff in one
line, making it hard to read. It does the same work multiple times
instead of once. Instead of the one-liner:
counts = set((item, mylist.count(item)) for item in mylist
             if mylist.count(item) > 1)
I think a two-liner is *far* clearer:
counts = ((item, mylist.count(item)) for item in set(mylist))
counts = [t for t in counts if t[1] > 1]
It still has the awful O(N**2) performance, but it is nearly 25%
faster (for N=18) by my tests, and is easier to read. And by some happy
accident, it keeps the order of the original sorted list:
>>> mylist = [1, 2, 2, 3, 3, 5, 6, 19, 21, 21, 23,
... 25, 25, 25, 26, 26, 31, 39]
>>> counts = ((item, mylist.count(item)) for item in set(mylist))
>>> counts = [t for t in counts if t[1] > 1]
>>> counts
[(2, 2), (3, 2), (21, 2), (25, 3), (26, 2)]
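And for completeness, going beyond what anyone posted in the thread: the genuinely linear solution counts everything in a single pass with a dict of counts — collections.Counter does this for you in Python 2.7 and later. In CPython 3.7+, where dicts remember insertion order, it also preserves the order of first occurrence:

```python
from collections import Counter

mylist = [1, 2, 2, 3, 3, 5, 6, 19, 21, 21, 23,
          25, 25, 25, 26, 26, 31, 39]

# One pass to count, one pass to filter: O(N) overall.
counts = [(item, n) for item, n in Counter(mylist).items() if n > 1]
print(counts)  # [(2, 2), (3, 2), (21, 2), (25, 3), (26, 2)]
```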
--
Steven D'Aprano