[Tutor] Looking for duplicates within a list
Steven D'Aprano
steve at pearwood.info
Sat Jun 12 04:38:54 CEST 2010
On Sat, 12 Jun 2010 05:59:45 am Hugo Arts wrote:
[...]
Putting aside all the theoretical arguments, and getting to an actual,
real-life example:
> > So somebody used an algorithm which they KNEW was inefficient and
> > slow, it had been there for years, affecting who knows how many
> > people, until eventually somebody was annoyed sufficiently to do
> > something about it. And the sad thing is that avoiding repeated
> > string concatenation is easy and there was no need for it in the
> > first place.
>
> Estimating the size of your input is sometimes hard. That's a valid
> point. OTOH, that line of code takes maybe 10 seconds to write, and
> after profiling reveals it to be slow (you *are* profiling your code,
> right?) you can easily replace it with something more appropriate. Or
> your successor can, since he'll have no problem figuring out what it
> does.
I beg to differ. The incident with httplib is actual, real-life evidence
that your cavalier attitude towards poor algorithms is bad advice. You
wave your hands and say "Oh well, it doesn't matter, because somebody
will find the slowdown and easily fix it". But in real life, it isn't
that simple, not even in Open Source software let alone proprietary
software where you might not have access to the source and the vendor
might not care. (Has Microsoft yet fixed the ridiculously slow
behaviour of the Trash Can when it has thousands of items in it?)
I was mistaken when I said Chris Withers reported this performance bug
in September. He actually first raised it with the Python-Dev team in
early August:
http://mail.python.org/pipermail/python-dev/2009-August/thread.html#91125
http://mail.python.org/pipermail/python-dev/2009-September/thread.html#91584
So this took AT LEAST a month elapsed time, with multiple people
contributing, at least two different bug reports involved (and possibly
five), and who knows how many hours debugging to solve this. Once the
issue was identified, the solution was trivial:
http://svn.python.org/view/python/trunk/Lib/httplib.py?r1=74523&r2=74655
but it took a lot of effort to identify the issue.
As Guido van Rossum himself said, this was an embarrassment. The fix was
entirely trivial, and there was no good reason to write the Shlemiel
algorithm in the first place, especially since the author apparently
knew it was a Shlemiel algorithm. Perhaps he too thought it was "more
elegant" and that somebody else would "easily" deal with the problems
that it caused.
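For anyone who hasn't met Shlemiel before, the pattern is easy to reproduce. Here is a minimal sketch (my own illustration, not the actual httplib code) of quadratic string concatenation next to the standard ''.join fix:

```python
# Shlemiel the painter: each += may copy the entire string built so
# far, so total work grows as O(N**2) in the number of chunks.
def concat_shlemiel(chunks):
    result = ""
    for chunk in chunks:
        result += chunk
    return result

# The standard fix: collect the pieces and join once, O(N) total.
def concat_join(chunks):
    return "".join(chunks)
```

(CPython sometimes optimizes += on strings in place, but that is an implementation detail that depends on realloc behaviour, which is exactly why the httplib bug only bit some platforms. Don't rely on it.)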
The poorly-performing httplib code has been in at least three major
releases of CPython (2.4 to 2.6) and possibly more, but apparently due
to differences in realloc it only affected Windows users. So once
you've run into the bug, your *second* problem is that other people
say "It works for me, it must be your network".
It is not true that "he'll have no problem figuring out what it does".
In code of any complexity, you can run into multiple problems and
wild-goose-chases. This is a good example. Read the thread and see how
many different ideas people had: they blamed the proxy, they blamed a
pre-existing bug in sockets, they blamed buffer sizes. Nobody publicly
blamed "third-party drivers", but I bet you somebody was thinking it.
I think you have a very rosy view of how easy it is to debug performance
issues. Debugging is hard. Debugging performance problems is ten times
harder.
Avoiding bad algorithms isn't premature optimization, it is thinking
ahead. There's a fine line between them, I agree, but with the risks of
hard-to-debug performance degradation caused by O(N**2) algorithms you
should have a *seriously* compelling reason for using one, and not just
because it's "good enough for small N" or "it's a one-liner".
On modern desktop or laptop hardware, I would expect small N to mean
multiple hundreds or even thousands, not a dozen or two. This isn't
1985 when desktop computers couldn't count past 32768 and 1000 items
was a lot. Today, large N is tens of millions. Huge is hundreds of
millions, at which point I start worrying about paging into virtual
memory more than the algorithm, because memory paging is an O(N**2) (or
worse!) algorithm too.
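To put rough numbers on that, here is a sketch using timeit to watch the cost of calling mylist.count() per element (the O(N**2) approach under discussion) as N grows. The absolute figures will vary by machine; the ratio is the point — a 10x jump in N should cost roughly 100x the time:

```python
import timeit

def dups_quadratic(mylist):
    # count() scans the whole list, so calling it once (or twice)
    # per element makes the whole thing O(N**2)
    return [(x, mylist.count(x)) for x in set(mylist)
            if mylist.count(x) > 1]

for n in (100, 1000):
    data = list(range(n)) * 2  # every item duplicated once
    t = timeit.timeit(lambda: dups_quadratic(data), number=5)
    print(n, round(t, 4))
```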
[...]
> That wasn't the point I was trying to make. My point was that speed
> is not always the primary concern, and if it is, you shouldn't be
> writing code in python anyway, since it is always possible to write a
> program in C that performs better.
It is noteworthy that the PyPy people use JIT compilation of Python code
beat the pants off CPython speed, and their audacious aim is to be
*faster* than C, at least for some things. It is possible to write
Python programs that run faster than the equivalent program written in
optimized C.
http://news.ycombinator.com/item?id=1379382
> Alan's code makes a trade-off between performance and readability.
> I'd call it the easiest to read solution so far.
Really? I think its readability is pretty poor, and nowhere near as
clear as an explicit loop. Especially for a beginner who might not even
know what a generator expression is. There's too much stuff in one
line, making it hard to read. It does the same work multiple times
instead of once. Instead of the one-liner:
counts = set((item, mylist.count(item)) for item in mylist
             if mylist.count(item) > 1)
I think a two-liner is *far* clearer:
counts = ((item, mylist.count(item)) for item in set(mylist))
counts = [t for t in counts if t[1] > 1]
It still has the awful O(N**2) performance, but it is nearly 25%
faster (for N=18) by my tests, and is easier to read. And by some happy
accident, it keeps the order of the original sorted list:
>>> mylist = [1, 2, 2, 3, 3, 5, 6, 19, 21, 21, 23,
... 25, 25, 25, 26, 26, 31, 39]
>>> counts = ((item, mylist.count(item)) for item in set(mylist))
>>> counts = [t for t in counts if t[1] > 1]
>>> counts
[(2, 2), (3, 2), (21, 2), (25, 3), (26, 2)]
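And for completeness, going beyond what anyone posted in the thread: the genuinely linear solution counts everything in a single pass with a dict of counts — collections.Counter does this for you in Python 2.7 and later. In CPython 3.7+, where dicts remember insertion order, it also preserves the order of first occurrence:

```python
from collections import Counter

mylist = [1, 2, 2, 3, 3, 5, 6, 19, 21, 21, 23,
          25, 25, 25, 26, 26, 31, 39]

# One pass to count, one pass to filter: O(N) overall.
counts = [(item, n) for item, n in Counter(mylist).items() if n > 1]
print(counts)  # [(2, 2), (3, 2), (21, 2), (25, 3), (26, 2)]
```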
--
Steven D'Aprano