[Tutor] Looking for duplicates within a list
Steven D'Aprano
steve at pearwood.info
Fri Jun 11 19:28:30 CEST 2010
On Sat, 12 Jun 2010 02:18:52 am Hugo Arts wrote:
> On 11 jun 2010, at 17:49, Steven D'Aprano <steve at pearwood.info> wrote:
> > On Sat, 12 Jun 2010 12:58:19 am Alan Gauld wrote:
> >> Have you looked at the count method of lists?
> >>
> >> Something like:
> >>
> >> counts = set(( item, mylist.count(item)) for item in mylist if
> >> mylist.count(item) > 1)
> >
> > That's a Shlemiel the Painter algorithm.
> >
> > http://www.joelonsoftware.com/articles/fog0000000319.html
>
> It is, but it's also very elegant and simple to understand.
Your idea of elegant and simple is not the same as mine. Where you see
elegance, I see unnecessary work and unneeded complexity. A generator
expression for a beginner having trouble with the basics? Calling
mylist.count(item) *twice* for every item?! How is that possibly elegant?
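For contrast, here is one way to get the same counts in a single pass.
This is just a sketch, not code from the thread (it uses
collections.Counter, which needs Python 2.7 or later, and a made-up
example list):

    from collections import Counter

    mylist = [1, 2, 2, 3, 3, 3, 4]   # illustrative data only

    # One pass over the list builds every count; no repeated
    # mylist.count(item) calls.
    counts = Counter(mylist)

    # Items that appear more than once, with their counts.
    duplicates = {item: n for item, n in counts.items() if n > 1}
    print(duplicates)   # {2: 2, 3: 3}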
> And it
> works fine on small inputs. Everything is fast for small n.
[pedantic]
Not everything. Some things are inherently slow, for example:
http://en.wikipedia.org/wiki/Busy_beaver
A five-state Busy Beaver machine takes 47,176,870 steps to return, and a
six-state Busy Beaver takes 10**21132 steps to return.
A slightly less extreme example is Ackermann's Function:
http://en.wikipedia.org/wiki/Ackermann's_function
where (for example) A(4,2) has over 19,000 digits. Calculating the
number of recursive steps needed to calculate that value is left as an
exercise.
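To make the pedantry concrete, here is the textbook recursive definition
written out in Python (a sketch only; it hits the recursion limit for
anything beyond tiny arguments):

    def ackermann(m, n):
        """Ackermann's function, straight from the recursive definition."""
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    print(ackermann(2, 3))   # 9 -- fine for tiny arguments
    # Don't try ackermann(4, 2): the result has over 19,000 digits, and
    # the naive recursion blows the stack long before it gets there.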
[/pedantic]
These are extreme examples, but the point is that there are tasks which
are hard even for small N. "Find N needles in a haystack" remains hard
even for N=1.
(And before you suggest using a magnet, it's a *plastic* needle.)
[...]
> I guess what I'm trying to say is: using code that performs bad in
> situations that won't be encountered anyway is not inherently bad.
The problem is that situations that won't be encountered often *are*
encountered, long after the coder who introduced the poorly-performing
algorithm has moved on.
Here's a real-life example, from Python itself: last September, on the
Python-Dev mailing list, Chris Withers reported a problem downloading a
file using Python's httplib module. For a file that took wget or
Internet Explorer approximately 2 seconds to download, it took Python
up to thirty MINUTES -- that's nearly a thousand times slower.
Eventually Chris, with the help of others on the list, debugged the
problem down to a Shlemiel algorithm in the httplib code. Somebody
found a comment in the _read_chunked method that said:
# XXX This accumulates chunks by repeated string concatenation,
# which is not efficient as the number or size of chunks gets big.
So somebody used an algorithm which they KNEW was inefficient and slow.
It sat there for years, affecting who knows how many people, until
eventually somebody was sufficiently annoyed to do something about it.
And the sad thing is that avoiding repeated string concatenation is
easy and there was no need for it in the first place.
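For the record, the usual fix is to collect the pieces in a list and join
them once at the end. A sketch of both versions (not the actual httplib
patch, and with a stand-in for data read off the socket):

    chunks = ["spam"] * 1000   # stand-in for chunks read from the network

    # Shlemiel version: in general each += re-copies everything
    # accumulated so far, so N chunks cost O(N**2) copying in total.
    body = ""
    for chunk in chunks:
        body += chunk

    # The usual fix: append to a list (cheap per chunk) and join once
    # at the end, for O(N) total work.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    body = "".join(parts)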
> Otherwise, we'd all still be writing everything in C.
You've missed the point. It's not the language: you can write poorly
performing code in any language, and an O(N) algorithm in a slow
language will probably out-perform an O(N**2) or O(2**N) algorithm in a
fast language.
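If you'd rather see the gap than take it on faith, here is a small timing
sketch (mine, not from the thread) comparing the two approaches within
Python itself. The quadratic version mirrors the count()-based one quoted
at the top; doubling the list roughly quadruples its time, while the
single-pass version merely doubles:

    import timeit
    from collections import Counter

    def dupes_quadratic(lst):
        # list.count() walks the whole list for every element: O(N**2).
        return set((item, lst.count(item)) for item in lst
                   if lst.count(item) > 1)

    def dupes_linear(lst):
        # Counter makes a single pass over the list: O(N).
        counts = Counter(lst)
        return {(item, n) for item, n in counts.items() if n > 1}

    for size in (500, 1000, 2000):
        data = list(range(size)) * 2   # every element appears twice
        for func in (dupes_quadratic, dupes_linear):
            t = timeit.timeit(lambda: func(data), number=1)
            print(size, func.__name__, t)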
--
Steven D'Aprano