[Tutor] Looking for duplicates within a list
Steven D'Aprano
steve at pearwood.info
Fri Jun 11 19:28:30 CEST 2010
On Sat, 12 Jun 2010 02:18:52 am Hugo Arts wrote:
> On 11 jun 2010, at 17:49, Steven D'Aprano <steve at pearwood.info> wrote:
> > On Sat, 12 Jun 2010 12:58:19 am Alan Gauld wrote:
> >> Have you looked at the count method of lists?
> >>
> >> Something like:
> >>
> >> counts = set(( item, mylist.count(item)) for item in mylist if
> >> mylist.count(item) > 1)
> >
> > That's a Shlemiel the Painter algorithm.
> >
> > http://www.joelonsoftware.com/articles/fog0000000319.html
>
> It is, but it's also very elegant and simple to understand.
Your idea of elegant and simple is not the same as mine. Where you see
elegance, I see unnecessary work and unneeded complexity. A generator
expression for a beginner having trouble with the basics? Calling
mylist.count(item) *twice* for every item?! How is that possibly elegant?
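For contrast, here is one way to get the same counts in a single pass.
This is just a sketch, not code from the thread (it uses
collections.Counter, which needs Python 2.7 or later, and a made-up
example list):

    from collections import Counter

    mylist = [1, 2, 2, 3, 3, 3, 4]   # illustrative data only

    # One pass over the list builds every count; no repeated
    # mylist.count(item) calls.
    counts = Counter(mylist)

    # Items that appear more than once, with their counts.
    duplicates = {item: n for item, n in counts.items() if n > 1}
    print(duplicates)   # {2: 2, 3: 3}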
> And it
> works fine on small inputs. Everything is fast for small n.
[pedantic]
Not everything. Some things are inherently slow, for example:
http://en.wikipedia.org/wiki/Busy_beaver
A five-state Busy Beaver machine takes 47,176,870 steps to return, and a
six-state Busy Beaver takes 10**21132 steps to return.
A slightly less extreme example is Ackermann's Function:
http://en.wikipedia.org/wiki/Ackermann's_function
where (for example) A(4,2) has over 19,000 digits. Calculating the
number of recursive steps needed to calculate that value is left as an
exercise.
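To make the pedantry concrete, here is the textbook recursive definition
written out in Python (a sketch only; it hits the recursion limit for
anything beyond tiny arguments):

    def ackermann(m, n):
        """Ackermann's function, straight from the recursive definition."""
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    print(ackermann(2, 3))   # 9 -- fine for tiny arguments
    # Don't try ackermann(4, 2): the result has over 19,000 digits, and
    # the naive recursion blows the stack long before it gets there.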
[/pedantic]
These are extreme examples, but the point is that there are tasks which
are hard even for small N. "Find N needles in a haystack" remains hard
even for N=1.
(And before you suggest using a magnet, it's a *plastic* needle.)
[...]
> I guess what I'm trying to say is: using code that performs bad in
> situations that won't be encountered anyway is not inherently bad.
The problem is that situations that won't be encountered often *are*
encountered, long after the coder who introduced the poorly-performing
algorithm has moved on.
Here's a real-life example, from Python itself: last September, on the
Python-Dev mailing list, Chris Withers reported a problem downloading a
file using Python's httplib module. For a file that took wget or
Internet Explorer approximately 2 seconds to download, it took Python
up to thirty MINUTES -- that's nearly a thousand times slower.
Eventually Chris, with the help of others on the list, debugged the
problem down to a Shlemiel algorithm in the httplib code. Somebody
found a comment in the _read_chunked method that said:
# XXX This accumulates chunks by repeated string concatenation,
# which is not efficient as the number or size of chunks gets big.
So somebody used an algorithm which they KNEW was inefficient and slow.
It sat there for years, affecting who knows how many people, until
eventually somebody was sufficiently annoyed to do something about it.
And the sad thing is that avoiding repeated string concatenation is
easy and there was no need for it in the first place.
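For the record, the usual fix is to collect the pieces in a list and join
them once at the end. A sketch of both versions (not the actual httplib
patch, and with a stand-in for data read off the socket):

    chunks = ["spam"] * 1000   # stand-in for chunks read from the network

    # Shlemiel version: in general each += re-copies everything
    # accumulated so far, so N chunks cost O(N**2) copying in total.
    body = ""
    for chunk in chunks:
        body += chunk

    # The usual fix: append to a list (cheap per chunk) and join once
    # at the end, for O(N) total work.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    body = "".join(parts)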
> Otherwise, we'd all still be writing everything in C.
You've missed the point. It's not the language: you can write poorly
performing code in any language, and an O(N) algorithm in a slow
language will probably out-perform an O(N**2) or O(2**N) algorithm in a
fast language.
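If you'd rather see the gap than take it on faith, here is a small timing
sketch (mine, not from the thread) comparing the two approaches within
Python itself. The quadratic version mirrors the count()-based one quoted
at the top; doubling the list roughly quadruples its time, while the
single-pass version merely doubles:

    import timeit
    from collections import Counter

    def dupes_quadratic(lst):
        # list.count() walks the whole list for every element: O(N**2).
        return set((item, lst.count(item)) for item in lst
                   if lst.count(item) > 1)

    def dupes_linear(lst):
        # Counter makes a single pass over the list: O(N).
        counts = Counter(lst)
        return {(item, n) for item, n in counts.items() if n > 1}

    for size in (500, 1000, 2000):
        data = list(range(size)) * 2   # every element appears twice
        for func in (dupes_quadratic, dupes_linear):
            t = timeit.timeit(lambda: func(data), number=1)
            print(size, func.__name__, t)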
--
Steven D'Aprano