[Tutor] Looking for duplicates within a list

Hugo Arts hugo.yoshi at gmail.com
Fri Jun 11 21:59:45 CEST 2010


I really shouldn't, but I'll bite just this once.

On Fri, Jun 11, 2010 at 7:28 PM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> Your idea of elegant and simple is not the same as mine. To me, I see
> unnecessary work and unneeded complexity. A generator expression for a
> beginner having trouble with the basics? Calling mylist.count(item)
> *twice* for every item?! How is that possibly elegant?
>

What I mean by simple is that that line of code is the most obvious
and direct translation of "a list of every item and how often it
appears, for every item that appears more than once." I call it
elegant because the code is even shorter than that English description
while remaining readable, possibly even to someone with little
programming experience.
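
For reference, the line was, give or take the exact names, something
along these lines (mylist being a stand-in for the input list):

    # Pair each distinct item with its count, keeping only items
    # that appear more than once. Note that count() runs twice per
    # item; that's the double call Steven objects to above.
    duplicates = [(item, mylist.count(item))
                  for item in set(mylist)
                  if mylist.count(item) > 1]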

I'm making a judgement based on readability, not performance. Code
that goes for performance in lieu of readability I'd call "clever,"
and that's not a compliment.  Yes, it's an opinion. You disagree.
That's fine.


> [pedantic]
> Not everything. Some things are inherently slow, for example:
>
> http://en.wikipedia.org/wiki/Busy_beaver
>
> A five-state Busy Beaver machine takes 47,176,870 steps to return, and a
> six-state Busy Beaver takes 10**21132 steps to return.
>
> A slightly less extreme example is Ackermann's Function:
>
> http://en.wikipedia.org/wiki/Ackermann's_function
>
> where (for example) A(4,2) has over 19,000 digits. Calculating the
> number of recursive steps needed to calculate that value is left as an
> exercise.
> [/pedantic]
>
> These are extreme examples, but the point is that there are tasks which
> are hard even for small N. "Find N needles in a haystack" remains hard
> even for N=1.
>

Now you're just arguing for the sake of being right (points for
marking it [pedantic], I suppose). The busy beaver and Ackermann
function have no bearing on this discussion whatsoever. If we're going
to argue pedantically, the 1-state busy beaver (n=1) finishes in one
step, and A(1,0) = A(0,1) = 2. Even those are fast for small n; they
just grow ridiculously quickly.
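
To put numbers on that, the textbook recursive definition runs
instantly for small arguments; it's only the growth that is
monstrous:

    def ackermann(m, n):
        # Naive recursive Ackermann function. Fine for tiny
        # arguments; hopeless (and recursion-limit-busting)
        # from around m = 4 onwards.
        if m == 0:
            return n + 1
        if n == 0:
            return ackermann(m - 1, 1)
        return ackermann(m - 1, ackermann(m, n - 1))

    print(ackermann(1, 0))  # 2
    print(ackermann(0, 1))  # 2
    print(ackermann(2, 3))  # 9, still instant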

Also (pedantically), the problem "find needles in a haystack" grows
with the size of the haystack, not the number of needles. So the more
appropriate n=1 version would be a haystack of size 1 with a needle
in it, which is quite easy.

> (And before you suggest using a magnet, it's a *plastic* needle.)

That solution is not in the spirit of the problem, and since I know
that, I wouldn't bring it up.

>
> The problem is that situations that won't be encountered often *are*
> encountered, long after the coder who introduced the poorly-performing
> algorithm has moved on.
>
> Here's a real-life example, from Python itself: last September, on the
> Python-Dev mailing list, Chris Withers reported a problem downloading a
> file using Python's httplib module. For a file that took wget or
> Internet Explorer approximately 2 seconds to download, it took Python
> up to thirty MINUTES -- that's nearly a thousand times slower.
>
> Eventually Chris, with the help of others on the list, debugged the
> problem down to a Shlemiel algorithm in the httplib code. Somebody
> found a comment in the _read_chunked method that said:
>
>   # XXX This accumulates chunks by repeated string concatenation,
>   # which is not efficient as the number or size of chunks gets big.
>
> So somebody used an algorithm which they KNEW was inefficient and slow,
> it had been there for years, affecting who knows how many people, until
> eventually somebody was annoyed sufficiently to do something about it.
> And the sad thing is that avoiding repeated string concatenation is
> easy and there was no need for it in the first place.
>

Estimating the size of your input is sometimes hard. That's a valid
point. OTOH, that line of code takes maybe 10 seconds to write, and
once profiling reveals it to be slow (you *are* profiling your code,
right?), you can easily replace it with something more appropriate.
Or your successor can, since he'll have no problem figuring out what
it does.
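
For completeness, here's the slow pattern next to the ten-second
replacement (chunks standing in for whatever iterable of strings is
being accumulated):

    # Shlemiel version: every += copies the whole string built so
    # far, so accumulating n chunks costs O(n**2) character copies.
    data = ''
    for chunk in chunks:
        data += chunk

    # The easy fix: append pieces to a list, join once at the end.
    pieces = []
    for chunk in chunks:
        pieces.append(chunk)
    data = ''.join(pieces)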

>
> You've missed the point. It's not the language, you can write poorly
> performing code in any language, and an O(N) algorithm in a slow
> language will probably out-perform an O(N**2) or O(2**N) algorithm in a
> fast language.
>

That wasn't the point I was trying to make. My point was that speed is
not always the primary concern, and if it is, you shouldn't be writing
code in python anyway, since it is always possible to write a program
in C that performs better.

Alan's code makes a trade-off between performance and readability.
I'd call it the easiest-to-read solution so far. That's a trade-off that
may or may not be appropriate. My point is that you shouldn't dismiss
the code just because the algorithm sucks. You should only dismiss it
because the algorithm sucks *and* your code will have to handle large
inputs.

Hugo

