finding items that occur more than once in a list

Arnaud Delobelle arnodel at googlemail.com
Thu Mar 20 10:43:26 CET 2008


On Mar 19, 3:08 am, Bryan Olson <fakeaddr... at nowhere.org> wrote:
> Arnaud Delobelle wrote:
> > Ninereeds wrote:
> >> Hrvoje Niksic wrote:
> >>> This doesn't apply to Python, which implements dict storage as an
> >>> open-addressed table and automatically (and exponentially) grows the
> >>> table when the number of entries approaches 2/3 of the table size.
> >>> Assuming a good hash function, filling the dict should yield amortized
> >>> constant time for individual additions.
>
> > Isn't this average constant time rather than amortized?
>
> This is expected amortized constant time. Is that really the same
> thing as average constant time? Hmmm... that's subtle.

I am not sure what the difference between expected amortized time
complexity and average time complexity is (I know that they are
defined as different notions but I don't know whether they reduce to
the same thing or not).  Anyway both are average case complexities and
AFAIK worst case time complexity of insertion / lookup in a hashtable
is still O(n).

> >> OK. I obviously need to look up open-addressed tables. I thought this
> >> was just, in effect, implicit linked listing - ie it still needs a
> >> linear search to handle collisions, it just avoids the need for
> >> explicitly stored link fields. Perhaps I'm mistaken.
>
> The amortized doubling breaks that.
>
> > I don't think you are mistaken, but if I'm wrong I'd be grateful for a
> > link to further details.
>
> Arnaud Delobelle offered a good Wikipedia link, and for more background
> look up "amortized analysis.

Hrvoje Niksic provided the link :).  I still think two unrelated
things are being compared in this thread when people say that method A
(using dictionaries / sets) is O(n) and method B (sorting the list)
O(nlogn).

Isn't it the case that:

         | Worst case | Average case
---------|------------|-------------
Method A | O(n^2)     | O(n)
Method B | O(nlogn)   | O(nlogn)

Which method is the best would then be determined by the distribution
of the hash values of your data, and whether you want to have a
guarantee the method will have a less-than-quadratic behaviour.

--
Arnaud




More information about the Python-list mailing list