Cache memory and its effect on list searching
steve+python at pearwood.info
Sat Dec 17 01:12:03 EST 2016
On Sat, 17 Dec 2016 03:55 pm, Chris Angelico wrote:
> More than that: the lists are searched in linear time, the binary
> search runs in logarithmic time, but the set lookup is constant time.
> Doesn't matter how big your set is, you can test for membership with
> one hash lookup.
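For concreteness, here is a rough sketch of the three membership tests being
compared (variable names are mine, and real timings will depend on your data):

```python
import bisect

haystack = list(range(1_000_000))   # sorted list
haystack_set = set(haystack)
needle = 999_999

# Linear scan over the list: O(n) comparisons in the worst case.
found_linear = needle in haystack

# Binary search on the sorted list: O(log n) comparisons.
i = bisect.bisect_left(haystack, needle)
found_binary = i < len(haystack) and haystack[i] == needle

# Set membership: a single hash lookup, O(1) on average.
found_set = needle in haystack_set

print(found_linear, found_binary, found_set)
```

All three report the same answer; they differ only in how much work they do
as the collection grows.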
To be pedantic: it is constant time on average.
Best case is one hash and one equality test. Worst case is when all the set
elements have colliding hashes, in which case lookup degenerates to a linear
search.
There is a class of denial of service attacks where the attacker can specify
the keys in a dict in such a way that lookups collide, and can push new
colliding items into the dictionary (or set) faster than Python can perform
the lookups. That's why Python now has hash randomisation:
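You can see the randomisation directly by hashing the same string in fresh
interpreters started with different seeds (a sketch; `string_hash` is my own
helper, and this only exercises strings, since small ints are not randomised):

```python
import os
import subprocess
import sys

def string_hash(seed):
    """hash('abc') in a fresh interpreter with a fixed PYTHONHASHSEED."""
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        env={**os.environ, "PYTHONHASHSEED": str(seed)},
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# The same seed reproduces the same hash; different seeds (almost
# certainly) do not, so an attacker can't precompute colliding keys.
print(string_hash(1) == string_hash(1))
print(string_hash(1) != string_hash(2))
```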
But outside of contrived or hostile examples, where elements of the set are
designed to collide, typically you would expect hash collisions to be rare,
and long chains of colliding elements even rarer. For random elements,
most will require only a single hash and a single equality test, a small
number might require two tests, three would be even rarer, and so forth.
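The degenerate case is easy to demonstrate deliberately with a class whose
instances all hash to the same value (a contrived example, of course, which
is exactly the point):

```python
class Collider:
    """Every instance hashes identically, forcing worst-case chains."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42   # all instances collide
    def __eq__(self, other):
        return isinstance(other, Collider) and self.value == other.value

# Every insertion and lookup must walk the chain of colliding entries,
# so this "set" behaves like a linear search despite being a set.
s = {Collider(i) for i in range(1000)}
print(Collider(999) in s)   # True, but only after many equality tests
```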
So strictly speaking, and I realise I'm being exceedingly pedantic here, a
*sufficiently large* dict or set MUST have colliding elements: hash() returns
a fixed-width integer, so by the pigeonhole principle, once you have more
distinct elements than possible hash values, some of them must collide. How
large is "sufficiently large"? It's at least 2**31, more than two billion, so
while you are right that *in practice* set/dict lookups require only a
single hash + equality, in principle (and sometimes in practice too)
collisions can be significant.
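In fact you don't even need a large collection to see a genuine collision in
CPython (this is an implementation detail, not a language guarantee):

```python
# CPython reserves -1 as an error code for C-level hash functions, so
# hash(-1) is quietly remapped to -2 -- a real collision with hash(-2).
print(hash(-1) == hash(-2))   # True in CPython
print(-1 == -2)               # False: distinct keys, identical hashes

# Both keys still work as dict keys: the equality test after the hash
# lookup is what disambiguates the colliding entries.
d = {-1: "minus one", -2: "minus two"}
print(d[-1], d[-2])
```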
Nevertheless, I mention this only for completeness. In practice, you almost
never need to care about hash collisions except in the most pathological
cases.
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.