It's true that the results you found aren't consistent with O(1), but as I understand it, Python dicts are O(1) amortized ("on average over the long term"). Sometimes dicts resize, which is not a constant time operation, and sometimes the dict has to walk a short linked list, which depends on the proportion of hashes that lead to a collisions.
Thanks for the insight. I didn't know such a process is considered O(1). I agree it is fair because in practice it seems to work very well, but the "collisions" part can turn it into O(N) as demonstrated (in a crude way) by the script below. Therefore the O(1) classification is a bit misleading. My other script was simplified. I did look at more data points. The curve is amazingly flat but not a constant function. It is frustrating that simple requests like wanting a useful True or False, that costs nothing, instead of a useless None, tend to digress like this ... import os class normal(object): def __init__(O, value): O.value = value def __hash__(O): return O.value.__hash__() class bad(object): def __init__(O, value): O.value = value def __hash__(O): return 0 for N in [1000, 10000]: t0 = os.times()[0] s = set([normal(i) for i in xrange(N)]) print os.times()[0]-t0 assert len(s) == N t0 = os.times()[0] s = set([bad(i) for i in xrange(N)]) print os.times()[0]-t0 assert len(s) == N