Is 100,000 entries big for a dictionary?

rturpin at my-deja.com
Mon Jan 1 19:50:10 EST 2001


In article <mailman.978392642.24494.python-list at python.org>,
  "Tim Peters" <tim.one at home.com> wrote:
> .. Python's hash table structure is utterly vanilla,
> a contiguous vector of 2**N structs -- it's just that
> sometimes a brand new vector 2x bigger .. may get
> allocated and everything moves to the new vector. ..

An alternative method is, at each split, to allocate
a new vector the same size as the sum of all previous
vectors, and to move only the half of the existing
entries (those whose next hash bit is 1) into the new
vector. This saves half the cost of reinsertion, since
the other half never moves, but it requires a table of
offsets for the sequence of vectors. Since the vectors
grow exponentially, this table stays small.
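
Sketched in Python, that scheme might look like the
following. This is a toy, not how Python's dict actually
works: it uses chaining for collisions, and the class
name, base size, and load threshold are all arbitrary
choices of the sketch.

class SplitVectorDict:
    BASE = 8  # size of the first vector; an arbitrary choice

    def __init__(self):
        self.vectors = [[[] for _ in range(self.BASE)]]  # lists of chains
        self.offsets = [0]           # logical start of each vector
        self.size = self.BASE        # total logical slots (a power of 2)
        self.count = 0

    def _slot(self, logical):
        # Map a logical address to a chain via the small offset table;
        # since vector sizes double, the vector index is a bit_length.
        j = (logical // self.BASE).bit_length()
        return self.vectors[j][logical - self.offsets[j]]

    def __setitem__(self, key, value):
        chain = self._slot(hash(key) & (self.size - 1))
        for pair in chain:
            if pair[0] == key:
                pair[1] = value
                return
        chain.append([key, value])
        self.count += 1
        if self.count > 2 * self.size:   # arbitrary load threshold
            self._split()

    def __getitem__(self, key):
        for k, v in self._slot(hash(key) & (self.size - 1)):
            if k == key:
                return v
        raise KeyError(key)

    def _split(self):
        # Double the address space: allocate one new vector equal in
        # size to the sum of all previous, and move only those entries
        # whose next hash bit is 1.  Entries that stay never move.
        bit = self.size                      # the next hash bit
        new = [[] for _ in range(self.size)]
        for logical in range(self.size):
            chain = self._slot(logical)
            movers = [p for p in chain if hash(p[0]) & bit]
            if movers:
                chain[:] = [p for p in chain if not (hash(p[0]) & bit)]
                new[logical] = movers        # same relative position
        self.offsets.append(self.size)
        self.vectors.append(new)
        self.size *= 2

On a split, only the entries whose next hash bit is 1 are
touched; everything else stays where it is, at the price
of one extra indirection through the offset table on every
lookup.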

> .. All references to extensible (more often "extendible")
> hashing I've seen have a much fancier scheme in mind,
> with a directory (or index) table pointing to buckets
> that can split independently; the thrust of such schemes
> is to ensure that an item can be found with no more than
> two disk accesses (incl. one for the directory table, if
> it's paged out).  I believe that's the std meaning, and
> it doesn't have anything in common with Python's scheme
> beyond hash functions and powers of 2 ..

As I view it, the core ideas of extensible hashing are
(a) that a single hash function can index different
sizes of hash space by using the first m high-order bits,
and (b) when this is done, the memory boundaries for the
different spaces align in a way that supports a variety
of strategies for merge and split. This can be done for
in-memory tables, as I described above, or for indexed
tables on mass storage. In the latter case, the index
stores the mapping of address space to disk, possibly
with collapse of some parts of the hash space, i.e.,
2^n pointers may reference the same disk bucket.
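
For comparison, here is a minimal in-memory sketch of the
directory scheme Tim describes, under the assumptions that
hash() is truncated to 32 bits so "high-order bits" are
well defined and that buckets hold four items; the names
are mine, not from any real implementation.

HASH_BITS = 32

class Bucket:
    def __init__(self, depth, capacity=4):
        self.depth = depth        # local depth: leading bits this bucket distinguishes
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _h(self, key):
        # Truncate Python's hash() to a fixed width (an assumption
        # of this sketch) so high-order bits are well defined.
        return hash(key) & ((1 << HASH_BITS) - 1)

    def _dir_index(self, key):
        # Index the directory by the first global_depth high-order bits.
        return self._h(key) >> (HASH_BITS - self.global_depth)

    def get(self, key):
        return self.directory[self._dir_index(key)].items[key]

    def put(self, key, value):
        bucket = self.directory[self._dir_index(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.put(key, value)      # retry after the split

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # The directory must double before a max-depth bucket can
            # split; each old pointer is duplicated adjacently.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth, bucket.capacity)
        # Repartition this bucket's items on the newly significant bit.
        shift = HASH_BITS - bucket.depth
        old, bucket.items = bucket.items, {}
        for k, v in old.items():
            dest = sibling if (self._h(k) >> shift) & 1 else bucket
            dest.items[k] = v
        # Redirect the half of this bucket's directory range whose
        # new bit is 1 to the sibling; only this bucket splits.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (self.global_depth - bucket.depth)) & 1:
                self.directory[i] = sibling

Note that 2^(global_depth - local_depth) directory entries
point at the same bucket, which is the in-memory analogue
of 2^n index pointers referencing one disk bucket.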

Russell

