On Dec 11, 2012, at 1:13 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

On Dec 10, 2012, at 2:48 AM, Christian Heimes <christian@python.org>
wrote:

On the other hand every lookup and collision checks needs an
additional multiplication, addition and pointer dereferencing:

entry = entries_baseaddr + sizeof(PyDictKeyEntry) * idx

Currently, the dict implementation allows alternative lookup
functions based on whether the keys are all strings.
The choice of lookup function is stored in a function pointer.
That lets each lookup use the currently active lookup
function without having to make any computations or branches.

An indirect function call is technically a branch, as seen from the CPU
(and not necessarily a very predictable one, although recent Intel
CPUs are said to be quite good at that).

FWIW, we already have an indirection to the lookup function.

I would piggyback on that, so no new indirections are required.

My plan now is to apply the space compaction idea to sets.

That code is less complex than dicts, and set operations

stand to benefit the most from improved iteration speed.

The steps would be:

* Create a good set of benchmarks for set operations

for both size and speed.

* Start with the simplest version of the idea: separate the

entries table from the hash table. Keep the hash table at

Py_ssize_t, and pre-allocate the entry array to two-thirds the size

of the hash table. This should give about a 25% space savings

and speed-up iteration for all the set-to-set operations.

* If that all works out, I want to trim the entry table for frozensefs

so that the entry table has no over-allocations. This should

give a much larger space savings.

* Next, I want to experiment with the idea of using char/short/long

sizes for the hash table. Since there is already an existing

lookup indirection, this can be done with no additional overhead.

Small sets will get the most benefit for the space savings and

the cache performance for hash lookups should improve nicely

(for a hash table of size 32 could fit in a single cache line).

At each step, I'll run the benchmarks to make sure the expected

speed and space benefits are being achieved.

As a side-effect, sets without deletions will retain their insertion

order. If this is of concern, it would be no problem to implement

Antoine's idea of scrambling the entries during iteration.

Raymond

P.S. I've gotten a lot of suggestions for improvements to the

proof-of-concept code. Thank you for that. The latest version

is at: http://code.activestate.com/recipes/578375/

In that code, entries are stored in regular Python lists

and inherit their over-allocation characteristics (about

12.5% overallocated for large lists). There are many

other possible allocation strategies with their own

respective speed/space trade-offs.