I'm working on compact and ordered set implementation. It has internal data structure similar to new dict from Python 3.6.
It is still work in progress. Comments, tests, and documents should be updated. But it passes existing tests excluding test_sys and test_gdb (both tests checks implementation detail) https://github.com/methane/cpython/pull/16
Before completing this work, I want to evaluate it. Following is my current thoughts about the compact ordered set.
## Preserving insertion order
Order is not fundamental for set. There are no order in set in the math world.
But it is convenient sometime in real world. For example, it makes doctest easy. When writing set to logs, we can use "grep" command if print order is stable. pyc is stable without PYTHONHASHSEED=0 hack.
Additionally, consistency with dict is desirable. It removes one pitfall for new Python users. "Remove duplicated items from list" idiom become `list(set(duplicated))` from `list(dict.fromkeys(duplicated))`.
## Memory efficiency
Hash table has dilemma. To reduce collision rate, hash table should be sparse. But it wastes memory.
Since current set is optimized for both of hit and miss cases, it is more sparse than dict. (It is bit surprise that set typically uses more memory than same size dict!)
New implementation partially solve this dilemma. It has sparse "index table" which items are small (1byte when table size <= 256, 2bytes when table size <= 65536), and dense entry table (each item has key and hash, which is 16bytes on 64bit system).
I use 1/2 for capacity rate for now. So new implementation is memory efficient when len(s) <= 32768. But memory efficiency is roughly equal to current implementation when 32768 < len(s) <= 2**31, and worse than current implementation when len(s) > 2**31.
Here is quick test about memory usage. https://gist.github.com/methane/98b7f43fc00a84964f66241695112e91
$ ./python -m perf compare_to master.json oset2.json -G --min-speed=2 Slower (3): - unpickle_list: 8.48 us +- 0.09 us -> 12.8 us +- 0.5 us: 1.52x slower (+52%) - unpickle: 29.6 us +- 2.5 us -> 44.1 us +- 2.5 us: 1.49x slower (+49%) - regex_dna: 448 ms +- 3 ms -> 462 ms +- 2 ms: 1.03x slower (+3%)
Faster (4): - meteor_contest: 189 ms +- 1 ms -> 165 ms +- 1 ms: 1.15x faster (-13%) - telco: 15.8 ms +- 0.2 ms -> 15.3 ms +- 0.2 ms: 1.03x faster (-3%) - django_template: 266 ms +- 6 ms -> 259 ms +- 3 ms: 1.03x faster (-3%) - unpickle_pure_python: 818 us +- 6 us -> 801 us +- 9 us: 1.02x faster (-2%)
Benchmark hidden because not significant (49)
unpickle and unpickle_list shows massive slowdown. I suspect this slowdown is not caused from set change. Linux perf shows many pagefault is happened in pymalloc_malloc. I think memory usage changes hit weak point of pymalloc accidentally. I will try to investigate it.
On the other hand, meteor_contest shows 13% speedup. It uses set. Other doesn't show significant performance changes.
I need to write more benchmarks for various set workload. I expect new set is faster on simple creation, iteration and destruction. Especially, sequential iteration and deletion will reduce cache misses. (e.g. https://bugs.python.org/issue32846 )
On the other hand, new implementation will be slow on complex (heavy random add & del) case.
Any comments are welcome. And any benchmark for set workloads are very welcome.
Regards, -- INADA Naoki email@example.com