
Hello, folks.

I'm working on a compact and ordered set implementation. Its internal data structure is similar to the new dict from Python 3.6.

It is still a work in progress: comments, tests, and documentation still need to be updated. But it passes the existing tests, excluding test_sys and test_gdb (both of which check implementation details):

https://github.com/methane/cpython/pull/16

Before completing this work, I want to evaluate it. The following are my current thoughts about the compact ordered set.

## Preserving insertion order

Order is not fundamental for a set; there is no order in a set in the mathematical sense. But it is sometimes convenient in the real world. For example, it makes doctests easier to write. When writing sets to logs, we can use "grep" if the print order is stable. pyc files are stable without the PYTHONHASHSEED=0 hack.

Additionally, consistency with dict is desirable. It removes one pitfall for new Python users: the "remove duplicated items from a list" idiom becomes `list(set(duplicated))` instead of `list(dict.fromkeys(duplicated))`.

## Memory efficiency

A hash table has a dilemma: to reduce the collision rate, the table should be sparse, but that wastes memory. Since the current set is optimized for both hit and miss cases, it is more sparse than dict. (It is a bit of a surprise that a set typically uses more memory than a dict of the same size!)

The new implementation partially solves this dilemma. It has a sparse "index table" whose items are small (1 byte when the table size <= 256, 2 bytes when the table size <= 65536), and a dense entry table (each entry holds a key and a hash, which is 16 bytes on a 64-bit system).

I use a 1/2 capacity ratio for now. So the new implementation is more memory efficient when len(s) <= 32768, roughly equal to the current implementation when 32768 < len(s) <= 2**31, and worse than the current implementation when len(s) > 2**31.

Here is a quick test of memory usage:
https://gist.github.com/methane/98b7f43fc00a84964f66241695112e91

# Performance

pyperformance result:

$ ./python -m perf compare_to master.json oset2.json -G --min-speed=2

Slower (3):
- unpickle_list: 8.48 us +- 0.09 us -> 12.8 us +- 0.5 us: 1.52x slower (+52%)
- unpickle: 29.6 us +- 2.5 us -> 44.1 us +- 2.5 us: 1.49x slower (+49%)
- regex_dna: 448 ms +- 3 ms -> 462 ms +- 2 ms: 1.03x slower (+3%)

Faster (4):
- meteor_contest: 189 ms +- 1 ms -> 165 ms +- 1 ms: 1.15x faster (-13%)
- telco: 15.8 ms +- 0.2 ms -> 15.3 ms +- 0.2 ms: 1.03x faster (-3%)
- django_template: 266 ms +- 6 ms -> 259 ms +- 3 ms: 1.03x faster (-3%)
- unpickle_pure_python: 818 us +- 6 us -> 801 us +- 9 us: 1.02x faster (-2%)

Benchmark hidden because not significant (49)

unpickle and unpickle_list show a massive slowdown. I suspect this slowdown is not caused by the set change. Linux perf shows that many page faults happen in pymalloc_malloc. I think the change in memory usage accidentally hits a weak point of pymalloc. I will try to investigate it.

On the other hand, meteor_contest shows a 13% speedup. It uses sets. The other benchmarks don't show significant performance changes.

I need to write more benchmarks for various set workloads. I expect the new set to be faster for simple creation, iteration, and destruction. In particular, sequential iteration and deletion will cause fewer cache misses (e.g. https://bugs.python.org/issue32846). On the other hand, the new implementation will be slower in complex cases (heavy random add & del).

-----

Any comments are welcome, and any benchmarks for set workloads are very welcome.

Regards,
-- INADA Naoki <songofacandy@gmail.com>
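[Editorial illustration] To make the two-array layout described above concrete, here is a minimal pure-Python sketch of the idea: a sparse index table plus a dense, insertion-ordered entry array. The class name, the probing, and the resize policy are illustrative only; the real patch is C code in Objects/setobject.c, and this sketch ignores deletion, dummy entries, and the perturbed probe sequence.

FREE = -1

class CompactSetSketch:
    def __init__(self):
        # Sparse index table: small integers (1 byte per slot in the C
        # version while the table size <= 256), mostly empty.
        self.indices = [FREE] * 8
        # Dense entry table: (hash, key) pairs kept in insertion order,
        # 16 bytes per entry on a 64-bit build.
        self.entries = []

    def _slots(self, h):
        # Simplified open addressing; CPython perturbs the probe sequence.
        mask = len(self.indices) - 1
        i = h & mask
        while True:
            yield i
            i = (i + 1) & mask

    def _resize(self):
        # Rebuild only the sparse index table; the dense entries stay put.
        self.indices = [FREE] * (len(self.indices) * 2)
        for n, (h, _key) in enumerate(self.entries):
            for i in self._slots(h):
                if self.indices[i] == FREE:
                    self.indices[i] = n
                    break

    def add(self, key):
        if len(self.entries) * 2 >= len(self.indices):  # keep index table <= 1/2 full
            self._resize()
        h = hash(key)
        for i in self._slots(h):
            ix = self.indices[i]
            if ix == FREE:
                self.indices[i] = len(self.entries)
                self.entries.append((h, key))
                return
            if self.entries[ix][0] == h and self.entries[ix][1] == key:
                return  # already present

    def __contains__(self, key):
        h = hash(key)
        for i in self._slots(h):
            ix = self.indices[i]
            if ix == FREE:
                return False
            if self.entries[ix][0] == h and self.entries[ix][1] == key:
                return True

    def __iter__(self):
        # Iteration just walks the dense array, preserving insertion order.
        return (key for _h, key in self.entries)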

On Tue, Feb 26, 2019 at 12:33, INADA Naoki <songofacandy@gmail.com> wrote:
Please contact me to get access to speed.python.org server. *Maybe* your process to run benchmarks is not reliable and you are getting "noise" in results.
On the other hand, meteor_contest shows a 13% speedup. It uses sets. The other benchmarks don't show significant performance changes.
I recall that some benchmarks are unstable and depend a lot on how you run the benchmark, how Python is compiled (ex: PGO or not). IMHO it's fine if the overall performance result is "no significant change", as soon as we reduce the memory footprint. Victor -- Night gathers, and now my watch begins. It shall not end until my death.

On Wed, Feb 27, 2019 at 12:37 AM Victor Stinner <vstinner@redhat.com> wrote:
My company gives me a dedicated Linux machine with a Core(TM) i7-6700, so I don't think the problem is with my machine.

perf shows that this line causes many page faults:
https://github.com/python/cpython/blob/c606a9cbd48f69d3f4a09204c781dda986421...

This line is executed when pymalloc can't reuse an existing pool and has to use a new pool. So I suspect there is some weak point in pymalloc, and adding more hysteresis may help. But I'm not sure yet; I'll investigate it later.

If you want to reproduce it, try this commit:
https://github.com/methane/cpython/pull/16/commits/3178dc96305435c691af83515...

Ah, another interesting point: this huge slowdown happens only when bm_pickle.py is executed through pyperformance. When I run it directly, the slowdown is not so large. So I think this issue is tightly coupled with how pages are mapped.

$ ./python -m performance.benchmarks.bm_pickle --compare-to ./py-master unpickle
py-master: ..................... 27.7 us +- 1.8 us
python: ..................... 28.7 us +- 2.5 us

Mean +- std dev: [py-master] 27.7 us +- 1.8 us -> [python] 28.7 us +- 2.5 us: 1.04x slower (+4%)
From reading the bm_meteor_contest.py source, it uses frozenset heavily, so I think this is a real performance gain. Anyway, pyperformance is not perfect and doesn't cover all set workloads. I need to write more benchmarks. -- INADA Naoki <songofacandy@gmail.com>

On Tue, Feb 26, 2019 at 17:33, INADA Naoki <songofacandy@gmail.com> wrote:
My company gives me a dedicated Linux machine with a Core(TM) i7-6700, so I don't think the problem is with my machine.
Oh great :-)
You might want to try PYTHONMALLOC=malloc to force the usage of system malloc() and so disable pymalloc. You might also try jemalloc with LD_PRELOAD and PYTHONMALLOC=malloc. Not sure if it helps :-)
pyperformance runs benchmarks in a virtual environment. I don't know if that has any impact on bm_pickle. Most pyperformance benchmarks can be run outside a virtual env if the required modules are installed on the system. (bm_pickle only requires the stdlib and perf.) Victor -- Night gathers, and now my watch begins. It shall not end until my death.

Ah, another interesting point, this huge slowdown happens only when
bm_pickle.py
Bingo!

Without venv:

unpickle: Mean +- std dev: 26.9 us +- 0.0 us

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 28.78    0.000438           0      1440           read
 27.33    0.000416           1       440        25 stat
  9.72    0.000148           1       144           mmap
...
  0.79    0.000012           1        11           munmap

With venv:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.12    0.099023           2     61471           munmap
 41.87    0.072580           1     61618           mmap
  0.23    0.000395           1       465        27 stat

unpickle and unpickle_list create a massive number of same-sized objects, and then all of those objects are removed. If all pools in an arena are freed, munmap is called. I think we should keep some arenas around for reuse. On recent Linux, we may be able to use MADV_FREE instead of munmap.

It happened almost by accident. Since venv is used, many paths in the interpreter change, so memory is used differently. Let's reproduce the accident.

$ cat m2.py
import pickle, sys

LIST = pickle.dumps([[0]*10 for _ in range(10)], pickle.HIGHEST_PROTOCOL)
N = 1000
z = [[0]*10 for _ in range(N)]

if '-c' in sys.argv:
    sys._debugmallocstats()
    sys.exit()

for _ in range(100000):
    pickle.loads(LIST)

$ /usr/bin/time python3 m2.py
0.42user 0.00system 0:00.43elapsed 99%CPU (0avgtext+0avgdata 9100maxresident)k
0inputs+0outputs (0major+1139minor)pagefaults 0swaps

There are only 1139 faults; much less than 100000.

$ /usr/bin/time python3 m2.py -c
...
14 unused pools * 4096 bytes = 57,344
...

Adjust N in m2.py until it shows "0 unused pools". In my case, N=1390.

$ /usr/bin/time python3 m2.py
0.51user 0.33system 0:00.85elapsed 99%CPU (0avgtext+0avgdata 9140maxresident)k
0inputs+0outputs (0major+201149minor)pagefaults 0swaps

200000 faults! That is about two page faults per loop iteration (2 pools are used and returned).

On Wed, Feb 27, 2019 at 7:51 PM Victor Stinner <vstinner@redhat.com> wrote:
-- INADA Naoki <songofacandy@gmail.com>

Maybe pickle is inefficient in its memory management and causes a lot of memory fragmentation? It's hard to write an efficient memory allocator :-(

My notes on memory:

* "Excessive peak memory consumption by the Python parser" https://bugs.python.org/issue26415
* https://pythondev.readthedocs.io/memory.html
* https://vstinner.readthedocs.io/heap_fragmentation.html

Sometimes I would like to be able to use a separate memory allocator for one function, to avoid polluting the global allocator and so avoid "punching holes" in the global memory pools or in the heap memory. The problem is tracking the lifetime of objects: if the allocated objects live longer than the function, PyObject_Free() should be able to find the allocator used by the memory block. pymalloc is already able to check whether it manages a memory block using its address; if a block is not allocated by pymalloc, PyObject_Free() falls back to libc free().

The Python parser already uses PyArena, which is a custom memory allocator. It uses PyMem_Malloc() to allocate memory. In Python 3.7, PyMem_Malloc() uses pymalloc:
https://docs.python.org/dev/c-api/memory.html#default-memory-allocators

Victor

On Wed, Feb 27, 2019 at 12:36, INADA Naoki <songofacandy@gmail.com> wrote:
-- Night gathers, and now my watch begins. It shall not end until my death.

On Wed, Feb 27, 2019 at 9:59 PM Victor Stinner <vstinner@redhat.com> wrote:
Maybe pickle is inefficient in its memory management and causes a lot of memory fragmentation?
No, it is not related to pickle's efficiency or to memory fragmentation. This problem happens because pymalloc has no hysteresis between mapping and unmapping arenas. Any workload that creates some objects and releases them soon afterwards may be affected by this problem.

When there is no free pool, pymalloc uses mmap to create a new arena (256 KiB). pymalloc then allocates new pools (= pages) from the arena. This causes minor page faults: Linux allocates real memory for the pages and RSS increases. Then, when all objects newly created in those pools are destroyed, all pools in the arena become free and pymalloc calls munmap() right away.

unpickle is affected by this problem more than pure Python code because:

* unpickle creates Python objects very quickly, so the fault overhead is relatively large.
* Python code may create long-lived junk memory blocks (e.g. cached frame objects, freeslots, etc.), but the C pickle code doesn't create such junk. So newly allocated pools are freed very easily.

I think this issue can be avoided easily: when an arena is empty but it is the head of usable_arenas, don't call munmap for it. I confirmed m2.py can't reproduce the issue with this patch.

diff --git a/Objects/obmalloc.c b/Objects/obmalloc.c
index 1c2a32050f..a19b3aca06 100644
--- a/Objects/obmalloc.c
+++ b/Objects/obmalloc.c
@@ -1672,7 +1672,7 @@ pymalloc_free(void *ctx, void *p)
      * nfreepools.
      * 4. Else there's nothing more to do.
      */
-    if (nf == ao->ntotalpools) {
+    if (nf == ao->ntotalpools && ao != usable_arenas) {
         /* Case 1.  First unlink ao from usable_arenas.
          */
         assert(ao->prevarena == NULL ||

-- INADA Naoki <songofacandy@gmail.com>

I've looked at this as well. Some thoughts:

* Set objects have a different and conflicting optimization that works better for a broad range of use cases. In particular, there is a linear probing search step that gives excellent cache performance (multiple entries retrieved in a single cache line) and it reduces the cost of finding the next entry to a single increment (entry++). This greatly reduces the cost of collisions and makes it cheaper to verify an item is not in a set.

* The technique for compaction involves making the key/hash entry array dense and augmenting it with a sparse array of indices. This necessarily adds a layer of indirection for every probe.

* With the cache misses, branching costs, and extra layer of indirection, collisions would stop being cheap, so we would need to work to avoid them altogether. To get anything like the current performance for a collision on the first probe, I suspect we would have to lower the table density down from 60% to 25%.

* The intersection operation has an important optimization where it loops over the smaller of its two inputs. To give a guaranteed order that preserves the order of the first input, you would have to forgo this optimization, possibly crippling any existing code that depends on it.

* Maintaining order in the face of deletions adds a new workload to sets that didn't exist before. You risk thrashing the set to support a feature that hasn't been asked for and that isn't warranted mathematically (where the notion of a set is unordered).

* It takes a lot of care and planning to avoid fooling yourself with benchmarks on sets. Anything done with a small tight loop will tend to hide all branch prediction costs and cache miss costs, both of which are significant in real world uses of sets.

* For sets, we care much more about look-up performance than space. And unlike dicts, where we usually expect to find a key, sets are all about checking membership, which means they have to be balanced for the case where the key is not in the set.

* Having and preserving order is one of the least important things a set can offer (it does have some value, but it is the least important feature, one that was never contemplated by the original set PEP).

After the success of the compact dict, I can understand an almost irresistible urge to apply the same technique to sets. If it was clear that it was a win, I would have already done it long ago, even before dicts (it was much harder to get buy-in for changing the dicts). Please temper the enthusiasm with rationality and caution. The existing setobject code has been finely tuned and micro-optimized over the years, giving it excellent performance on workloads we care about. It would be easy to throw all of that away.

Raymond
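[Editorial illustration] A rough Python model of the first two points above: with the current layout, the probe loop walks the entry table itself, so a collision costs one increment; with the compact layout, every probe goes through the sparse index table first, adding one indirection (and potentially one cache miss) per step. The function names and simplifications are illustrative only, not CPython code; both assume a power-of-two table with at least one empty slot.

def lookup_linear(table, key):
    # Current set layout: the sparse table holds the entries themselves,
    # so the next candidate after a collision is literally "entry + 1",
    # usually still in the same cache line. None marks an empty slot.
    mask = len(table) - 1
    i = hash(key) & mask
    while table[i] is not None:
        if table[i] == key:
            return True
        i = (i + 1) & mask          # "entry++" style linear probe
    return False

def lookup_compact(indices, entries, key):
    # Compact layout: every probe reads the sparse index table and then
    # dereferences into the dense entry array: one extra indirection per
    # probe. -1 marks an empty index slot.
    mask = len(indices) - 1
    i = hash(key) & mask
    while indices[i] != -1:
        if entries[indices[i]] == key:
            return True
        i = (i + 1) & mask
    return False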

Quick summary of what I found when I last ran experiments with this idea:

* To get the same lookup performance, the density of the index table would need to go down to around 25%. Otherwise, there's no way to make up for the extra indirection and the loss of cache locality.

* There was a small win on iteration performance because it's cheaper to loop over a dense array than a sparse array (fewer memory accesses and elimination of the unpredictable branch). This is nice because iteration performance matters in some key use cases.

* I gave up on ordering right away. If we care about performance, keys can be stored in the order added; but no effort should be expended to maintain order if subsequent deletions occur. Likewise, to keep set-to-set operations efficient (i.e. looping over the smaller input), no order guarantee should be given for those operations. In general, we can let order happen but should not guarantee it and work to maintain it, or slow down essential operations to make them ordered.

* Compacting does make sets a little smaller but does cost an indirection and incurs a cost for switching index sizes between 1-byte arrays, 2-byte arrays, 4-byte arrays, and 8-byte arrays. Those don't seem very expensive; however, set lookups are already very cheap when the hash values are known (when they're not, the cost of computing the hash value tends to dominate anything done by the setobject itself).

* I couldn't find any existing application that would notice the benefit of making sets a bit smaller. Most applications use dictionaries (directly or indirectly) everywhere, so compaction was an overall win there. Sets tend to be used more sparsely (no pun intended) and tend to be only a small part of overall memory usage. I had to consider this when bumping the load factor down to 60%, prioritizing speed over space.

Raymond

On Feb 26, 2019, at 13:02, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
* I gave up on ordering right away. If we care about performance, keys can be stored in the order added; but no effort should be expended to maintain order if subsequent deletions occur. Likewise, to keep set-to-set operations efficient (i.e. looping over the smaller input), no order guarantee should be given for those operations. In general, we can let order happen but should not guarantee it and work to maintain it or slow-down essential operations to make them ordered.
One thing that concerns me is if the ordering for sets is different than for dictionaries. Well, it kind of is already, but it’s easier to say “dicts preserve insertion order, sets are unordered” than to say they are both ordered but with different guarantees. The behavior differences between dicts and sets are already surprising to many users, so we should be careful not to make the situation worse. -Barry

On Tue, Feb 26, 2019 at 3:43 PM Barry Warsaw <barry@python.org> wrote:
The behavior differences between dicts and sets are already surprising to many users, so we should be careful not to make the situation worse.
It's a nice-to-have, but other than the fact that we all used to use a dict when we really wanted a set before set existed, I'm not sure the connection is there for a layperson. A mapping and a set type really don't have much to do with each other other than implementation -- anyone that isn't familiar with python C code, or hash tables in general, wouldn't likely have any expectation of them having anything to do with each other.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception

Chris.Barker@noaa.gov

On Feb 27, 2019, at 13:54, Chris Barker via Python-Dev <python-dev@python.org> wrote:
A mapping and a set type really don't have much to do with each other other than implementation -- anyone that isn't familiar with python C code, or hash tables in general, wouldn't likely have any expectation of them having anything to do with each other.
I’m just relaying a data point. Some Python folks I’ve worked with do make the connection between dicts and sets, and have questions about the ordering guarantees of them (and how they relate). -Barry

If sets were ordered, then what ought pop() to return: the first element, the last, or still an arbitrary one? I lean toward arbitrary, because in existing code set.pop() usually implies that which particular element is removed is immaterial. On Wed, Feb 27, 2019 at 2:18 PM Barry Warsaw <barry@python.org> wrote:

On Thu, Feb 28, 2019 at 7:23 AM Henry Chen <tahafut@gmail.com> wrote:
dict.popitem() pops the last inserted pair. So set.pop() must remove the last element. https://docs.python.org/3/library/stdtypes.html#dict.popitem -- INADA Naoki <songofacandy@gmail.com>
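[Editorial illustration] For reference, a small example of the current behavior being compared here. dict.popitem() is documented as LIFO on Python 3.7+, while set.pop() returns an arbitrary element by design, so no specific result is shown for it.

d = dict.fromkeys(["a", "b", "c"])
print(d.popitem())   # ('c', None): dict.popitem() removes the last inserted pair (LIFO)

s = {"a", "b", "c"}
print(s.pop())       # removes and returns an arbitrary element; no order is guaranteed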

On Wed, Feb 27, 2019 at 02:15:53PM -0800, Barry Warsaw wrote:
Sets and dicts are not related by inheritance (except that they're both subclasses of ``object``, but so is everything else). They don't share an implementation. They don't provide the same API. They don't do the same thing, except in the most general sense that they are both collections.

What connection are these folks making? If they're old-timers, or have read some Python history, they might remember back in the ancient days when sets were implemented on top of dicts, or even before then, when we didn't have a set type at all and used dicts in an ad-hoc way. But apart from that long-obsolete historical connection, I think the two types are unrelated and the behaviour of one has little or no implication for the behaviour of the other.

"Cars have windows that you can open. Submarines should have the same. They're both vehicles, right?"

-- Steven

On Thu, 28 Feb 2019 22:43:04 +1100 Steven D'Aprano <steve@pearwood.info> wrote:
Some of them may be coming from C++, where the respective characteristics of set and map (or unordered_set and unordered_multimap) are closely related. I'm sure other languages show similar analogies. On a more abstract level, set and dict are both content-addressed collections parameterized on hash and equality functions. For algorithmically-minded people it makes sense to see a close connection between them. Regards, Antoine.

Indeed, e.g. Rust's HashSet is a trivial wrapper around a HashMap (with no value): https://doc.rust-lang.org/src/std/collections/hash/set.rs.html#121-123, and its BTreeSet has the exact same relationship to BTreeMap: https://doc.rust-lang.org/src/alloc/collections/btree/set.rs.html#72-74
I can't speak for anyone else but before seeing this thread I actually assumed (without any evidence or having checked obviously) that the set builtin was built on top of dict or that they were built on the same base and that the changes to dict's implementation in 3.6 (ordering, space, …) had affected sets in the same way. That seems intuitively straightforward, even more so with dict.keys() being a set.
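[Editorial illustration] On that last point, dict views really do support set operations, which reinforces the intuition. A quick example (the printed element order of the union may vary between runs):

d1 = {"a": 1, "b": 2, "c": 3}
d2 = {"b": 20, "d": 40}

# dict.keys() returns a view that behaves like a set for these operations.
print(d1.keys() & d2.keys())        # {'b'}
print(d1.keys() | d2.keys())        # {'a', 'b', 'c', 'd'} (order may vary)
print(d1.keys() - d2.keys())        # {'a', 'c'} (order may vary)
print(d1.keys().isdisjoint({"x"}))  # True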

On Thu, Feb 28, 2019 at 10:58 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
Looking from the opposite direction, sets and dicts can be used to solve a lot of the same problems. Want to detect cycles? Stuff things you see into a set. If the thing is in the set, you've already seen it. What if you want to track WHERE the cycle came from? Then stuff things you see into a dict, mapping them to some kind of trace information. Again, if the thing's in the collection, you've already seen it, but now, since it's a dict, you can pull up some more info. And collections.Counter can be used kinda like a multiset, but it's definitely a dictionary. I think the similarities are more pragmatic than pure, but that doesn't mean they aren't real. ChrisA
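[Editorial illustration] A tiny sketch of the two idioms described above, with a hypothetical linked-structure traversal just to show the parallel: the set answers "have I seen this?", while the dict also records where it was seen.

def find_cycle_node(start, next_node):
    """Return the first repeated node, using a set for membership only."""
    seen = set()
    node = start
    while node is not None:
        if node in seen:
            return node
        seen.add(node)
        node = next_node(node)
    return None

def find_cycle_with_trace(start, next_node):
    """Same traversal, but a dict also records *where* each node was first seen."""
    seen = {}
    node, step = start, 0
    while node is not None:
        if node in seen:
            return node, seen[node], step   # node, first position, current position
        seen[node] = step
        node = next_node(node)
        step += 1
    return None

# Example: a "linked list" 0 -> 1 -> 2 -> 3 -> 1 -> ...
links = {0: 1, 1: 2, 2: 3, 3: 1}
print(find_cycle_node(0, links.get))         # 1
print(find_cycle_with_trace(0, links.get))   # (1, 1, 4)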

Antoine Pitrou wrote:
On a more abstract level, set and dict are both content-addressed collections parameterized on hash and equality functions.
Indeed. It's been said that a set is like "half a dict", and this is why sets were implemented using dicts in the old days. It's kind of an obvious thing to do. You can argue about how far the analogy should be taken, but you can't blame people for noticing the similarity. -- Greg

ordered set 2

Hi, Raymond. Thank you for your detailed response, and I'm sorry about the late reply. All of your points make sense to me. My implementation has not been battle-tested yet.

As I wrote in a previous mail, there is only one benchmark in pyperformance that was affected significantly. (My implementation was 13% faster.) After that, I wrote some microbenchmarks. Some run faster on the current implementation and some run faster on the ordered implementation.

https://github.com/methane/sandbox/tree/master/python/ordered-set

For example, the current implementation is at most 20% faster at creating a new set from consecutive integers (e.g. `set(range(N))`). As you know, that is because hash(int) is consecutive too. On the other hand, the ordered implementation is at most 17% faster at creating a new set from string values.

But I don't know how these microbenchmark results affect real-world set workloads. pyperformance doesn't have enough variation in set workloads. Would you please tell me how you gathered varied set workloads?
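[Editorial illustration] For context, the kind of microbenchmark being compared here is roughly the following timeit sketch; the actual benchmarks live in the linked sandbox repository, so treat these names and sizes as illustrative only.

import timeit

N = 100_000
ints = list(range(N))              # consecutive integers: hash(int) is consecutive too
strs = [str(i) for i in range(N)]  # strings: hashes are effectively scattered

print("set(ints):", min(timeit.repeat(lambda: set(ints), number=20, repeat=5)))
print("set(strs):", min(timeit.repeat(lambda: set(strs), number=20, repeat=5)))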
* To get the same lookup performance, the density of index table would need to go down to around 25%. Otherwise, there's no way to make up for the extra indirection and the loss of cache locality.
Yes. Currently I chose a capacity ratio of 50%, and I removed the 4x resize on small sets, so density is about 25~50% for now. Performance of a simple lookup is 5~8% slower. A smaller capacity ratio might improve performance a bit, but it tends toward worse memory efficiency when the index table needs 32-bit or 64-bit entries.
* There was a small win on iteration performance because it's cheaper to loop over a dense array than a sparse array (fewer memory accesses and elimination of the unpredictable branch). This is nice because iteration performance matters in some key use cases.
Yes. And it can be a huge performance benefit in extreme cases. (e.g. https://bugs.python.org/issue32846)
* I gave up on ordering right away. If we care about performance, keys can be stored in the order added; but no effort should be expended to maintain order if subsequent deletions occur. Likewise, to keep set-to-set operations efficient (i.e. looping over the smaller input), no order guarantee should be given for those operations. In general, we can let order happen but should not guarantee it and work to maintain it or slow-down essential operations to make them ordered.
I agree. Maybe set shouldn't guarantee preserving insertion order, unlike dict. That loses almost half of the benefit of the new implementation :-( But order is stable in most cases anyway, regardless of hash randomization: a stable pyc can be compiled without PYTHONHASHSEED=0, and sets in logs are mostly stable and readable.
* Compacting does make sets a little smaller but does cost an indirection and incurs a cost for switching index sizes between 1-byte arrays, 2-byte arrays, 4-byte arrays, and 8-byte arrays. Those don't seem very expensive; however, set lookups are already very cheap when the hash values are known (when they're not, the cost of computing the hash value tends to dominate anything done by the setobject itself).
Yes. The ordered implementation makes the simple cases slower. But memory efficiency is very different if an application uses tons of small sets.
* I couldn't find any existing application that would notice the benefit of making sets a bit smaller. Most applications use dictionaries (directly or indirectly) everywhere, so compaction was an overall win. Sets tend to be used more sparsely (no pun intended) and tend to be only a small part of overall memory usage. I had to consider this when bumping the load factor down to 60%, prioritizing speed over space.
You're right. The number of set objects in the Python interpreter is much smaller than the number of dicts. And the ordered set is not memory efficient for large sets. Maybe we can't find a clear net win from this set implementation. I will stop this work at some point.

Thank you,
-- Inada Naoki <songofacandy@gmail.com>

participants (12)

- Antoine Pitrou
- Barry Warsaw
- Chris Angelico
- Chris Barker
- Greg Ewing
- Henry Chen
- Inada Naoki
- INADA Naoki
- Raymond Hettinger
- Steven D'Aprano
- Victor Stinner
- Xavier Morel