Fwd: Re: Experimenting with dict performance, and an immutable dict

On Wed, 22 Jul 2020 at 15:35, Antoine Pitrou <solipsis@pitrou.net> wrote:
The deltas also look small for a micro-benchmark. I certainly don't think this is a sufficient reason to add a new datatype to Python.
I think some of the optimizations can be tried on dict itself. The deltas seem small because I divide them by the number of timeit runs; a 30-45% gain in constructor speed is not something small. The benchmarks in question are the ones whose Name starts with "constructor".

PS: sorry for the double mail, I sent the first message only to you by mistake.

On Wed, 22 Jul 2020 at 18:26, Guido van Rossum <guido@python.org> wrote:
Did you study PEP 416 (frozendict) and PEP 603 (frozenmap)?
Yes. About frozenmap: at the very start I also included immutables.Map in the benchmarks, but I removed it when I realized that it was slower than frozendict in every bench. Maybe this is because immutables.Map is a C extension and not a builtin type.

About PEP 416: yes, I tried to follow it. Indeed the hash is calculated using the strategy described in the PEP. I also took a look at the frozenset and tuple code.

Frankly, I must admit that the rejection of PEP 416 was a good idea. I started to implement an immutable dict only for symmetry, and because I am somewhat fascinated by functional programming and immutability, without a rational reason. I suppose I would have given up and admitted that a frozendict is quite useless in Python if I had not seen that the constructor can be made faster. I'm not 100% sure of the result. This is why I'm suggesting to first try the speed optimizations on the constructor of dict itself:

1. the optimizations could be unsafe. Indeed I recently got a segfault that I have to investigate
2. maybe dict can really be optimized further. After that, the difference between dict and frozendict performance could be minimal

That said, maybe there are four good use cases for a frozendict:

1. it could be used for __annotations__ of function objects, and similar cases
2. it could be used to implement "immutable" modules or classes
3. frozendict can be easily cached, like tuples
4. as an alternative to MappingProxyType, which is very slow and is used also in the C code of CPython and in the Python stdlib. This maybe is not useful because:
   a. the speed of MappingProxyType can be improved
   b. MappingProxyType can't always be replaced, because sometimes you *really* want and need a proxy
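For readers unfamiliar with the idea, here is a minimal pure-Python sketch of such a type (hypothetical, not the C implementation discussed in this thread), using the hash strategy PEP 416 describes, next to MappingProxyType for contrast:

```python
from types import MappingProxyType

class frozendict(dict):
    """Minimal sketch of an immutable dict (hypothetical, pure Python)."""

    def _immutable(self, *args, **kwargs):
        raise TypeError("frozendict is immutable")

    # block every mutating method inherited from dict
    __setitem__ = __delitem__ = _immutable
    clear = pop = popitem = setdefault = update = _immutable

    def __hash__(self):
        # PEP 416 suggests hashing via a frozenset of the items
        return hash(frozenset(self.items()))

fd = frozendict(a=1, b=2)
assert hash(fd) == hash(frozendict(b=2, a=1))  # order-independent

# MappingProxyType, by contrast, is a live read-only *view* of a dict:
d = {"a": 1}
proxy = MappingProxyType(d)
d["a"] = 2
assert proxy["a"] == 2  # the proxy tracks the underlying dict
```

The key difference: a frozendict is an independent hashable value, while a proxy only forbids mutation through itself and still reflects changes to the wrapped dict.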

On Wed, Jul 22, 2020 at 9:25 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote: <...>
Because you didn't benchmark the case that matters for PEP 603: mutation. E.g. for a regular dict that would be "d1 = d.copy(); d1[key] = new_value", meaning "I need to mutate my dict while keeping its old version intact". Here are the benchmark results for this operation for PEP 603: https://github.com/MagicStack/immutables/blob/master/README.rst (note that there "regular dict" means "d[key] = new_value", i.e. without copying).
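Yury's point can be reproduced with only the stdlib: copy-then-mutate on a plain dict is O(n) in the dict size, which is exactly the operation a HAMT makes cheap. A rough sketch (the timings are illustrative, not the README's numbers):

```python
import timeit

def bench_copy_mutate(n, number=10_000):
    setup = f"d = {{i: i for i in range({n})}}"
    # "mutate while keeping the old version intact" -- O(n) per mutation
    copy_mut = timeit.timeit("d1 = d.copy(); d1[0] = -1",
                             setup=setup, number=number)
    # plain in-place mutation, for comparison -- O(1) per mutation
    in_place = timeit.timeit("d[0] = -1", setup=setup, number=number)
    return copy_mut, in_place

small_copy, small_inplace = bench_copy_mutate(8)
big_copy, big_inplace = bench_copy_mutate(1000)
# the copy-and-mutate cost grows with the dict size; in-place stays flat
print(f"n=8:    copy+set {small_copy:.4f}s  in-place {small_inplace:.4f}s")
print(f"n=1000: copy+set {big_copy:.4f}s  in-place {big_inplace:.4f}s")
```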
Maybe this is because immutables.Map is a C extension and not a builtin type.
frozenmap implements a fundamentally different data structure; that's the reason. Its data structure has O(log N) mutations, as opposed to O(n) for a naive frozen dict. All other operations (like lookups), while still O(1), are slightly slower than the equivalent dict operations.
About PEP 416: yes, I tried to follow it. Indeed the hash is calculated using the strategy described in the PEP. I also took a look at the frozenset and tuple code.
Frankly, I must admit that the rejection of PEP 416 was a good idea. I started to implement an immutable dict only for symmetry, and because I am somewhat fascinated by functional programming and immutability, without a rational reason.
If you're fascinated by functional programming, then your version of the frozen dict should support efficient (not O(n)) mutations. That's the defining characteristic of hash maps in languages like Clojure. Otherwise it doesn't take much to freeze a dict (prohibit mutations), but such dicts are only useful because they are hashable. That's the reason why popular libraries like https://immutable-js.github.io/immutable-js/ implement data structures similar to PEP 603.

Yury

On Thu, 23 Jul 2020 at 07:02, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
If you're fascinated by functional programming then your version of the frozen dict should support efficient (not "O(n)") mutations.
Well, I tried it. Not O(n), but the result is quite interesting.

Name: `o.set(key, value)`; Size: 8; Keys: str; Type: frozendict; Time: 1.43e-07; Sigma: 6e-09
Name: `o.set(key, value)`; Size: 8; Keys: str; Type: Map; Time: 1.44e-07; Sigma: 6e-08
////////////////////////////////////////////////////////////////////////////////
Name: `o.set(key, value)`; Size: 8; Keys: int; Type: frozendict; Time: 1.44e-07; Sigma: 5e-09
Name: `o.set(key, value)`; Size: 8; Keys: int; Type: Map; Time: 1.46e-07; Sigma: 6e-08
////////////////////////////////////////////////////////////////////////////////
Name: `o.set(key, value)`; Size: 1000; Keys: str; Type: frozendict; Time: 8.67e-06; Sigma: 3e-07
Name: `o.set(key, value)`; Size: 1000; Keys: str; Type: Map; Time: 2.47e-07; Sigma: 7e-08
////////////////////////////////////////////////////////////////////////////////
Name: `o.set(key, value)`; Size: 1000; Keys: int; Type: frozendict; Time: 6.21e-06; Sigma: 2e-07
Name: `o.set(key, value)`; Size: 1000; Keys: int; Type: Map; Time: 2.89e-07; Sigma: 8e-09

For small dictionaries, it seems there's almost no difference between Map and frozendict. Indeed dict is optimized for sizes equal to or smaller than 8. As expected, the difference grows with the size of the dictionary.

About mutate(): in theory fdict could simply return a dict, and dict could have a freeze() method that returns a frozendict. Since dict is quite fast at writing, it's difficult to compare the performances.

This is the source code of the benchmark: https://github.com/Marco-Sulla/cpython/blob/master/frozendict/test/bench_imm...
This is the diff: https://github.com/python/cpython/compare/24618c5df45dc1096604e7d1dc90835965... <https://github.com/Marco-Sulla/cpython/commit/7c6cec1bf74439dbaeb2a1e6750e31...>

Maybe immutables.Map can be further optimized. For example, "set()" could use METH_FASTCALL. The only problem is that METH_FASTCALL is not part of the limited API.
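For context, a `set()` implemented naively on top of dict rebuilds the whole mapping on each call, which is why its times grow with size while the HAMT-based Map stays nearly flat. A hypothetical copy-based version:

```python
def frozendict_set(fd, key, value):
    """Hypothetical copy-based set(): O(n) in the number of entries."""
    new = dict(fd)      # full copy of all existing entries
    new[key] = value
    return new          # a real frozendict would freeze/wrap this

d = {i: i for i in range(1000)}
d2 = frozendict_set(d, "k", "v")
assert len(d) == 1000 and len(d2) == 1001  # the original is left intact
```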
I think that a frozenmap is quite interesting, but a frozendict could have the advantage of being as fast as dict on reading, and could also have not-so-slow methods to modify itself for common cases. Furthermore, since it has the same struct as dict, it can be accessed by the internal optimized functions.

I also remembered another possible use case: kwargs in CPython. In C code, kwargs are PyDictObjects. I suppose they are usually not modified; if so, fdict could be used, since it seems to be faster at creation.

On Thu, 23 Jul 2020 at 07:44, Wes Turner <wes.turner@gmail.com> wrote:
pyrsistent.PMap and PRecord may be worth a look
Thank you, I'll check them.

On Sat, Jul 25, 2020 at 12:45 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
That's an interesting idea. It has always vaguely bothered me that `*args` gives a tuple while `**kwds` gives a dict. Unfortunately it's impossible to change without breaking tons of existing code, since things like `kwds.pop("something")` have become a pretty standard idiom.

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Sat, Jul 25, 2020 at 12:58 PM Guido van Rossum <guido@python.org> wrote:
And just to be clear for folks that don't know the idiom (it took me a second to remember it myself), it's common to do something like:

def spam(**kwargs):
    """Do 'something', then call bacon()."""
    if 'something' in kwargs:
        use_something(kwargs.pop('something'))
    bacon(**kwargs)

def bacon(**kwargs): ...

You see this in decorators, for instance, that add an extra argument to do some pre-processing before calling the decorated function.

On Sun, Jul 26, 2020 at 4:44 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
I also remembered another possible use-case: kwargs in CPython. In C code, kwargs are PyDictObjects. I suppose they are usually not modified; if so, fdict could be used, since it seems to be faster at creation.
I have not confirmed why frozendict is faster than dict at creation. But note that kwargs is not created by `dict(d)`. It is created by PyDict_New() and PyDict_SetItem(). Please benchmark these C APIs if you want to propose the idea.

https://github.com/python/cpython/blob/a74eea238f5baba15797e2e8b570d153bc869...
https://github.com/python/cpython/blob/a74eea238f5baba15797e2e8b570d153bc869...

FWIW, I optimized dict(d) in https://bugs.python.org/issue41431 (https://github.com/python/cpython/pull/21674):

$ ./python -m pyperf timeit --compare-to ./python-master -s 'd=dict.fromkeys(range(1000))' -- 'dict(d)'
python-master: ..................... 21.5 us +- 0.2 us
python: ..................... 4.52 us +- 0.16 us
Mean +- std dev: [python-master] 21.5 us +- 0.2 us -> [python] 4.52 us +- 0.16 us: 4.76x faster (-79%)

Regards,
--
Inada Naoki <songofacandy@gmail.com>
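Inada's distinction can be mimicked at the Python level: kwargs dicts are built entry by entry (PyDict_New() plus repeated PyDict_SetItem()), a different code path from `dict(d)`, so the two must be benchmarked separately. A stdlib-only sketch of the two paths (at the C level the setitem path has no interpreter-loop overhead, so the gap shown here overstates the real difference; it only illustrates that the paths differ):

```python
import timeit

setup = "src = {i: i for i in range(1000)}"

# entry-by-entry construction: the shape of the path kwargs creation takes in C
item_by_item = timeit.timeit(
    "d = {}\nfor k, v in src.items():\n    d[k] = v",
    setup=setup, number=1_000)

# one-shot construction from an existing dict, i.e. dict(d)
from_dict = timeit.timeit("dict(src)", setup=setup, number=1_000)

print(f"setitem loop: {item_by_item:.4f}s   dict(src): {from_dict:.4f}s")
```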

On Wed, 29 Jul 2020 at 06:41, Inada Naoki <songofacandy@gmail.com> wrote:
Great!

On Wed, 29 Jul 2020 at 06:41, Inada Naoki <songofacandy@gmail.com> wrote:
Yes, I was really thinking about PyDict_New().

On Sat, 25 Jul 2020 at 21:55, Guido van Rossum <guido@python.org> wrote:
Well, I was thinking about CPython code. Anyway, maybe there's a way not to break old code:

def hello(name: str, **kwargs: frozendict) -> str:
    # code

In the meanwhile I tried to optimize iteration. It seems that iteration can also be faster: https://github.com/Marco-Sulla/cpython/commit/3d802ba227eb588b0608f31cfa2357...

I also made some other changes later, but for some reason I now get a segfault: https://github.com/Marco-Sulla/cpython/commit/51e2c45a5b7fa36f47aa55f2b4426c...

@Inada (senpai?): in the meanwhile I tried to apply some tricks to speed up dict creation, without success. This is my effort: https://github.com/Marco-Sulla/cpython/blob/master/Objects/dictobject.c

As you can see, I simply "cloned" some functions that create dicts, and removed some checks that are not needed at creation:

1. the "override" flag
2. the check for resizing, since in theory I would do one big resize at the start (but I have some problems, see below)
3. the introduction of an "empty" parameter, that is false only if you have to insert key-value pairs AND you also have a positional parameter
4. some random const, where the code permits it
5. changing ma_used, ma_version_tag, dk_usable and dk_nentries only once, after the for loop that inserts all the items
6. some little shortcuts (for example, PyDictKeysObject *keys = mp->ma_keys; and using keys in the rest of the code)
7. changing some checks into asserts
8. calling the internal static functions directly

I tested it with a dict with a "hole", since I noticed that you already improved the creation of a dict when the source is another dict and it's "perfect". The speedup is a mere 2.5%...

Furthermore, I failed to do one big resize with dict. I was ok with frozendict because I always used dicts or seq2s as parameters. But defaultdict passes a class as the first argument, so PyObject_Length fails miserably, and I had to remove the resize. Anyway, I didn't reintroduce the resize, so in theory the code should be faster (even if one odict test fails, of course).

Some questions:
1. How can we check the size of an object only if it's an iterable, using the Python C API?
2. Why, in your opinion, was no relevant speedup achieved?

PS: my next and final try is to improve the speed of lookups, the only other area where frozendict is faster. I hope to have better luck there.

On Tue, Sep 15, 2020 at 5:08 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
1. How can we check the size of an object only if it's an iterable using the Python C API?
There is no good way. Additionally, we need to know the distinct count if we want to preallocate the hash table. For example, `len(dict(["ab"]*1000))` is 1, not 1000.
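A quick demonstration of why the length of the source alone cannot size the hash table:

```python
# an iterable of pairs can collapse to far fewer keys than its length,
# so preallocating len(pairs) slots would waste memory or require a shrink
pairs = [("a", 1)] * 1000
d = dict(pairs)
assert len(pairs) == 1000
assert len(d) == 1
```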
2. Why, in your opinion, was no relevant speedup achieved?
We already have "one big resize" logic in dict_merge. And I use a dummy empty dictkeys object for new empty dicts, so we don't allocate any temporary, intermediate dictkeys object.

Bests,
--
Inada Naoki <songofacandy@gmail.com>

On Tue, 15 Sep 2020 at 09:22, Inada Naoki <songofacandy@gmail.com> wrote:
Well, yes, but only for the first positional argument, and only if it's a map. I would resize to the maximum possible size, that is len(arg) + len(kwarg). Of course the size can be overestimated, but I suppose the overlaps are very rare and small. The problem is that if I do this resize in dict_new, a segfault happens when the compilation runs

python -E -S -m sysconfig --generate-posix-vars

If I reintroduce the temporary dummy keys, it works.

Well, I simply forgot vectorcall. I've done a test and it seems the speedup is about 25%. Unluckily, I have now definitively abandoned the hope of one big resize: it seems I don't resize correctly, since I again get an assert error on dk_usable. What is the difference between dk_usable and USABLE_FRACTION?

Well, it seems ok now: https://github.com/python/cpython/compare/master...Marco-Sulla:master

I've done a quick speed test and the speedup is quite high for creation using keywords or from a dict with "holes": about 30%.

python -m timeit -n 2000 --setup "from uuid import uuid4 ; o = {str(uuid4()).replace('-', '') : str(uuid4()).replace('-', '') for i in range(10000)}" "dict(**o)"

python -m timeit -n 10000 --setup "from uuid import uuid4 ; o = {str(uuid4()).replace('-', '') : str(uuid4()).replace('-', '') for i in range(10000)} ; it = iter(o) ; key0 = next(it) ; o.pop(key0)" "dict(o)"

Can I do a PR?

That sounds like a worthwhile optimization. FWIW, is this a bit simpler but sufficient?

python -m timeit -n 2000 --setup "from uuid import uuid4; o = {uuid4().hex: i for i in range(10000)}" "dict(**o)"

Is there a preferred tool to comprehensively measure the performance impact of a PR (with e.g. multiple contrived and average-case key/value sets)?

On Wed, Sep 16, 2020, 7:07 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:

Would e.g. a bm_dict.py in [1] be a good place for a few dict benchmarks; or is there a more appropriate project for authoritatively measuring performance regressions and optimizations of core {cpython,} data structures?

[1] https://github.com/python/pyperformance/tree/master/pyperformance/benchmarks

(pytest-benchmark looks neat as well. An example of how to use pytest.mark.parametrize to capture multiple metrics might be helpful: https://github.com/ionelmc/pytest-benchmark)

It's easy to imagine a bot that runs some or all performance benchmarks on a PR when requested in a PR comment; there's probably already a good way to do this?

On Wed, Sep 16, 2020, 10:44 PM Wes Turner <wes.turner@gmail.com> wrote:

On Thu, Sep 17, 2020 at 8:03 AM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
30% on a microbenchmark is not that high. For example, I optimized "copy dict with holes", but I rejected my own PR because I was not sure the performance / maintenance cost ratio was good enough.

https://bugs.python.org/issue41431#msg374556
https://github.com/python/cpython/pull/21669
I don't think this use case is worth optimizing, because `dict(o)` or `o.copy()` is Pythonic.
It is controversial. If the optimization is very simple, it might be worthwhile.

Regards,
--
Inada Naoki <songofacandy@gmail.com>

On Thu, 17 Sep 2020 at 05:31, Inada Naoki <songofacandy@gmail.com> wrote:
Well, {**dict1, **dict2} is also Pythonic. Anyway, I used **dict as a shortcut for testing keyword assignment. For doing this I "only" cloned PyDict_SetItem and insertdict. I do not like code duplication, but dictobject.c already has a lot of duplicated copies of the same function for optimization (see lookdict).
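For completeness, the merge idiom mentioned above, shown next to the `|` operator that Python 3.9 added for the same purpose (PEP 584):

```python
a = {"x": 1}
b = {"x": 2, "y": 3}

merged = {**a, **b}            # the later mapping wins on key collisions
assert merged == {"x": 2, "y": 3}

# since Python 3.9 (PEP 584) the same merge is spelled with |
assert a | b == merged
```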

I've done a PR: https://github.com/python/cpython/pull/22346

As you can see, the changes are not dramatic if you improve only kw creation. Furthermore, IMHO insert_to_emptydict() can be removed, since it speeds up the insertion of the first element but slows down all the others. I do something similar in insertdict_init, but in "bulk mode".

On Thu, 17 Sep 2020 at 16:49, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:

On Thu, Sep 17, 2020 at 11:50 PM Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
lookdict is a very special case. It is used for namespace lookups and is a performance-critical part of CPython. We should always take care of the cost/merit ratio.

Regards,
--
Inada Naoki <songofacandy@gmail.com>

In the meanwhile, I updated the frozendict code to the new 3.10 code. And here I need some help. As you can see from the new benchmarks:

https://github.com/Marco-Sulla/cpython/blob/frozendict/frozendict/test/bench...

creation of frozendict is not faster anymore. This is because Inada introduced memcpy to do a fast init of a dict from a "good" dict. I tried to copy the code for frozendict, but I get a good old memory corruption. I spent several hours trying to understand what I'm doing wrong, without success. The code is the one commented out at lines 1084 and 2978, starting with an "it does not work" comment...

https://github.com/Marco-Sulla/cpython/blob/frozendict/Objects/dictobject.c

Well, it seems to work now, but creation from a combined dict without holes is no longer faster; on the contrary, it is 2x slower. This is probably because:

1. combined dicts need only one memcpy
2. probably I got something wrong in my code, since I need TWO Py_INCREFs for keys and values: https://github.com/Marco-Sulla/cpython/blob/2eea9ff796685127fc03fcc30ff6c652... (it's frozendict_clone and it's used in frozendict_merge)

Iteration continues to be faster. Probably creation from a dict with holes too, but I did not test it.

I suppose frozendict could improve memory usage using shared keys and shared frozendicts. Probably I'll try to write a C extension, even if I'll need a lot of help, in another mailing list.

I have some random remarks about possible improvements to dict performance:

A. lookdict functions that are unicode-only could return zero immediately if the searched key is not a string, instead of falling back to the basic lookdict
B. every time the dict is changed, the keys could be checked to see if they are all unicode, with an internal version of _PyDict_HasOnlyStringKeys (I created one for frozendict, dict_has_only_unicode_keys_exact), and dk_lookup changed accordingly. If the function changes only one value, it's sufficient to check whether that value is unicode or not, and keep the original lookup func
C. can USABLE_FRACTION be substituted with mp->ma_keys->dk_usable?
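The "shared keys" idea mentioned above refers to PEP 412 key-sharing dictionaries, which CPython already uses for instance __dict__s: instances of the same class can share a single keys table. A small illustration (exact byte counts vary by CPython version):

```python
import sys

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p1, p2 = Point(1, 2), Point(3, 4)

# both instance dicts hold the same keys ("x", "y"); under PEP 412 the
# keys table can be stored once and shared between instances
standalone = sys.getsizeof({"x": 1, "y": 2})
shared = sys.getsizeof(p1.__dict__)
print(f"standalone dict: {standalone} bytes, instance dict: {shared} bytes")
```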

I forgot: I noticed that creating a dict from an N*2 matrix is not optimized for lists and tuples. Is this because creation from a sequence of pairs is not used much?

participants (6)
- Brett Cannon
- Guido van Rossum
- Inada Naoki
- Marco Sulla
- Wes Turner
- Yury Selivanov