I noticed something this morning: there's another way in which Inada Naoki's benchmark here is (possibly?) unrealistic.

As mentioned, his benchmark generates a thousand functions, each of which takes exactly three parameters, with each parameter randomly choosing one of three annotations.  In current trunk (not in my branch; I'm behind), there's an optimization for stringized annotations: the compiler stores the annotations as a constant tuple, and when you pull __annotations__ off the object at runtime it converts that tuple into a dict on demand.
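
A rough illustration of the mechanism (the exact tuple layout is an implementation detail, so take this as a sketch):

    from __future__ import annotations  # PEP 563: annotations are stringized

    def f(a: int, b: str, c: float): ...

    # Current trunk stores the stringized annotations as a constant tuple,
    # something like ('a', 'int', 'b', 'str', 'c', 'float'), and only builds
    # the dict the first time __annotations__ is accessed:
    print(f.__annotations__)  # {'a': 'int', 'b': 'str', 'c': 'float'}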

This means that even though there are a thousand functions, they only ever generate one of twenty-seven (3×3×3) possible annotation tuples.  And here's the thing: our lovely marshal module is smart enough to notice that these tuples are duplicates, and it'll throw away the duplicates and replace them with references to the original.
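
You can watch the sharing happen by hand.  A minimal sketch (marshal shares by object identity; as I understand it, the compiler collapses equal constants into a single object before they ever reach marshal):

    import marshal

    t1 = ("a", "int", "b", "str", "c", "float")
    t2 = tuple(list(t1))  # equal to t1, but deliberately a distinct object

    # When marshal meets the same object a second time, it writes a short
    # back-reference instead of serializing the whole tuple again:
    print(len(marshal.dumps([t1, t1])))  # smaller: one tuple + one reference
    print(len(marshal.dumps([t1, t2])))  # larger: the tuple is written twice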

Something analogous could happen in the PEP 649 branch, but currently doesn't.  When running Inada Naoki's benchmark, there are a total of twenty-seven possible annotations code objects.  Except, each function generated by the benchmark has a unique name, and I incorporate that name into the name given to the code object (f"{function_name}.__co_annotations__").  Since each function name is different, each code object name is different, so each code object hash is different, and since they aren't exact duplicates they are never consolidated.

Inada Naoki has suggested changing this so that all the annotations code objects have the same name ("__co_annotations__").  If we made that change, I'm pretty sure the code size delta in this synthetic benchmark would drop.  I haven't done it because the current name of the code object might be helpful in debugging, and I'm not convinced the change would have any effect on real-world code.
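
To see why the name matters, here's a quick experiment at the prompt (code objects compare by value, and co_name participates in that comparison):

    code = compile("{'a': int}", "<annotations>", "eval")
    c1 = code.replace(co_name="f_0.__co_annotations__")
    c2 = code.replace(co_name="f_1.__co_annotations__")
    print(c1 == c2)  # False: identical except for the name

    # With the suggested generic name they become exact duplicates:
    print(c1.replace(co_name="__co_annotations__") ==
          c2.replace(co_name="__co_annotations__"))  # True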

But... would it?  Someone, and again I think it was Inada Naoki, suggested that in real-world applications there are often many, many functions in a single module with identical signatures.  The annotation-tuples optimization naturally takes advantage of that; PEP 649 doesn't.  Should it?  Would this really be beneficial to real-world code bases?  A crude survey, as sketched below, would be one way to find out.
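
Point something like this at a big annotated module (json here is just a stand-in) and see how many functions share each distinct set of annotations:

    import collections
    import inspect
    import json  # stand-in; aim this at a big annotated module instead

    counts = collections.Counter()
    for name, func in inspect.getmembers(json, inspect.isfunction):
        counts[tuple(func.__annotations__.items())] += 1

    # Functions per distinct annotation set, most common first:
    for annotations, n in counts.most_common():
        print(n, annotations)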

Cheers,


/arry


On 4/16/21 12:26 PM, Larry Hastings wrote:


Please don't confuse Inada Naoki's benchmark results with the effect PEP 649 would have on a real-world codebase.  His artificial benchmark constructs a thousand empty functions that take three parameters with randomly-chosen annotations; the results provide some insight but are not directly applicable to reality.
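
For anyone who hasn't seen it, the generator is conceptually something like this (a sketch reconstructed from the description, not his actual script):

    import random

    ANNOTATIONS = ["int", "str", "float"]  # assumed; the actual three may differ

    # Generate a module of a thousand trivial annotated functions.
    with open("bench_annotations.py", "w") as out:
        for i in range(1000):
            a, b, c = (random.choice(ANNOTATIONS) for _ in range(3))
            out.write(f"def func_{i}(a: {a}, b: {b}, c: {c}): pass\n")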

PEP 649's effects on code size / memory / import time are contingent on the number of annotations and the number of objects annotated, not on the overall code size of the module.  A module consisting of a thousand tiny annotated functions is close to the worst case; a large module with only a handful of annotations would barely notice.  Expressing the cost as overall percentages, and suggesting that Python users would see the same results with real-world code, is highly misleading.

I too would be interested to know the effect PEP 649 would have on a real-world codebase currently using PEP 563, but AFAIK nobody has reported such results.


/arry

On 4/16/21 11:05 AM, Jukka Lehtosalo wrote:
On Fri, Apr 16, 2021 at 5:28 PM Łukasz Langa <lukasz@langa.pl> wrote:
[snip] I say "compromise" because as Inada Naoki measured, there's still a non-zero performance cost of PEP 649 versus PEP 563:

- code size: +63%
- memory: +62%
- import time: +60%


Will this hurt some current users of typing? Yes, I can name you multiple past employers of mine where this will be the case. Is it worth it for Pydantic? I tend to think that yes, it is, since it is a significant community, and the operations on type annotations it performs are in the sensible set for which `typing.get_type_hints()` was proposed.
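
For reference, that's the use case where annotations are introspected at runtime, e.g.:

    from __future__ import annotations
    import typing

    def greet(name: str, times: int) -> list[str]: ...

    # get_type_hints() evaluates the stringized annotations back into real
    # objects; this is the operation Pydantic-style libraries rely on:
    print(typing.get_type_hints(greet))
    # {'name': <class 'str'>, 'times': <class 'int'>, 'return': list[str]}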

Just to give some more context: in my experience, both import time and memory use tend to be real issues in large Python codebases (code size less so), and I think that the relative efficiency of PEP 563 is an important feature.  If PEP 649 can't be made more efficient, this could be a major regression for some users.  Python server applications need to run multiple processes because of the GIL, and since code objects generally aren't shared between processes (GC and reference counting make it tricky, I understand), code size increases tend to be amplified on large servers.  Even having a lot of RAM doesn't necessarily help, since a lot of RAM typically implies many CPU cores, and thus many processes are needed as well.
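
To make the amplification concrete with made-up but plausible numbers: if annotations add, say, 30 MB to each process, a server running 32 worker processes pays nearly 1 GB of extra memory for annotations alone.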

I can see how both PEP 563 and PEP 649 bring significant benefits, but typically for different user populations. I wonder if there's a way of combining the benefits of both approaches. I don't like the idea of having toggles for different performance tradeoffs indefinitely, but I can see how this might be a necessary compromise if we don't want to make things worse for any user groups.

Jukka
