Investigating Python memory footprint of one real Web application
Hi, all.

After reading Instagram's blog article [1], I'm thinking about how Python can reduce the memory usage of Web applications. My company is building an API server with Flask, SQLAlchemy and typing (sorry, it's closed source), so I can get some data from its codebase.

[1]: https://engineering.instagram.com/dismissing-python-garbage-collection-at-in...

Report is here: https://gist.github.com/methane/ce723adb9a4d32d32dc7525b738d3c31

My thoughts are:

* Interning (None,) seems worthwhile.
* There are many empty dicts. Allocating ma_keys lazily may reduce memory usage.
* Most large strings are docstrings. Is it worth adding an option to strip docstrings without disabling asserts?
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
* Using string literals for annotating generic types may reduce WeakRef usage.
* Since typing will be used very widely this year, it needs more investigation.
2017-01-20 11:49 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
Report is here https://gist.github.com/methane/ce723adb9a4d32d32dc7525b738d3c31
Very interesting report, thanks!
My thoughts are:
* Interning (None,) seems worthwhile.
I guess that (None,) comes from constants of code objects:
>>> def f(): pass
...
>>> f.__code__.co_consts
(None,)
* There are many empty dicts. Allocating ma_keys lazily may reduce memory usage.
Would you be able to estimate how many bytes would be saved by this change? With the total memory usage to have an idea of the %. By the way, it would help if you can add the total memory usage computed by tracemalloc (get_traced_memory()[0]) in your report. About empty dict, do you expect that they come from shared keys? Anyway, if it has a negligible impact on the performance, go for it :-)
but other namespaces or annotations, like ('return',) or ('__wrapped__',) are not shared
Maybe we can intern all tuple which only contains one string? Instead of interning, would it be possible to at least merge duplicated immutable objects?
* Most large strings are docstrings. Is it worth adding an option to strip docstrings without disabling asserts?
Yeah, you are not the first one to propose. The problem is to decide how to name the .pyc file. My PEP 511 proposes to add a new -O command line option and a new sys.implementation.optim_tag string to support this feature: https://www.python.org/dev/peps/pep-0511/ Since the code transformer part of the PEP seems to be controversial, maybe we should extract only these two changes from the PEP and implement them? I also want -O noopt :-) (disable the peephole optimizer) Victor
On Fri, Jan 20, 2017 at 8:17 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-01-20 11:49 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
Report is here https://gist.github.com/methane/ce723adb9a4d32d32dc7525b738d3c31
Very interesting report, thanks!
My thoughts are:
* Interning (None,) seems worthwhile.
I guess that (None,) comes from constants of code objects:
>>> def f(): pass
...
>>> f.__code__.co_consts
(None,)
* There are many empty dicts. Allocating ma_keys lazily may reduce memory usage.
Would you be able to estimate how many bytes would be saved by this change? With the total memory usage to have an idea of the %.
Smallest dictkeysobject is 5*8 + 8 + (8 * 3 * 5) = 168 bytes. 1600 empty dicts = 268800 bytes. Unlike tuples bound to code objects, I don't think this is so important for cache hit rate. So tuple is more important than dict.
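For reference, the arithmetic spelled out (a rough sketch assuming CPython 3.6's dict layout; exact sizes vary per build):

# Back-of-the-envelope check of the numbers above (CPython 3.6 dict layout).
keys_header = 5 * 8     # dk_refcnt, dk_size, dk_lookup, dk_usable, dk_nentries
indices = 8             # dk_indices for the smallest (8-slot) table, 1 byte per slot
entries = 5 * 3 * 8     # 5 usable PyDictKeyEntry slots of 3 pointers (hash, key, value)
per_dict = keys_header + indices + entries
print(per_dict, per_dict * 1600)   # 168 268800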
By the way, it would help if you can add the total memory usage computed by tracemalloc (get_traced_memory()[0]) in your report.
Oh, nice to know it. I'll use it next time.
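For reference, the suggested measurement would look roughly like this (plain stdlib tracemalloc; where to start tracing depends on the script):

import tracemalloc

tracemalloc.start(10)            # keep up to 10 frames per traceback
# ... import the application modules to be measured here ...
current, peak = tracemalloc.get_traced_memory()
print("traced: current=%d bytes, peak=%d bytes" % (current, peak))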
About empty dict, do you expect that they come from shared keys? Anyway, if it has a negligible impact on the performance, go for it :-)
but other namespaces or annotations, like ('return',) or ('__wrapped__',) are not shared
Maybe we can intern all tuple which only contains one string?
Ah, it's dict's key. I used print(tuple(d.keys())) to count dicts.
Instead of interning, would it be possible to at least merge duplicated immutable objects?
I meant sharing the same object; I didn't mean using a dict or adding a bit for interning like interned strings. So I think we have the same idea.
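A minimal Python-level sketch of the idea (the cache and helper names are hypothetical, not an existing CPython API; the real change would live in marshal/compile):

# Merge duplicated tuples of strings into one shared object.
_tuple_cache = {}

def share_tuple(t):
    """Return a canonical object for tuples made only of strings."""
    if isinstance(t, tuple) and all(isinstance(x, str) for x in t):
        return _tuple_cache.setdefault(t, t)
    return t

a = share_tuple(('return',))
b = share_tuple(('return',))
assert a is b   # both callers now share one tuple object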
* Most large strings are docstrings. Is it worth adding an option to strip docstrings without disabling asserts?
Yeah, you are not the first one to propose. The problem is to decide how to name the .pyc file.
My PEP 511 proposes to add a new -O command line option and a new sys.implementation.optim_tag string to support this feature: https://www.python.org/dev/peps/pep-0511/
Since the code transformer part of the PEP seems to be controversial, maybe we should extract only these two changes from the PEP and implement them? I also want -O noopt :-) (disable the peephole optimizer)
Victor
On Fri, 20 Jan 2017 19:49:01 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
Report is here https://gist.github.com/methane/ce723adb9a4d32d32dc7525b738d3c31
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request" You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
* Most large strings are docstrings. Is it worth adding an option to strip docstrings without disabling asserts?
Perhaps docstrings may be compressed and then lazily decompressed when accessed for the first time. lz4 and zstd are good modern candidates for that. zstd also has a dictionary mode that helps for small data (*). See https://facebook.github.io/zstd/ (*) Even a 200-bytes docstring can be compressed this way:
>>> data = os.times.__doc__.encode()
>>> len(data)
211
>>> len(lz4.compress(data))
200
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
156
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
104
`dict_data` here is some 16KB dictionary I've trained on some Python docstrings. That 16KB dictionary could be computed while building Python (or hand-generated from time to time, since it's unlikely to change a lot) and put in a static array somewhere:
>>> samples = [(mod.__doc__ or '').encode() for mod in sys.modules.values()]
>>> sum(map(len, samples))
258113
>>> dict_data = zstd.train_dictionary(16384, samples)
>>> len(dict_data.as_bytes())
16384
Of course, compression is much more efficient on larger docstrings:
>>> import numpy as np
>>> data = np.__doc__.encode()
>>> len(data)
3140
>>> len(lz4.compress(data))
2271
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
1539
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
1348
>>> import pdb
>>> data = pdb.__doc__.encode()
>>> len(data)
12018
>>> len(lz4.compress(data))
6592
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
4502
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
4128
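At the Python level, the lazy-decompression idea could look roughly like the sketch below (assuming the python-zstandard package; the helper names are hypothetical, and CPython itself would of course do this in C at module-load time):

import zstandard

# write_content_size lets decompress() size its output buffer;
# a trained dictionary could be passed via dict_data=...
_cctx = zstandard.ZstdCompressor(write_content_size=True)
_dctx = zstandard.ZstdDecompressor()
_doc_cache = {}                        # function -> decompressed docstring

def compress_doc(func):
    """Store a compressed copy of the docstring and drop the plain string."""
    if func.__doc__:
        func.__compressed_doc__ = _cctx.compress(func.__doc__.encode('utf-8'))
        func.__doc__ = None
    return func

def get_doc(func):
    """Decompress the docstring on first access, then cache it."""
    if func not in _doc_cache and hasattr(func, '__compressed_doc__'):
        _doc_cache[func] = _dctx.decompress(func.__compressed_doc__).decode('utf-8')
    return _doc_cache.get(func, func.__doc__)

@compress_doc
def times():
    """Return a 5-tuple of floats indicating accumulated process times."""

print(get_doc(times))   # inflated only when somebody actually asks for it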
A similar strategy may be used for annotations and other rarely-accessed metadata. Another possibility, but probably much more costly in terms of initial development and maintenance, is to put the docstrings (+ annotations, etc.) in a separate file that's lazily read. I think optimizing the footprint for everyone is much better than adding command-line options to disable some specific metadata. Regards Antoine.
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request"
You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
Yes. I hadn't thought static memory footprint was so important.

But Instagram tried to increase the CoW efficiency of a prefork application, and got some success with memory usage and CPU throughput. I was surprised by it because prefork only shares the static memory footprint.

Maybe sharing some of the tuples that code objects hold may increase cache efficiency. I'll try running pyperformance with the marshal patch.
* Most large strings are docstrings. Is it worth adding an option to strip docstrings without disabling asserts?
Perhaps docstrings may be compressed and then lazily decompressed when accessed for the first time. lz4 and zstd are good modern candidates for that. zstd also has a dictionary mode that helps for small data (*). See https://facebook.github.io/zstd/
(*) Even a 200-bytes docstring can be compressed this way:
>>> data = os.times.__doc__.encode()
>>> len(data)
211
>>> len(lz4.compress(data))
200
>>> c = zstd.ZstdCompressor()
>>> len(c.compress(data))
156
>>> c = zstd.ZstdCompressor(dict_data=dict_data)
>>> len(c.compress(data))
104
`dict_data` here is some 16KB dictionary I've trained on some Python docstrings. That 16KB dictionary could be computed while building Python (or hand-generated from time to time, since it's unlikely to change a lot) and put in a static array somewhere:
Interesting. I noticed zstd was added to Mercurial (in the current RC version). But zstd (and brotli) are new projects, so I'll stay tuned.
A similar strategy may be used for annotations and other rarely-accessed metadata.
Another possibility, but probably much more costly in terms of initial development and maintenance, is to put the docstrings (+ annotations, etc.) in a separate file that's lazily read.
I think optimizing the footprint for everyone is much better than adding command-line options to disable some specific metadata.
I see. Although the -OO option exists, I can't strip only SQLAlchemy's docstrings with it. I would need to check that none of the dependency libraries require __doc__ before using -OO in production. We have almost one year before 3.7 beta1. We can find and implement a better way.
Regards
Antoine.
On 2017-01-20 13:15, INADA Naoki wrote:
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request"
You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
Yes. I hadn't thought static memory footprint was so important.

But Instagram tried to increase the CoW efficiency of a prefork application, and got some success with memory usage and CPU throughput. I was surprised by it because prefork only shares the static memory footprint.

Maybe sharing some of the tuples that code objects hold may increase cache efficiency. I'll try running pyperformance with the marshal patch.
IIRC Thomas Wouters (?) has been working on a patch to move the ref counter out of the PyObject struct and into a dedicated memory area. He proposed the idea to improve cache affinity, reduce cache evictions and to make CoW more efficient. Especially modern ccNUMA machines with multiple processors could benefit from the improvement, but also single processor/multi core machines. Christian
On Fri, 20 Jan 2017 13:40:14 +0100 Christian Heimes <christian@python.org> wrote:
IIRC Thomas Wouters (?) has been working on a patch to move the ref counter out of the PyObject struct and into a dedicated memory area. He proposed the idea to improve cache affinity, reduce cache evictions and to make CoW more efficient.
Especially modern ccNUMA machines with multiple processors could benefit from the improvement, but also single processor/multi core machines.
Moving the refcount out of the PyObject will probably make increfs / decrefs more costly, and there are a lot of them. We'd have to see actual measurements if a patch is written, but my intuition is that the net result won't be positive. Regards Antoine.
Moving the refcount out of the PyObject will probably make increfs / decrefs more costly, and there are a lot of them. We'd have to see actual measurements if a patch is written, but my intuition is that the net result won't be positive.
Regards
Antoine.
I agree with you. But I have a similar idea: split only PyGC_Head (3 words). A simple implementation may just use a pointer to PyGC_Head instead of embedding it: +1 word for tracked objects, and -2 words for untracked objects. A more complex implementation may use a bitmap for tracking objects; the memory pool has the bitmap. It means the GC module has its own memory pool and allocator, or the GC module and obmalloc are tightly coupled. But that's too hard; I don't think I can do it by Python 3.7. Reducing the number of tuples may be easier.
On Fri, 20 Jan 2017 22:30:16 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
Moving the refcount out of the PyObject will probably make increfs / decrefs more costly, and there are a lot of them. We'd have to see actual measurements if a patch is written, but my intuition is that the net result won't be positive.
Regards
Antoine.
I agree with you. But I have a similar idea: split only PyGC_Head (3 words).
That sounds like an interesting idea. Once an object is created, the GC header is rarely accessed. Since the GC header has a small constant size, it would probably be easy to make its allocation very fast (e.g. using a freelist). Then the GC header is out of the way which increases the cache efficiency of GC-tracked objects. Regards Antoine.
Larry Hastings' Gilectomy also moved the reference counter into a separated memory block, no? (grouping all refcounts into large memory blocks if I understood correctly.) https://github.com/larryhastings/gilectomy Victor 2017-01-20 13:40 GMT+01:00 Christian Heimes <christian@python.org>:
On 2017-01-20 13:15, INADA Naoki wrote:
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request"
You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
Yes. I hadn't thought static memory footprint was so important.

But Instagram tried to increase the CoW efficiency of a prefork application, and got some success with memory usage and CPU throughput. I was surprised by it because prefork only shares the static memory footprint.

Maybe sharing some of the tuples that code objects hold may increase cache efficiency. I'll try running pyperformance with the marshal patch.
IIRC Thomas Wouters (?) has been working on a patch to move the ref counter out of the PyObject struct and into a dedicated memory area. He proposed the idea to improve cache affinity, reduce cache evictions and to make CoW more efficient. Especially modern ccNUMA machines with multiple processors could benefit from the improvement, but also single processor/multi core machines.
Christian
On Fri, Jan 20, 2017 at 1:40 PM, Christian Heimes <christian@python.org> wrote:
On 2017-01-20 13:15, INADA Naoki wrote:
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request"
You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
Yes. I hadn't thought static memory footprint was so important.

But Instagram tried to increase the CoW efficiency of a prefork application, and got some success with memory usage and CPU throughput. I was surprised by it because prefork only shares the static memory footprint.

Maybe sharing some of the tuples that code objects hold may increase cache efficiency. I'll try running pyperformance with the marshal patch.
IIRC Thomas Wouters (?) has been working on a patch to move the ref counter out of the PyObject struct and into a dedicated memory area. He proposed the idea to improve cache affinity, reduce cache evictions and to make CoW more efficient. Especially modern ccNUMA machines with multiple processors could benefit from the improvement, but also single processor/multi core machines.
FWIW, I have a working patch for that (against trunk a few months back, even though the original idea was for the gilectomy branch), moving just the refcount and not PyGC_HEAD. Performance-wise, in the benchmarks it's a small but consistent loss (2-5% on a noisy machine, as measured by python-benchmarks, not perf), and it breaks the ABI as well as any code that dereferences PyObject.ob_refcnt directly (the field was repurposed and renamed, and exposed as a const* to avoid direct assignment). It also exposes the API awkwardness that CPython doesn't *require* objects to go through a specific mechanism for object initialisation, even though nearly all extension modules do so. (That same API awkwardness made life a little harder when experimenting with BDW GC :P.) I don't believe external refcounts can be made the default without careful redesigning of a new set of PyObject API calls and deprecation of the old ones. -- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!
1. It looks like there is still room for performance improvement of typing w.r.t. how ABCs and issubclass() work. I will try to play with this soon. (The basic idea is that some steps could be avoided for parameterized generics.)

2. I am +1 on having three separate options to independently ignore asserts, docstrings, and annotations.

3. I am -1 on ignoring annotations altogether. Sometimes they could be helpful at runtime: typing.NamedTuple and mypy_extensions.TypedDict are two examples. Also some people use annotations for runtime checks or even for things unrelated to typing. I think it would be a pity to lose these functionalities for small performance gains.

-- Ivan
3. I am -1 on ignoring annotations altogether. Sometimes they could be helpful at runtime: typing.NamedTuple and mypy_extensions.TypedDict are two examples.
Ignoring annotations doesn't mean ignoring typing at all. You can use typing.NamedTuple even when functions don't have __annotations__.
Also some people use annotations for runtime checks or even for things unrelated to typing. I think it would be a pity to lose these functionalities for small performance gains.
Sure. It should be an option, for backward compatibility. Regards,
FWIW, I tried to skip compiler_visit_annotations() in Python/compile.c

a) default:             41278060
b) remove annotations:  37140094
c) (b) + const merge:   35933436

(a-b)/a = 10%
(a-c)/a = 13%

And here are top 3 tracebacks from tracemalloc:

15109615 (/180598)
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675
  File "<frozen importlib._bootstrap>", line 655

1255632 (/8316)
  File "/home/inada-n/local/cpython/lib/python3.7/_weakrefset.py", line 84
    self.data.add(ref(item, self._remove))
  File "/home/inada-n/local/cpython/lib/python3.7/abc.py", line 230
    cls._abc_negative_cache.add(subclass)
  File "/home/inada-n/local/cpython/lib/python3.7/abc.py", line 226
    if issubclass(subclass, scls):
  File "/home/inada-n/local/cpython/lib/python3.7/abc.py", line 226
    if issubclass(subclass, scls):

1056744 (/4020)
  File "/home/inada-n/local/cpython/lib/python3.7/abc.py", line 133
    cls = super().__new__(mcls, name, bases, namespace)
  File "/home/inada-n/local/cpython/lib/python3.7/typing.py", line 125
    return super().__new__(cls, name, bases, namespace)
  File "/home/inada-n/local/cpython/lib/python3.7/typing.py", line 977
    self = super().__new__(cls, name, bases, namespace, _root=True)
  File "/home/inada-n/local/cpython/lib/python3.7/typing.py", line 1105
    orig_bases=self.__orig_bases__)

Regards,
2017-01-24 15:00 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
And here are top 3 tracebacks from tracemalloc:
15109615 (/180598)
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675
  File "<frozen importlib._bootstrap>", line 655
FYI at Python startup, usually the largest memory block comes from the dictionary used to intern all strings ("interned" in unicodeobject.c). The traceback is never relevant for this specific object. Victor
On Tue, Jan 24, 2017 at 11:08 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-01-24 15:00 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
And here are top 3 tracebacks from tracemalloc:
15109615 (/180598)
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675
  File "<frozen importlib._bootstrap>", line 655
FYI at Python startup, usually the largest memory block comes from the dictionary used to intern all strings ("interned" in unicodeobject.c). The traceback is never relevant for this specific object.
Victor
Yes! It took me a few hours to notice that. With PYTHONTRACEMALLOC=10, marshal.loads() of a small module (15KB pyc) looks like it eats 1.3MB. I think a small stacktrace depth (3~4) is better for showing a summary of a large application. BTW, about 1.3MB of the 15MB (marshal.loads()) was for the intern dict, as far as I remember.
More detailed information:

## With annotations

=== tracemalloc stat ===
traced: (46969277, 46983753)
18,048,888 / 181112
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675

=== size by types ===
dict      9,083,816 (8,870.91KB) / 21846 = 415.811bytes (21.38%)
tuple     6,420,960 (6,270.47KB) / 86781 = 73.990bytes (15.11%)
str       6,050,213 (5,908.41KB) / 77527 = 78.040bytes (14.24%)
function  2,772,224 (2,707.25KB) / 20384 = 136.000bytes (6.53%)
code      2,744,888 (2,680.55KB) / 18987 = 144.567bytes (6.46%)
type      2,713,552 (2,649.95KB) / 2769 = 979.975bytes (6.39%)
bytes     2,650,838 (2,588.71KB) / 38723 = 68.456bytes (6.24%)
set       2,445,280 (2,387.97KB) / 6969 = 350.880bytes (5.76%)
weakref   1,255,600 (1,226.17KB) / 15695 = 80.000bytes (2.96%)
list        707,336 (690.76KB) / 6628 = 106.719bytes (1.66%)

=== dict stat ===
t, size, total (%) / count
3, 256, 1,479,424 (15.68%) / 5779.0
3, 1,200, 1,330,800 (14.11%) / 1109.0
3, 1,310,832, 1,310,832 (13.90%) / 1.0
3, 664, 1,287,496 (13.65%) / 1939.0
7, 128, 756,352 (8.02%) / 5909.0
3, 384, 707,328 (7.50%) / 1842.0
3, 2,296, 642,880 (6.81%) / 280.0
0, 256, 378,112 (4.01%) / 1477.0
7, 168, 251,832 (2.67%) / 1499.0
3, 4,720, 221,840 (2.35%) / 47.0
3, 9,336, 130,704 (1.39%) / 14.0
7, 88, 105,072 (1.11%) / 1194.0

* t=7 key-sharing dict, t=3 interned string key only, t=1 string key only, t=0 non string key is used

## Stripped annotations

=== tracemalloc stat ===
traced: (42383739, 42397983)
18,069,806 / 181346
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675

=== size by types ===
dict      7,913,144 (7,727.68KB) / 17598 = 449.662bytes (20.62%)
tuple     6,149,120 (6,005.00KB) / 82734 = 74.324bytes (16.02%)
str       6,070,083 (5,927.82KB) / 77741 = 78.081bytes (15.82%)
code      2,744,312 (2,679.99KB) / 18983 = 144.567bytes (7.15%)
type      2,713,552 (2,649.95KB) / 2769 = 979.975bytes (7.07%)
bytes     2,650,464 (2,588.34KB) / 38715 = 68.461bytes (6.91%)
function  2,547,280 (2,487.58KB) / 18730 = 136.000bytes (6.64%)
set       1,423,520 (1,390.16KB) / 4627 = 307.655bytes (3.71%)
list        634,472 (619.60KB) / 5454 = 116.331bytes (1.65%)
int         608,784 (594.52KB) / 21021 = 28.961bytes (1.59%)

=== dict stat ===
t, size, total (%) / count
3, 1,200, 1,316,400 (16.06%) / 1097.0
3, 1,310,832, 1,310,832 (16.00%) / 1.0
3, 664, 942,216 (11.50%) / 1419.0
3, 256, 861,184 (10.51%) / 3364.0
3, 384, 657,024 (8.02%) / 1711.0
3, 2,296, 640,584 (7.82%) / 279.0
7, 128, 606,464 (7.40%) / 4738.0
0, 256, 379,904 (4.64%) / 1484.0
7, 168, 251,832 (3.07%) / 1499.0
3, 4,720, 221,840 (2.71%) / 47.0
3, 9,336, 130,704 (1.59%) / 14.0
7, 88, 105,248 (1.28%) / 1196.0
7, 256, 86,784 (1.06%) / 339.0

## Stripped annotation + without pydebug

=== tracemalloc stat ===
traced: (37371660, 40814265)
9,812,083 / 111082
  File "<frozen importlib._bootstrap>", line 205
  File "<frozen importlib._bootstrap_external>", line 742
  File "<frozen importlib._bootstrap_external>", line 782
6,761,207 / 85614
  File "<frozen importlib._bootstrap_external>", line 488
  File "<frozen importlib._bootstrap_external>", line 780
  File "<frozen importlib._bootstrap_external>", line 675

## Ideas about memory optimization

a) Split PyGC_Head from the object

Reduces 2 words (16 bytes) from each tuple.
>>> 82734 * 16 / 1024
1292.71875
So the estimated saving is about -1.2MB.

b) Concat co_consts, co_names, co_varnames, co_freevars into one tuple, or embed them into the code object.

Each tuple has 3 (GC head) + 3 (refcnt, *type, length) = 6 words of overhead (or 4 words if (a) is applied). If we can reduce 3 tuples, 18 words = 144 bytes (or 12 words = 96 bytes) can be reduced.
>>> 18983 * 144
2733552
>>> 18983 * 96
1822368
But co_freevars is an empty tuple in most cases, so the real effect is smaller than 2.7MB. If we can embed them into the code object, we can estimate -2.7MB. (There are co_cellvars too, but I don't know much about it, especially whether it is GC tracked or not.)

c) (interned) string-key-only dict

20% of memory is used for dicts, and 70% of dicts have interned string keys. The current DictKeyEntry is 3 words: {key, hash, value}. But if we can assume all keys are strings, the hash can be taken from the key. If we use a 2-word entry {key, value} for such dicts, I think dicts can be 25% smaller.
>>> 7913144 * 0.25 / 1024
1931.919921875
So I estimate -1.9MB.

If we can implement (a)~(c), I estimate memory usage on Python (--without-pydebug) can be reduced from 35.6MB to 30MB, roughly.

But I think the -Onoannotation option is the top priority. It can reduce 4MB, even though we use annotations only in our own code. If major libraries start using annotations, this number will be larger.
On Wed, 25 Jan 2017 20:54:02 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
## Stripped annotation + without pydebug
Does this mean the other measurements were done with pydebug enabled? pydebug is not meant to be used on production systems so, without wanting to disparage the effort this went into these measurements, I'm afraid that makes them not very useful. Regards Antoine.
On Thu, Jan 26, 2017 at 2:33 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Wed, 25 Jan 2017 20:54:02 +0900 INADA Naoki <songofacandy@gmail.com> wrote:
## Stripped annotation + without pydebug
Does this mean the other measurements were done with pydebug enabled? pydebug is not meant to be used on production systems so, without wanting to disparage the effort this went into these measurements, I'm afraid that makes them not very useful.
Regards
Antoine.
Yes. I used the sys.getobjects() API, which is available only in pydebug mode. Since it adds two words to every object for the doubly linked list, I did sys.getsizeof(o) - 16 when calculating the memory used by each type. While it may be a bit different from --without-pydebug, I believe it's useful enough to estimate how much memory is used by each type.
On Jan 24, 2017 3:35 AM, "Thomas Wouters" <thomas@python.org> wrote: On Fri, Jan 20, 2017 at 1:40 PM, Christian Heimes <christian@python.org> wrote:
On 2017-01-20 13:15, INADA Naoki wrote:
"this script counts static memory usage. It doesn’t care about dynamic memory usage of processing real request"
You may be trying to optimize something which is only a very small fraction of your actual memory footprint. That said, the marshal module could certainly try to intern some tuples and other immutable structures.
Yes. I hadn't thought static memory footprint was so important.

But Instagram tried to increase the CoW efficiency of a prefork application, and got some success with memory usage and CPU throughput. I was surprised by it because prefork only shares the static memory footprint.

Maybe sharing some of the tuples that code objects hold may increase cache efficiency. I'll try running pyperformance with the marshal patch.
IIRC Thomas Wouters (?) has been working on a patch to move the ref counter out of the PyObject struct and into a dedicated memory area. He proposed the idea to improve cache affinity, reduce cache evictions and to make CoW more efficient. Especially modern ccNUMA machines with multiple processors could benefit from the improvement, but also single processor/multi core machines.
FWIW, I have a working patch for that (against trunk a few months back, even though the original idea was for the gilectomy branch), moving just the refcount and not PyGC_HEAD. Performance-wise, in the benchmarks it's a small but consistent loss (2-5% on a noisy machine, as measured by python-benchmarks, not perf), and it breaks the ABI as well as any code that dereferences PyObject.ob_refcnt directly (the field was repurposed and renamed, and exposed as a const* to avoid direct assignment). It also exposes the API awkwardness that CPython doesn't *require* objects to go through a specific mechanism for object initialisation, even though nearly all extension modules do so. (That same API awkwardness made life a little harder when experimenting with BDW GC :P.) I don't believe external refcounts can be made the default without careful redesigning of a new set of PyObject API calls and deprecation of the old ones. The thing I found most surprising about that blog post was that contrary to common wisdom, refcnt updates per se had essentially no effect on the amount of memory shared between CoW processes, and the problems were all due to the cycle collector. (Though I guess it's still possible that part of the problems caused by the cycle collector are due to it touching ob_refcnt.) It's promising too though, because the GC metadata is much less exposed to extension modules than PyObject_HEAD is, and the access patterns are presumably (?) much more bursty. It'd be really interesting to see how things performed if packing just PyGC_HEAD but *not* ob_refcnt into a dedicated region. -n
On Tue, 24 Jan 2017 10:21:45 -0800 Nathaniel Smith <njs@pobox.com> wrote:
The thing I found most surprising about that blog post was that contrary to common wisdom, refcnt updates per se had essentially no effect on the amount of memory shared between CoW processes, and the problems were all due to the cycle collector.
Indeed, it was unexpected, though it can be explained easily: refcount updates touch only the live working set, while GC passes scan through all existing objects, even those that are never actually used. Regards Antoine.
On 20 January 2017 at 11:49, INADA Naoki <songofacandy@gmail.com> wrote:
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
This idea already appeared a few times. I proposed to introduce a flag (e.g. -OOO) to ignore function and variable annotations in compile.c. It was decided to postpone this, but maybe we can get back to this idea. In 3.6, typing is already (quite heavily) optimized for both speed and space. I remember doing an experiment comparing a memory footprint with and without annotations, the difference was few percent. Do you have such comparison (with and without annotations) for your app? It would be nice to have a realistic number to estimate what would the additional optimization flag save. -- Ivan
On Fri, Jan 20, 2017 at 8:52 PM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
On 20 January 2017 at 11:49, INADA Naoki <songofacandy@gmail.com> wrote:
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
This idea already appeared a few times. I proposed to introduce a flag (e.g. -OOO) to ignore function and variable annotations in compile.c. It was decided to postpone this, but maybe we can get back to this idea.
In 3.6, typing is already (quite heavily) optimized for both speed and space. I remember doing an experiment comparing a memory footprint with and without annotations, the difference was few percent. Do you have such comparison (with and without annotations) for your app? It would be nice to have a realistic number to estimate what would the additional optimization flag save.
I'm sorry, I just read the blog article yesterday and investigated one application today. I don't have an idea yet of how to compare the memory overhead of __annotations__. And the project whose codebase I borrowed started using typing very recently, after reading Dropbox's story, so I don't know what % of functions are typed. I'll survey more about it later, hopefully this month.
On Fri, Jan 20, 2017 at 8:52 PM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
On 20 January 2017 at 11:49, INADA Naoki <songofacandy@gmail.com> wrote:
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
This idea already appeared a few times. I proposed to introduce a flag (e.g. -OOO) to ignore function and variable annotations in compile.c. It was decided to postpone this, but maybe we can get back to this idea.
In 3.6, typing is already (quite heavily) optimized for both speed and space. I remember doing an experiment comparing a memory footprint with and without annotations, the difference was few percent. Do you have such comparison (with and without annotations) for your app? It would be nice to have a realistic number to estimate what would the additional optimization flag save.
-- Ivan
Hi, Ivan.

I investigated why our app has so many WeakSet today.

We have dozens or hundreds of annotations like Iterable[User] or List[User]. (User is one example of the application's domain objects; there are hundreds of such classes.)

On the other hand, SQLAlchemy calls isinstance(obj, collections.Iterable) many times, in [sqlalchemy.util._collections.to_list](https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/util/_collec...) method.

So there are (# of iterable subclasses) weaksets for negative cache, and each weakset contains (# of column types) entries. That's why WeakSet ate much RAM. It may slow down application startup too, because thousands of __subclasscheck__ calls are made.

I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.

FWIW, stacktrace is like this:

  File "/Users/inada-n/local/py37dbg/lib/python3.7/_weakrefset.py", line 84
    self.data.add(ref(item, self._remove))
  File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 233
    cls._abc_negative_cache.add(subclass)
  File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226
    if issubclass(subclass, scls):
  File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226
    if issubclass(subclass, scls):
  File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 191
    return cls.__subclasscheck__(subclass)
  File "venv/lib/python3.7/site-packages/sqlalchemy/util/_collections.py", line 803
    or not isinstance(x, collections.Iterable):
  File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1680
    columns = util.to_list(prop)
  File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1575
    prop = self._property_from_column(key, prop)
  File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1371
    setparent=True)
  File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 675
    self._configure_properties()

Regards,
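For reference, the difference between the two spellings boils down to this (a minimal sketch; User stands in for any application class, and the memory/startup effect depends on the Python and typing versions in use):

from typing import List

class User:
    pass

# Evaluated form: List[User] is constructed at import time, creating a new
# parameterized class object that can later show up in abc caches when
# isinstance()/issubclass() checks run elsewhere (e.g. inside SQLAlchemy).
def load_users_eager() -> List[User]:
    return []

# String (forward-reference) form: only the string is stored in
# __annotations__ and nothing is constructed at import time; static
# checkers such as mypy treat both forms the same.
def load_users_lazy() -> 'List[User]':
    return []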
2017-01-23 12:25 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
I would prefer a Python option (ex: "-o noannotation" command line option) to opt-out annotations rather than having to write annotations in strings, which is IMHO more "ugly". Victor
On Mon, Jan 23, 2017 at 8:33 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-01-23 12:25 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
I would prefer a Python option (ex: "-o noannotation" command line option) to opt-out annotations rather than having to write annotations in strings, which is IMHO more "ugly".
Victor
Personally speaking, I hope annotations are just static hints with zero overhead at runtime (startup time, memory consumption, and execution speed).

Anyway, many users are starting to use typing, for code completion or static checking. And very few users noticed that it affects the performance of `isinstance(x, collections.Sequence)`. Python 3.7 may be too slow to help them. Can't we skip the abc registration of typing.List[MyClass] completely?

I'm sorry if it's a silly idea. I don't know the background of the current typing.py design. And I don't use abc much either.

Naoki
On Mon, 23 Jan 2017 at 04:27 INADA Naoki <songofacandy@gmail.com> wrote:
On Mon, Jan 23, 2017 at 8:33 PM, Victor Stinner <victor.stinner@gmail.com> wrote:
2017-01-23 12:25 GMT+01:00 INADA Naoki <songofacandy@gmail.com>:
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
I would prefer a Python option (ex: "-o noannotation" command line option) to opt-out annotations rather than having to write annotations in strings, which is IMHO more "ugly".
So basically the equivalent of -OO for docstrings? Maybe this can be the final motivator for some of us to come up with a design to generalize -O or something as it keeps coming up.
Victor
Personally speaking, I hope annotations are just static hints with zero overhead at runtime (startup time, memory consumption, and execution speed).
Local variable annotations are nothing but info in the AST. Parameter annotations and class annotations are stored on their respective objects so there's memory usage from that and the construction of them at object creation time, but that's it (e.g. the cost of creating func.__annotations__ when the function object is created is all you pay for performance-wise). And using strings will make those introspective attributes difficult to use, hence why I don't think people have said to do that everywhere.
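Concretely, for illustration (standard CPython behaviour):

from typing import List

def greet(name: str, times: int = 1) -> List[str]:
    return ["Hello, %s" % name] * times

# Built once, when the function object is created; this dict (and the
# List[str] object it references) is the runtime cost being discussed.
print(greet.__annotations__)
# {'name': <class 'str'>, 'times': <class 'int'>, 'return': typing.List[str]}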
Anyway, many users are starting to use typing, for code completion or static checking. And very few users noticed that it affects the performance of `isinstance(x, collections.Sequence)`. Python 3.7 may be too slow to help them. Can't we skip the abc registration of typing.List[MyClass] completely?
I'm sorry if it's a silly idea. I don't know the background of the current typing.py design. And I don't use abc much either.
Since isinstance() checks are expected to be rare I don't think anyone has worried too much about the performance beyond the initial work to introduce ABCs and __instancecheck__.
On Jan 23, 2017, at 12:10 PM, Brett Cannon <brett@python.org> wrote:
On Mon, 23 Jan 2017 at 04:27 INADA Naoki <songofacandy@gmail.com <mailto:songofacandy@gmail.com>> wrote: On Mon, Jan 23, 2017 at 8:33 PM, Victor Stinner <victor.stinner@gmail.com <mailto:victor.stinner@gmail.com>> wrote:
2017-01-23 12:25 GMT+01:00 INADA Naoki <songofacandy@gmail.com <mailto:songofacandy@gmail.com>>:
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
I would prefer a Python option (ex: "-o noannotation" command line option) to opt-out annotations rather than having to write annotations in strings, which is IMHO more "ugly".
So basically the equivalent of -OO for docstrings? Maybe this can be the final motivator for some of us to come up with a design to generalize -O or something as it keeps coming up.
Yes, please. We've talked about generalizing this for years now. FWIW, I know of projects that run with -OO for the memory wins stemming from docstrings and had to codemod assert statements into a "prod_assert" function call to achieve this. If you think docstrings aren't that much, multiply this by a few hundred processes on a box and it ends up being a substantial win to strip them out.
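For context, such a helper is an ordinary function call, so it survives -O/-OO where the assert statement does not (a minimal sketch; the exact signature those projects used is an assumption):

def prod_assert(condition, message=None):
    # `assert` statements are compiled away under -O/-OO; a plain call is not.
    if not condition:
        raise AssertionError(message or "assertion failed")

# Usage:
# prod_assert(user.id is not None, "user must be saved before linking")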
Victor
Personally speaking, I hope annotations are just static hints with zero overhead at runtime (startup time, memory consumption, and execution speed).
Local variable annotations are nothing but info in the AST. Parameter annotations and class annotations are stored on their respective objects so there's memory usage from that and the construction of them at object creation time, but that's it (e.g. the cost of creating func.__annotations__ when the function object is created is all you pay for performance-wise). And using strings will make those introspective attributes difficult to use, hence why I don't think people have said to do that everywhere.
I suggested making all annotations just strings at runtime and PEP 484 still lists this as a possible course for the future. So far Guido blocked this on a legitimate question: how much do type hints actually cost? Nobody knows yet, the biggest annotated codebase is at Dropbox and this is using comments (so no runtime cost).
Anyway, many users are starting to use typing, for code completion or static checking. And very few users noticed that it affects the performance of `isinstance(x, collections.Sequence)`. Python 3.7 may be too slow to help them. Can't we skip the abc registration of typing.List[MyClass] completely?
I'm sorry if it's a silly idea. I don't know the background of the current typing.py design. And I don't use abc much either.
Since isinstance() checks are expected to be rare I don't think anyone has worried too much about the performance beyond the initial work to introduce ABCs and __instancecheck__.
Similar to the above, I would advise against crippling functionality unless we prove this is affecting performance in a significant way. - Ł
So basically the equivalent of -OO for docstrings? Maybe this can be the final motivator for some of us to come up with a design to generalize -O or something as it keeps coming up. Yes, please. We've talked about generalizing this for years now. FWIW, I know of projects that run with -OO for the memory wins stemming from docstrings and had to codemod assert statements into a "prod_assert" function call to achieve this. If you think docstrings aren't that much, multiply this by a few hundred processes on a box and it ends up being a substantial win to strip them out.
Strong +1.
So far Guido blocked this on a legitimate question: how much do type hints actually cost? Nobody knows yet,
"Nobody knows yet" is difficult problem. We may think "let's keep runtime cost, because nobody knows how large it is". Users may think "let's use string/comment form annotation to avoid runtime cost, because nobody knows how large it is." And problem may happen in closed source application. When building closed source application, the project can drop Python 2 support easily, and buy PyCharm for all members. (BTW, PyCharm's survey result [1] is very encouraging. PyCharm users adopts Python 3 (relative) early. I think they will adopt typing early too.) Early and large adopters of typing may be such teams (like my company). And they may feel "Python is slow and fat!" if there are no easy way to check runtime overhead of typing. Optimize option to drop annotation will provide (1) easy way to check runtime overhead of typing, and (2) straightforward solution to remove the overhead, if it isn't negligible. [1]: https://www.jetbrains.com/pycharm/python-developers-survey-2016/
Inada-san, I have made a PR for typing module upstream https://github.com/python/typing/pull/383 It should reduce the memory consumption significantly (and also increase isinstance() speed). Could you please try it with your real code base and test memory consumption (and maybe speed) as compared to master? -- Ivan On 23 January 2017 at 12:25, INADA Naoki <songofacandy@gmail.com> wrote:
On Fri, Jan 20, 2017 at 8:52 PM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
On 20 January 2017 at 11:49, INADA Naoki <songofacandy@gmail.com> wrote:
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
This idea already appeared a few times. I proposed to introduce a flag (e.g. -OOO) to ignore function and variable annotations in compile.c. It was decided to postpone this, but maybe we can get back to this idea.
In 3.6, typing is already (quite heavily) optimized for both speed and space. I remember doing an experiment comparing a memory footprint with and without annotations, the difference was few percent. Do you have such comparison (with and without annotations) for your app? It would be nice to have a realistic number to estimate what would the additional optimization flag save.
-- Ivan
Hi, Ivan.
I investigated why our app has so many WeakSet today.
We have dozens or hundreds of annotations like Iterable[User] or List[User]. (User is one example of the application's domain objects; there are hundreds of such classes.)
On the other hand, SQLAlchemy calls isinstance(obj, collections.Iterable) many times, in [sqlalchemy.util._collections.to_list](https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/util/_collections.py#L795-L804) method.
So there are (# of iterable subclasses) weaksets for negative cache, and each weakset contains (# of column types) entries. That's why WeakSet ate much RAM.
It may slow down application startup too, because thousands of __subclasscheck__ calls are made.
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
FWIW, stacktrace is like this:
File "/Users/inada-n/local/py37dbg/lib/python3.7/_weakrefset.py", line 84 self.data.add(ref(item, self._remove)) File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 233 cls._abc_negative_cache.add(subclass) File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226 if issubclass(subclass, scls): File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226 if issubclass(subclass, scls): File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 191 return cls.__subclasscheck__(subclass) File "venv/lib/python3.7/site-packages/sqlalchemy/util/_collections.py", line 803 or not isinstance(x, collections.Iterable): File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1680 columns = util.to_list(prop) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1575 prop = self._property_from_column(key, prop) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1371 setparent=True) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 675 self._configure_properties()<PasteEnd>
Regards,
Thanks, Ivan. I confirmed 3000 negative cache entries were removed by your patch! https://gist.github.com/methane/3c34f11fb677365a7e92afe73aca24e7 On Thu, Feb 2, 2017 at 1:16 AM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
Inada-san,
I have made a PR for typing module upstream https://github.com/python/typing/pull/383 It should reduce the memory consumption significantly (and also increase isinstance() speed). Could you please try it with your real code base and test memory consumption (and maybe speed) as compared to master?
-- Ivan
On 23 January 2017 at 12:25, INADA Naoki <songofacandy@gmail.com> wrote:
On Fri, Jan 20, 2017 at 8:52 PM, Ivan Levkivskyi <levkivskyi@gmail.com> wrote:
On 20 January 2017 at 11:49, INADA Naoki <songofacandy@gmail.com> wrote:
* typing may increase the memory footprint, through functions' __annotations__ and abc.
* Can we add an option to remove or lazily evaluate __annotations__?
This idea already appeared a few times. I proposed to introduce a flag (e.g. -OOO) to ignore function and variable annotations in compile.c. It was decided to postpone this, but maybe we can get back to this idea.
In 3.6, typing is already (quite heavily) optimized for both speed and space. I remember doing an experiment comparing a memory footprint with and without annotations, the difference was few percent. Do you have such comparison (with and without annotations) for your app? It would be nice to have a realistic number to estimate what would the additional optimization flag save.
-- Ivan
Hi, Ivan.
I investigated why our app has so many WeakSet today.
We have dozens or hundreds of annotations like Iterable[User] or List[User]. (User is one example of the application's domain objects; there are hundreds of such classes.)
On the other hand, SQLAlchemy calls isinstance(obj, collections.Iterable) many times, in [sqlalchemy.util._collections.to_list](https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/util/_collec...) method.
So there are (# of iterable subclasses) weaksets for negative cache, and each weakset contains (# of column types) entries. That's why WeakSet ate much RAM.
It may slow down application startup too, because thousands of __subclasscheck__ calls are made.
I gave advice to use 'List[User]' instead of List[User] to the team of the project, if the team think RAM usage or boot speed is important.
FWIW, stacktrace is like this:
File "/Users/inada-n/local/py37dbg/lib/python3.7/_weakrefset.py", line 84 self.data.add(ref(item, self._remove)) File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 233 cls._abc_negative_cache.add(subclass) File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226 if issubclass(subclass, scls): File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 226 if issubclass(subclass, scls): File "/Users/inada-n/local/py37dbg/lib/python3.7/abc.py", line 191 return cls.__subclasscheck__(subclass) File "venv/lib/python3.7/site-packages/sqlalchemy/util/_collections.py", line 803 or not isinstance(x, collections.Iterable): File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1680 columns = util.to_list(prop) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1575 prop = self._property_from_column(key, prop) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 1371 setparent=True) File "venv/lib/python3.7/site-packages/sqlalchemy/orm/mapper.py", line 675 self._configure_properties()<PasteEnd>
Regards,
I've filed an issue about merging tuples: http://bugs.python.org/issue29336 I'll try the patch with my company's codebase again next week. But could someone try the patch with a real-world large application too? Or, if you know of a large OSS application that is easy to install, could you share a requirements.txt plus a script that imports many of the application's modules?
participants (9):
- Antoine Pitrou
- Brett Cannon
- Christian Heimes
- INADA Naoki
- Ivan Levkivskyi
- Lukasz Langa
- Nathaniel Smith
- Thomas Wouters
- Victor Stinner