A memory map based data persistence and startup speedup approach
Hi folks,

As illustrated in faster-cpython#150 [1], we have implemented a mechanism that persists a subset of Python data types with mmap, and can therefore reduce package import time by caching code objects. It can be seen as a more eager pyc format: both serve the same purpose, but our approach tries to avoid [de]serialization. As a result, we get a speedup of ~15% in overall Python startup.

Currently, we’ve made it a third-party library and have been working on open-sourcing it.

Our implementation (whose unofficial name is “pycds”) mainly contains two parts:

- importlib hooks. These implement the mechanism to dump code objects to an archive, plus a `Finder` that supports loading code objects from mapped memory.
- Dumping and loading (a subset of) Python types with mmap. In this part, we deal with 1) ASLR, by patching `ob_type` fields; 2) hash seed randomization, by supporting only basic types that don’t have a hash-based layout (i.e. dict is not supported); 3) interned strings, by re-interning strings while loading the mmap archive; and so on.

After pycds has been installed, the complete workflow of our approach has three steps:

1. Record the names of imported packages to heap.lst:
   `PYCDSMODE=TRACE PYCDSLIST=heap.lst python run.py`
2. Dump a memory archive of the code objects of the imported packages (this step does not involve the Python script):
   `PYCDSMODE=DUMP PYCDSLIST=heap.lst PYCDSARCHIVE=heap.img python`
3. Run other Python processes with the created archive:
   `PYCDSMODE=SHARE PYCDSARCHIVE=heap.img python run.py`

We could even make use of immortal objects if PEP 683 [2] were accepted, which could give CDS further performance improvements. Currently, every archived object is virtually immortal: we increment the refcount of each object copied to the archive by 1 so that it is never deallocated. However, without changes to CPython, the refcount fields of archived objects are still updated, which costs extra memory footprint due to CoW.

More background and implementation details can be found at [1].
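To make the importlib-hook part more concrete, here is a minimal sketch of a meta path finder that serves code objects out of a memory-mapped archive. It is not the pycds code or API: the names `dump_archive`, `ArchiveFinder` and `demo`, and the file layout (length-prefixed index followed by blobs), are hypothetical and chosen only for illustration. More importantly, the sketch stores code objects with `marshal`, which is exactly the [de]serialization step that pycds is described as avoiding; mapping live objects directly needs C-level work (patching `ob_type`, re-interning strings) that pure Python cannot show.

```python
import importlib
import importlib.abc
import importlib.util
import marshal
import mmap
import os
import struct
import sys
import tempfile


def dump_archive(sources, path):
    """Hypothetical dump step: compile {module_name: source} into one file.

    Layout (an assumption of this sketch): 8-byte header length,
    marshalled index {name: (offset, size)}, then the code blobs.
    """
    blobs = {name: marshal.dumps(compile(src, f"<archive:{name}>", "exec"))
             for name, src in sources.items()}
    index, offset = {}, 0
    for name, blob in blobs.items():
        index[name] = (offset, len(blob))
        offset += len(blob)
    header = marshal.dumps(index)
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        for blob in blobs.values():
            f.write(blob)


class ArchiveFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader that read code objects from a read-only mmap."""

    def __init__(self, path):
        with open(path, "rb") as f:
            self._map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        (header_len,) = struct.unpack("<Q", self._map[:8])
        self._index = marshal.loads(self._map[8:8 + header_len])
        self._base = 8 + header_len  # blobs start right after the index

    def find_spec(self, fullname, path=None, target=None):
        # Claim only modules present in the archive; returning None lets
        # the regular import machinery handle everything else.
        if fullname in self._index:
            return importlib.util.spec_from_loader(fullname, self)
        return None

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        offset, size = self._index[module.__name__]
        start = self._base + offset
        # Stand-in for pycds: marshal.loads copies and deserializes here.
        code = marshal.loads(self._map[start:start + size])
        exec(code, module.__dict__)

    def close(self):
        self._map.close()


def demo():
    """Build a one-module archive, import from it, and return a value."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "heap.img")
        dump_archive({"archived_demo": "ANSWER = 6 * 7\n"}, archive)
        finder = ArchiveFinder(archive)
        sys.meta_path.insert(0, finder)
        try:
            return importlib.import_module("archived_demo").ANSWER
        finally:
            sys.meta_path.remove(finder)
            sys.modules.pop("archived_demo", None)
            finder.close()
```

Calling `demo()` imports the module `archived_demo` through the finder rather than from the filesystem. Putting the finder at the front of `sys.meta_path` is a choice of this sketch, so that archived modules win over on-disk copies; where pycds registers its `Finder` is not stated above.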
We think it could be an effective way to improve Python’s startup performance, and it could even do more, such as sharing large data between Python instances. Suggestions and questions are welcome.

Best,
Yichen Yan
Alibaba Compiler Group

[1] “Faster startup -- Share code objects from memory-mapped file”, https://github.com/faster-cpython/ideas/discussions/150
[2] PEP 683: "Immortal Objects, Using a Fixed Refcount" (draft), https://mail.python.org/archives/list/python-dev@python.org/message/TPLEYDCX...
严懿宸(文极) via Python-ideas writes:
Hi folks, as illustrated in faster-cpython#150 [1], we have implemented a mechanism that supports data persistence of a subset of Python data types with mmap, and can therefore reduce package import time by caching code objects.
Nice!
Currently, we’ve made it a third-party library and have been working on open-sourcing.
Thank you! I guess "working on" means "the lawyers have it" so we'll be patient. :-)

I'm not sure whether your purpose is to get code review, or general exposure. If the former, you might get better results from the Speed list (https://mail.python.org/mailman3/lists/speed.python.org/). If the latter, you should also post to python-list (https://mail.python.org/mailman/listinfo/python-list or comp.lang.python on Usenet).

Regards,
严懿宸(文极) via Python-ideas writes:
Currently, we’ve made it a third-party library and have been working on open-sourcing.
Stephen J. Turnbull wrote:
Thank you! I guess "working on" means "the lawyers have it" so we'll be patient. :-)
I'm not sure whether your purpose is to get code review, or general exposure. If the former, you might get better results from the Speed list (https://mail.python.org/mailman3/lists/speed.python.org/). If the latter, you should also post to python-list (https://mail.python.org/mailman/listinfo/python-list or comp.lang.python on Usenet).
I'm not sure whether the speed list is still active. I'd personally try python-dev@python.org or the python-dev section on discuss.python.org.

(Full disclosure: I have seen earlier versions of Yichen's group's work and I am very impressed.)

--Guido
Thanks for your advice! We're looking for opinions on the overall idea and workflow, and will welcome code reviews once our lawyers are happy and we can publish the code :) I'll check python-dev to see if there are any questions.

Best,
Yichen

------------------------------------------------------------------
From: Guido van Rossum <guido@python.org>
Sent At: 2022 Feb. 21 (Mon.) 05:36
To: python-ideas <python-ideas@python.org>
Subject: [Python-ideas] Re: A memory map based data persistence and startup speedup approach
participants (4)
- Guido van Rossum
- Stephen J. Turnbull
- Yichen Yan
- 严懿宸(文极)