On Tue, Jul 9, 2019 at 5:29 PM Inada Naoki <songofacandy@gmail.com> wrote:
On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.peters@gmail.com> wrote:
I was more intrigued by your first (speed) comparison:
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
Now _that's_ interesting ;-) Looks like spectral_norm recycles many short-lived Python floats at a swift pace. So memory management should account for a large part of its runtime (the arithmetic it does is cheap in comparison), and obmalloc and mimalloc should both excel at recycling mountains of small objects. Why is mimalloc significantly faster?
Totally agree. I'll investigate this next.
I compared the "perf" output of mimalloc and pymalloc, and I managed to optimize pymalloc!

  $ ./python bm_spectral_norm.py --compare-to ./python-master
  python-master: ..................... 199 ms +- 1 ms
  python: ..................... 182 ms +- 4 ms

  Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +- 4 ms: 1.10x faster (-9%)

mimalloc uses many small static (inline) functions. On the other hand, pymalloc_alloc and pymalloc_free are large functions that also contain the slow/rare paths.

PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free. But the compiler doesn't know which part of pymalloc_alloc and pymalloc_free is hot, so gcc failed to choose the right code to inline. The remaining parts of pymalloc_alloc and pymalloc_free are called as functions, and many push/pop instructions are executed because they contain complex logic.

So I tried using LIKELY/UNLIKELY macros to tell the compiler which part is hot. But I still need to use "static inline" for pymalloc_alloc and pymalloc_free [1]. The generated assembly is optimized well: the hot code is at the top of PyObject_Malloc [2] and PyObject_Free [3]. But there is a lot of code duplication in PyObject_Malloc, PyObject_Calloc, etc...

[1] https://github.com/python/cpython/pull/14674/files
[2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall...
[3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmall...

I will try to split pymalloc_alloc and pymalloc_free into smaller functions.

Apart from the above, there is one more important difference.

pymalloc returns a pool to the freepool list as soon as the pool becomes empty. On the other hand, mimalloc does not return a "page" (similar to a "pool" in pymalloc) every time it becomes empty [4]. So it can avoid rebuilding the linked list of free blocks. I think pymalloc should do the same optimization.

[4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e...

BTW, which is the proper name: pymalloc or obmalloc?

Regards,
-- 
Inada Naoki <songofacandy@gmail.com>
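
P.S. To illustrate the hot/cold split described above, here is a minimal sketch. It is not the code from the pull request: toy_alloc, toy_alloc_slow, toy_free, BLOCK_SIZE and BLOCKS_PER_CHUNK are hypothetical names for a toy fixed-size free-list allocator, and the macros assume GCC or Clang (__builtin_expect and the noinline attribute). The idea is only that the common case stays in a small inline function while the rare case lives in a separate out-of-line function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Branch hints. __builtin_expect is a GCC/Clang builtin; on other
   compilers the macros fall back to plain expressions. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#  define NOINLINE    __attribute__((noinline))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#  define NOINLINE
#endif

#define BLOCK_SIZE       32   /* hypothetical fixed block size */
#define BLOCKS_PER_CHUNK 64   /* hypothetical blocks per chunk */

typedef struct block { struct block *next; } block;

static block *freelist = NULL;  /* singly linked list of free blocks */

/* Slow/rare path: carve a fresh chunk into blocks. Kept out of line so
   that the hot path below stays small. Chunks are never released in
   this toy version. */
static NOINLINE void *toy_alloc_slow(void)
{
    char *chunk = malloc((size_t)BLOCK_SIZE * BLOCKS_PER_CHUNK);
    if (UNLIKELY(chunk == NULL)) {
        return NULL;
    }
    /* Push blocks N-1 .. 1 onto the free list; block 0 is returned. */
    for (int i = BLOCKS_PER_CHUNK - 1; i >= 1; i--) {
        block *b = (block *)(chunk + (size_t)i * BLOCK_SIZE);
        b->next = freelist;
        freelist = b;
    }
    return chunk;
}

/* Hot path: pop a block from the free list. */
static inline void *toy_alloc(void)
{
    block *b = freelist;
    if (LIKELY(b != NULL)) {
        freelist = b->next;
        return b;
    }
    return toy_alloc_slow();
}

/* Hot path: push the block back onto the free list. */
static inline void toy_free(void *p)
{
    assert(p != NULL);
    block *b = p;
    b->next = freelist;
    freelist = b;
}
```

With this shape, a caller that inlines toy_alloc only pays for a load, a test and a store in the common case. The call to toy_alloc_slow, and the register saves that come with it, happens only when the free list is empty.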