[Python-Dev] The untuned tunable parameter ARENA_SIZE

Victor Stinner victor.stinner at gmail.com
Thu Jun 1 03:57:04 EDT 2017


2017-06-01 9:38 GMT+02:00 Larry Hastings <larry at hastings.org>:
> When CPython's small block allocator was first merged in late February 2001,
> it allocated memory in gigantic chunks it called "arenas".  These arenas
> were a massive 256 KILOBYTES apiece.

The arena size defines the strict minimum memory usage of Python: with
256 kB arenas, the smallest possible memory usage is 256 kB.

> What would be the result of making the arena size 4mb?

A minimum memory usage of 4 MB. Since memory is requested from the
operating system in whole arenas, it also means that if you allocate 4
MB + 1 byte, Python will take 8 MB from the operating system.
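
To spell out the arithmetic: usage is rounded up to the next multiple
of the arena size. A minimal C illustration (with ARENA_SIZE set to
the hypothetical 4 MB discussed above):

    #include <stdio.h>

    /* With a fixed arena size, the allocator must request memory from
     * the OS in whole arenas, so usage rounds up to the next multiple
     * of ARENA_SIZE. */
    #define ARENA_SIZE ((size_t)4 * 1024 * 1024)   /* 4 MB */

    static size_t arenas_needed(size_t bytes)
    {
        return (bytes + ARENA_SIZE - 1) / ARENA_SIZE;  /* ceiling division */
    }

    int main(void)
    {
        size_t request = (size_t)4 * 1024 * 1024 + 1;  /* 4 MB + 1 byte */
        printf("%zu bytes -> %zu arenas -> %zu bytes from the OS\n",
               request, arenas_needed(request),
               arenas_needed(request) * ARENA_SIZE);
        /* prints: 4194305 bytes -> 2 arenas -> 8388608 bytes from the OS */
        return 0;
    }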

The GNU libc malloc uses a variable threshold to choose between sbrk()
(heap memory) and mmap(). It starts at 128 kB and is then adjusted
dynamically to the workload: when a block that was served by mmap() is
freed, the threshold is raised towards that block's size (up to a
maximum).
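
For reference, glibc exposes this threshold through mallopt(); a
minimal example (note that setting M_MMAP_THRESHOLD explicitly also
turns off the dynamic adjustment):

    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Serve requests of 256 kB or more via mmap() instead of sbrk(). */
        mallopt(M_MMAP_THRESHOLD, 256 * 1024);
        void *p = malloc(512 * 1024);   /* this block comes from mmap() */
        free(p);
        return 0;
    }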

I would prefer an adaptive arena size. For example, start at 256 kB
and then double the arena size as memory usage grows. The problem is
that pymalloc is currently designed around a fixed arena size; I have
no idea how hard it would be to make the size vary per allocated
arena.
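
Something like this, in a rough sketch (all names below are invented,
and pymalloc would need to record the size on each arena object
instead of relying on a global ARENA_SIZE):

    /* Hypothetical sketch of an adaptive arena size: start small and
     * double the size of each newly allocated arena up to a cap. */
    #include <stddef.h>

    #define INITIAL_ARENA_SIZE ((size_t)256 * 1024)        /* 256 kB */
    #define MAX_ARENA_SIZE     ((size_t)4 * 1024 * 1024)   /* 4 MB cap */

    static size_t next_arena_size = INITIAL_ARENA_SIZE;

    static size_t
    new_arena_size(void)
    {
        size_t size = next_arena_size;
        if (next_arena_size < MAX_ARENA_SIZE)
            next_arena_size *= 2;   /* grow as memory usage grows */
        return size;                /* each arena records its own size */
    }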

I have read that CPUs support "large pages" of 2 MB up to 1 GB,
instead of just 4 kB. Using large pages can have a significant impact
on performance. I don't know whether we can do something to help the
Linux kernel use large pages for our memory, nor how we would do it
:-) Maybe using mmap() with sizes and alignments closer to large pages
would help Linux coalesce them into a large page? (Linux has
transparent huge pages, a mechanism to make applications use large
pages automatically.)
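
One concrete thing we could try, sketched below: allocate arenas with
mmap() in 2 MB units and pass the MADV_HUGEPAGE hint to madvise().
This is Linux-specific and only a hint, not a guarantee, and it is not
what pymalloc does today:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Sketch: ask Linux to back an anonymous mapping with transparent
     * huge pages.  size should be a multiple of 2 MB. */
    static void *
    alloc_arena_hugepage(size_t size)
    {
        void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED)
            return NULL;
    #ifdef MADV_HUGEPAGE
        madvise(addr, size, MADV_HUGEPAGE);  /* hint: use huge pages */
    #endif
        return addr;
    }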

More generally: I'm strongly in favor of making our memory allocator
more efficient :-D

When I wrote my tracemalloc PEP 454, I measured that Python calls
malloc(), realloc() or free() 270,000 times per second on average when
running the Python test suite:
https://www.python.org/dev/peps/pep-0454/#log-calls-to-the-memory-allocator
(I don't recall now whether it was really malloc() or PyObject_Malloc(),
but either way, we do a lot of memory allocations and deallocations ;-))
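
For what it's worth, the PEP 445 allocator API makes this kind of
counting easy to reproduce today. Here is a sketch that wraps the
object-domain allocator with counting hooks (the hook names are mine,
only the PyMem_GetAllocator/PyMem_SetAllocator API is real):

    #include <Python.h>

    static size_t n_calls = 0;
    static PyMemAllocatorEx orig;

    static void *count_malloc(void *ctx, size_t size)
    { (void)ctx; n_calls++; return orig.malloc(orig.ctx, size); }

    static void *count_calloc(void *ctx, size_t nelem, size_t elsize)
    { (void)ctx; n_calls++; return orig.calloc(orig.ctx, nelem, elsize); }

    static void *count_realloc(void *ctx, void *ptr, size_t size)
    { (void)ctx; n_calls++; return orig.realloc(orig.ctx, ptr, size); }

    static void count_free(void *ctx, void *ptr)
    { (void)ctx; n_calls++; orig.free(orig.ctx, ptr); }

    static void install_counter(void)
    {
        /* Field order: ctx, malloc, calloc, realloc, free. */
        PyMemAllocatorEx hook = {NULL, count_malloc, count_calloc,
                                 count_realloc, count_free};
        PyMem_GetAllocator(PYMEM_DOMAIN_OBJ, &orig);
        PyMem_SetAllocator(PYMEM_DOMAIN_OBJ, &hook);
    }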

When I analyzed the timeline of CPython master performance, I was
surprised to see that my change making PyMem_Malloc() use pymalloc was
one of the most significant "optimizations" of Python 3.6!
http://pyperformance.readthedocs.io/cpython_results_2017.html#pymalloc-allocator

CPython performance depends heavily on the performance of our memory
allocator, or at least on the performance of pymalloc (the specialized
allocator for blocks <= 512 bytes).
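
For context, here is the size-class logic, simplified from
Objects/obmalloc.c: requests are grouped into 8-byte size classes, and
anything above 512 bytes falls through to the system allocator.

    #include <stddef.h>

    #define ALIGNMENT_SHIFT 3            /* 8-byte alignment */
    #define SMALL_REQUEST_THRESHOLD 512

    /* Does pymalloc handle this request at all? */
    static int small_request(size_t nbytes)
    {
        return nbytes > 0 && nbytes <= SMALL_REQUEST_THRESHOLD;
    }

    /* Map a request size to its size class, e.g. 1..8 -> 0, 9..16 -> 1. */
    static size_t size_class_index(size_t nbytes)
    {
        return (nbytes - 1) >> ALIGNMENT_SHIFT;
    }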

By the way, Naoki INADA also proposed a different idea:

"Global freepool: Many types has it’s own freepool. Sharing freepool
can increase memory and cache efficiency. Add PyMem_FastFree(void*
ptr, size_t size) to store memory block to freepool, and PyMem_Malloc
can check global freepool first."
http://faster-cpython.readthedocs.io/cpython37.html
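
To make the proposal concrete, here is a hypothetical sketch.
PyMem_FastFree() does not exist in CPython today; the per-size-class
free lists below are invented for illustration only:

    #include <stddef.h>

    #define NCLASSES 64   /* 8-byte size classes up to 512 bytes */

    typedef struct freeblock { struct freeblock *next; } freeblock;
    static freeblock *freepool[NCLASSES];

    /* The caller passes the block size, so no per-block header is
     * needed to find the right free list. */
    void PyMem_FastFree(void *ptr, size_t size)
    {
        size_t i = (size - 1) >> 3;
        if (i < NCLASSES) {
            freeblock *b = (freeblock *)ptr;
            b->next = freepool[i];   /* push onto the per-class list */
            freepool[i] = b;
        }
        /* else: fall back to the regular free path (omitted here) */
    }

    /* PyMem_Malloc would first pop from freepool[(size - 1) >> 3]
     * when that list is non-empty. */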

IMHO it's worth investigating this change as well.

Victor

