Tim Peters wrote:
[Vladimir 'Yu' Stepanov]
* To adapt block allocation to other alignments. Alignment is currently hard-wired to 8 bytes; as a variant, 4-byte alignment could be used, and the value could be set at interpreter startup through command-line arguments/environment variables/etc. In my tests, the difference in speed between 4- and 8-byte alignment was not noticeable.
[Martin v. Löwis]
That depends on the hardware you use, of course. Some architectures absolutely cannot stand mis-aligned accesses, and programs will just crash if they try to perform such accesses.
Note that we _had_ to add a goofy "long double" to the PyGC_Head union because some Python platform required 8-byte alignment for some types ... see rev 25454. Any spelling of malloc() also needs to return memory aligned for any legitimate purpose, and 8-byte alignment is the strictest requirement we know of across current Python platforms.

Could you give a reference to those tests? I have run some tests of my own, which show a very small difference in speed between 4- and 8-byte alignment (8% at most, and mostly less than one percent).
My tests used sequential traversal of the elements in one chunk of memory. I understand that a further test with randomly chosen addresses is still needed, to minimize the effect of the processor cache. And certainly not all platforms will show the same result (I shall try not to forget to attach the test's source code and its output :) ). On the other hand, the reduction in the memory consumed by each object could win back the same percentage of speed that is lost by moving the alignment from 8 bytes down to 4 :) . Besides, some objects, starting with int, could then probably drop their own extra layers on top of the allocator.
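A minimal sketch of the sequential-vs-random traversal idea described above (my own illustration, not the attached test; the names `traverse` and `time_traversal` are mine):

```python
import array
import random
import time

def traverse(data, order):
    """Sum the elements of `data`, visiting them in the given index order."""
    total = 0
    for i in order:
        total += data[i]
    return total

def time_traversal(n=1 << 16):
    """Time sequential vs shuffled traversal of one chunk of memory.

    A shuffled visiting order defeats the hardware prefetcher, so the
    measured cost is closer to raw access latency than to the best case
    the cache can arrange for a linear scan.
    """
    data = array.array("l", range(n))
    seq = list(range(n))
    rnd = seq[:]
    random.shuffle(rnd)

    t0 = time.perf_counter()
    traverse(data, seq)
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    traverse(data, rnd)
    t_rnd = time.perf_counter() - t0
    return t_seq, t_rnd
```

Absolute timings vary widely per machine, so no particular ratio should be read into a single run.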
So Python should err on the safe side, and only use a smaller alignment when it is known not to hurt.
OTOH, I don't see the *advantage* in reducing the alignment.
It could cut wasted bytes. There is no per-object memory overhead in a release-build obmalloc: the allocatable part of a pool is entirely used for user-visible object memory, except when the alignment requirement ends up forcing unused (by both the user and by obmalloc) pad bytes. For example, Python ints consume 12 bytes on 32-bit boxes, but if space were allocated for them by obmalloc (it's not), obmalloc would have to use 16 bytes per int to preserve 8-byte alignment.

A good point.
OTOH, obmalloc (unlike PyGC_Head) has always used 8-byte alignment, because that's what worked best for Vladimir Marangozov during his extensive timing tests. It's not an isolated decision -- e.g., it's also influenced by, and influences, "the best" pool size, and (of course) cutting alignment to 4 would double the number of "size classes" obmalloc has to juggle.

Yes, the number of size classes would really be doubled. But that by itself should not affect either the performance of the system or the total amount of memory consumed. The change in memory fragmentation cannot be predicted, since it depends on the concrete application.
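To make the numbers concrete, here is a small model (my own sketch, not obmalloc code) of how the alignment choice drives both per-object padding and the number of size classes up to the 256-byte limit:

```python
def round_up(n, alignment):
    """Round a request of n bytes up to a multiple of `alignment`."""
    return (n + alignment - 1) & ~(alignment - 1)

def num_size_classes(limit, alignment):
    """Distinct block sizes an obmalloc-style allocator must manage."""
    return limit // alignment

# Tim's example: a 12-byte int on a 32-bit box.
assert round_up(12, 8) == 16   # 4 pad bytes per object at 8-byte alignment
assert round_up(12, 4) == 12   # no pad at 4-byte alignment

# Halving the alignment doubles the number of size classes to juggle.
assert num_size_classes(256, 8) == 32
assert num_size_classes(256, 4) == 64
```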
* To expand the maximal size that can be allocated by means of this module. Now the maximal size is 256 bytes.
Right. This is somewhat deliberate, though; the expectation is that fragmentation will increase dramatically if even larger size classes are supported.
It's entirely deliberate ;-) obmalloc is in no way trying to be a general-purpose allocator. It was designed specifically for the common memory patterns in Python, aiming at large numbers of small objects of the same size. That's extremely common in Python, and allocations larger than 256 bytes are not. The maximum size was actually 64 bytes at first. After that, dicts grew an embedded 8-element table, which vastly increased the size of the dict struct. obmalloc's limit was boosted to 256 then, although it could have stopped at the size of a dict (in the rough ballpark of 150 bytes). There was no reason (then or now) to go beyond 256. See below.
* At sizes close to the maximum, the allocation of blocks becomes inefficient. For example, for block sizes of 248 and 256 bytes (blocksize), the number of elements per page (PAGESIZE/blocksize) is 16 in both cases. I.e. the same page layout could be used for 248-byte blocks as for 256-byte blocks.
Good idea; that shouldn't be too difficult to implement: for sizes above 215, pools need to be spaced apart only at 16 bytes.
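The sharing opportunity is easy to see by computing blocks-per-page for each size class. This is a sketch under the assumption of a 4096-byte page with no header (the real pool layout also reserves space for a pool header):

```python
PAGESIZE = 4096  # assumed page size; real pools also lose space to a header

def blocks_per_page(blocksize):
    """How many blocks of the given size fit in one page."""
    return PAGESIZE // blocksize

# Sizes 248 and 256 both fit exactly 16 blocks per page, so a 248-byte
# size class gains no capacity over the 256-byte one.
assert blocks_per_page(248) == 16
assert blocks_per_page(256) == 16

# Group the 8-byte-aligned size classes up to 256 by their block count,
# exposing every pair that could share the same page layout.
groups = {}
for size in range(8, 257, 8):
    groups.setdefault(blocks_per_page(size), []).append(size)
```

Printing `groups` shows that the sharing only kicks in near the top of the size range; small size classes all have distinct block counts.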
I'd rather drop the limit to 215 then <0.3 wink>. Allocations that large probably still aren't important to obmalloc, but it is important that finding a requested allocation's "size index" be as cheap as possible. Uniformity helps that.

The available general-purpose allocator really does not suit these overall aims. And the speed can "blink" under the existing non-uniform allocation paths: in some cases the allocation routines in `obmalloc.c' act as mere wrapper functions around the standard malloc/free, which does not raise speed :) .
On modern boxes, it may be desirable to boost POOL_SIZE -- nobody has run timing experiments as comprehensive as Vladimir ran way back when, and there's no particular reason to believe that the same set of tradeoffs would look best on current architectures.

As to speed, it would be no worse than in the existing systems of memory allocation. Naturally, there are no pluses without some minuses; in my opinion, the available minuses can be made insignificant.

For an example, let us assume that two pieces of memory are allocated: one by `obmalloc.c', the other by the standard malloc. Both pieces are small enough to be handled by similar methods. On a number of OSes, malloc will lose in speed for multi-threaded programs because of the global lock taken on allocation. Certainly this does not apply to all operating systems.

The Python language has a clearly expressed specificity in its memory allocation, which can be used to advantage by designing a memory-allocation system directly for it.
The properties inherent in the Python language:

* When working in multi-threaded mode there is already the GIL lock, which can also be used for memory allocation.
* Objects that are used often are better fitted to their real size rather than to log2(size(ob)), since realloc is carried out extremely seldom, or even less often :) .
* Objects whose allocated size is a multiple of 512, 1024, or 2048 are extremely rare, since the Python object header, together with the parameters describing the object's properties, is placed in front of the data.

The first two items also apply to the present `obmalloc.c', but only for a narrow range of allocated block sizes.
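The third point can be illustrated with a toy model; the 16-byte header here is a hypothetical figure of my own, since real object header sizes vary by type and build:

```python
HEADER = 16  # hypothetical per-object header size, not a CPython constant

def allocated_size(payload):
    """Total allocation request: header plus the object's own data."""
    return HEADER + payload

# Because the header is added to every payload, the total request lands
# on a large power of two only for a few specific payload sizes.
big_powers = {512, 1024, 2048}
hits = [p for p in range(1, 2048) if allocated_size(p) in big_powers]
assert hits == [496, 1008, 2032]   # only 3 payload sizes out of 2047
```

Under this model, size classes tuned to real object sizes waste far less space than classes rounded up to powers of two.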
Some interpreters use a memory-allocation system of their own. Tuning the memory-management system is not the least important step toward speeding up Python.
Thanks to everyone who has answered. The topic is as interesting as ever.