Alignment assumptions

A quick grep-find through the Python-2.2 sources reveals the following:

    Include/dictobject.h:49: long aligner;
    Include/objimpl.h:275: double dummy; /* force worst-case alignment */
    Modules/addrinfo.h:162: LONG_LONG __ss_align; /* force desired structure storage alignment */
    Modules/addrinfo.h:164: double __ss_align; /* force desired structure storage alignment */

At first glance, there appear to be different assumptions at work here about what constitutes maximal alignment on any given platform.

I've been using a little C++ metaprogram to find a type which will properly align any other given type. Because of limitations of one compiler, I had to disable the computation and instead used the objimpl.h assumption that double was maximally aligned, but also added a compile-time assertion to check that the alignment is always greater than or equal to that of the target type.

Well, it failed today on Tru64 Unix with the latest compaq CXX 6.5 prerelease compiler; it appears that the alignment of long double is greater than that of double on that platform.

I thought someone might want to know,
Dave

David Abrahams
C++ Booster (http://www.boost.org)
Pythonista (http://www.python.org)
resume: http://users.rcn.com/abrahams/resume.html
email: david.abrahams@rcn.com
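For readers who want to see the kind of check described above, here is a minimal sketch in C rather than the C++ metaprogram from the message, which is not shown in the thread. It measures alignment with the classic offsetof probe and uses a pre-C11 compile-time assertion. Every name in it (ALIGN_OF, COMPILE_TIME_ASSERT, union aligner, and the helper functions) is an assumption made for illustration, not something taken from Python or Boost.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* offsetof */

/* Declare a probe struct for type T. The offset of `member` is the
   alignment the compiler requires for T when it follows a lone char. */
#define DECLARE_ALIGN_PROBE(tag, T) struct align_probe_##tag { char c; T member; }
#define ALIGN_OF(tag) offsetof(struct align_probe_##tag, member)

/* Pre-C11 compile-time assertion: a negative array size is a compile error. */
#define COMPILE_TIME_ASSERT(name, cond) typedef char name[(cond) ? 1 : -1]

DECLARE_ALIGN_PROBE(dbl, double);
DECLARE_ALIGN_PROBE(ldbl, long double);

/* Hypothetical aligner type: a union is at least as aligned as each member. */
union aligner { double d; long double ld; };
DECLARE_ALIGN_PROBE(aligner, union aligner);

/* These hold by construction; asserting double >= long double instead is
   the check that reportedly failed on Tru64. */
COMPILE_TIME_ASSERT(aligner_covers_double, ALIGN_OF(aligner) >= ALIGN_OF(dbl));
COMPILE_TIME_ASSERT(aligner_covers_long_double, ALIGN_OF(aligner) >= ALIGN_OF(ldbl));

/* Nonzero if double alone would under-align long double on this platform. */
int double_underaligns_long_double(void)
{
    return ALIGN_OF(dbl) < ALIGN_OF(ldbl);
}

/* Run-time restatement: the aligner's alignment is a multiple of each target's. */
void check_aligner(void)
{
    assert(ALIGN_OF(aligner) % ALIGN_OF(dbl) == 0);
    assert(ALIGN_OF(aligner) % ALIGN_OF(ldbl) == 0);
}
```

Calling double_underaligns_long_double() on a given platform tells you whether the "double is maximally aligned" assumption would have been caught by such an assertion there.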

[David Abrahams]
A quick grep-find through the Python-2.2 sources reveals the following:
Include/dictobject.h:49: long aligner;
This is in

    #ifdef USE_CACHE_ALIGNED
    long aligner;
    #endif

and AFAIK nobody ever defines the symbol. It's a cache-line optimization gimmick, but is effectively a nop (except to waste memory) on "almost all" machines. IIRC, the author never measured any improvement by using it (not surprising, since I believe almost all mallocs at least 8-byte align now). I vote we delete it.
Include/objimpl.h:275: double dummy; /* force worst-case alignment */
One branch of a union, forces enough padding in the gc header so that whatever follows the gc header is "aligned enough". This is sufficient for all core gc types, but may not be sufficient for user-defined gc types. I'm happy enough to view it as a restriction on what user-defined gc'able types can contain.
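The padding trick described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the objimpl.h source: the type name gc_head_sketch, its fields, and both helper functions are hypothetical, and the dummy member uses long double (the change proposed later in the thread).

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* offsetof, NULL */

/* Hypothetical stand-in for a gc header: list links plus a counter.
   The second union branch is never used for storage; it only raises
   the alignment (and therefore the size) of the whole header. */
typedef union gc_head_sketch {
    struct {
        union gc_head_sketch *next;
        union gc_head_sketch *prev;
        int refs;
    } gc;
    long double dummy;  /* force worst-case alignment among standard types */
} gc_head_sketch;

/* Address of the object stored immediately after the header. */
void *object_after_header(gc_head_sketch *h)
{
    assert(h != NULL);
    return (void *)(h + 1);
}

/* Nonzero if the header size is a multiple of long double's alignment,
   so an object placed right after an aligned header stays aligned. */
int header_size_preserves_alignment(void)
{
    struct probe { char c; long double member; };
    return sizeof(gc_head_sketch) % offsetof(struct probe, member) == 0;
}
```

Because the size of a union is always a multiple of its alignment, header_size_preserves_alignment() should return nonzero on any conforming compiler. A type stricter than every union member (such as the _subpage extension mentioned later) would still not be covered.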
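Modules/addrinfo.h:162: LONG_LONG __ss_align; /* force desired structure storage alignment */
Modules/addrinfo.h:164: double __ss_align; /* force desired structure storage alignment */
This isn't our code (it's imported from the WIDE project), and I have no idea what it thinks it's trying to accomplish (neither the mystery padding, nor really much of anything else in the WIDE code!).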
At first glance, there appear to be different assumptions at work here about what constitutes maximal alignment on any given platform.
Only the objimpl.h trick might benefit from maximal alignment.
If you ever compile on a KSR machine, you'll discover there's no std C type that captures maximal alignment. You'd have to guess it's an extension type named "_subpage". I'm not sure that even C++ template metaprogramming could manage that bit of channeling <wink> (FYI, _subpage required 128-byte alignment).

Stupid trick: If you can compute this at run time, do malloc(1) a few times, count the number of trailing 0 bits in the returned addresses, and take the minimum. Since malloc has to return memory "suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects", you'd soon discover you always got at least 7 trailing zero bits back from KSR malloc(), and presumably at least 4 under Tru64.

there's-the-standard-and-then-there's-real-life<wink>-ly y'rs - tim
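The "stupid trick" can be sketched in C along these lines. The function names and the 64-sample cap are assumptions for illustration. The result is only an empirical observation about one allocator on one run, not a guarantee, and many modern allocators hand out less-aligned blocks for tiny requests than they do for large ones.

```c
#include <assert.h>  /* assert */
#include <stdint.h>  /* uintptr_t */
#include <stdlib.h>  /* malloc, free */

/* Count the trailing zero bits of a nonzero address value. */
static unsigned trailing_zero_bits(uintptr_t v)
{
    unsigned n = 0;
    assert(v != 0);
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Call malloc(1) up to `samples` times (capped at 64) and return the
   minimum number of trailing zero bits seen in the returned addresses.
   Returns -1 if samples < 1 or if the first allocation fails. */
int guess_malloc_alignment_bits(int samples)
{
    enum { MAX_SAMPLES = 64 };
    void *blocks[MAX_SAMPLES];
    int i, count = 0, min_bits = -1;

    if (samples < 1)
        return -1;
    if (samples > MAX_SAMPLES)
        samples = MAX_SAMPLES;

    for (i = 0; i < samples; i++) {
        void *p = malloc(1);
        int bits;
        if (p == NULL)
            break;
        blocks[count++] = p;  /* keep blocks alive so addresses differ */
        bits = (int)trailing_zero_bits((uintptr_t)p);
        if (min_bits < 0 || bits < min_bits)
            min_bits = bits;
    }
    for (i = 0; i < count; i++)
        free(blocks[i]);
    return min_bits;
}
```

A result of n suggests the allocator aligns small blocks to 2^n bytes, so 7 would correspond to the 128-byte subpages described above.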

----- Original Message ----- From: "Tim Peters" <tim.one@comcast.net>
As I read the code, it affects all types (doesn't this header begin every object, regardless of its GC flags?) and I think that's a very unhappy circumstance for your numeric community. Remember, the type that raised the alarm here was just a long double.
I'm not actually after maximal alignment; I look for a minimally-sized/aligned type whose alignment is a multiple of the target type's alignment. In any case, I was just using the assumption that double was maximally aligned since I was linking with Python code and the EDG front-end was too slow to handle the metaprogram -- I figured that if the assumption was good enough for Python and my clients were depending on it anyway, it was good enough for my code (not!).
I was aware that this was a theoretical possibility, but not that it was a practical one. What's KSR?
Nope; we can only look through a list of likely candidates to try to find a match. We're hoping to address this for the next standard -- I'm pushing for allowing non-POD types in unions, leaving construction/destruction up to the user.
(FYI, _subpage required 128-byte alignment).
I guess that strictly speaking, requiring maximal alignment wouldn't be appropriate for objimpl ;-)
Sounds like a good candidate for your autoconf script. Seriously, though, I think it would be reasonable to stick to aligning the standard builtin types, in which case you can do the test without calling malloc, FWIW.
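One way to do that test without malloc, sketched under the assumption that the C89 builtin types listed below are the only ones of interest (long long and compiler extension types are deliberately left out). The union, the macro, and the function names are all hypothetical.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* offsetof, size_t */

/* One member per standard builtin type considered here. */
union builtin_types {
    char c; short s; int i; long l;
    float f; double d; long double ld;
    void *p; void (*fp)(void);
};

struct builtin_align_probe { char c; union builtin_types u; };

/* The union is as strictly aligned as its strictest member, so this
   compile-time constant is the maximum alignment over the list above. */
#define BUILTIN_MAX_ALIGN offsetof(struct builtin_align_probe, u)

size_t builtin_max_alignment(void)
{
    return BUILTIN_MAX_ALIGN;
}

/* Round n up to the next multiple of BUILTIN_MAX_ALIGN, e.g. to size a header. */
size_t round_up_to_builtin_align(size_t n)
{
    size_t a = BUILTIN_MAX_ALIGN;
    assert((a & (a - 1)) == 0);  /* relies on the alignment being a power of two */
    return (n + a - 1) & ~(a - 1);
}
```

Since BUILTIN_MAX_ALIGN is an integer constant expression, it could also feed a configure-time check or a compile-time assertion instead of a run-time call.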
there's-the-standard-and-then-there's-real-life<wink>-ly y'rs - tim
in-theory-theory-and-practice-are-the-same-and-to-hell-with-what-happens-in-practice-ly y'rs -Dave

[Jack, skip to the end please] [David Abrahams, on Include/objimpl.h:275: double dummy; /* force worst-case alignment */ ]
As I read the code, it affects all types (doesn't this header begin every object, regardless of its GC flags?)
Nope, only objects that go through _PyObject_GC_Malloc(). It could be a nightmare if, e.g., every string and int object consumed another (at least) 12 bytes.
The *Python* numeric community is far more likely to embed a float than a long double, and in any case seems unlikely to build a container type mixing long double with PyObject* members (i.e., one that ought to participate in cyclic gc). I expect we have a blind spot towards long double in general since Python doesn't expose or use such a thing, all the developers run on platforms where (as far as they know <wink>) it's the same as a double, and "long double" was introduced after K&R (so some old-timers likely aren't even aware C89 introduced it). But I'll change the code here to use long double instead -- it's harmless, as it doesn't make a lick of difference on any platform that matters <0.7 wink>.
Only the objimpl.h trick might benefit from maximal alignment.
Well, nobody has complained yet, but the core never needs alignment stricter than double, and (as above) an extension type that both did and needed to participate in GC is unlikely.
and my clients were depending on it anyway, it was good enough for my code (not!).
One of the secrets to Python's success is that we tell unreasonable users to go away and bother the C++ committee instead. [128-byte alignment needed for KSR's _subpage type]
I was aware that this was a theoretical possibility, but not that it was a practical one. What's KSR?
Kendall Square Research, my (and Tani's, Tamah's and Steve Breit's) employer before Dragon. The address space was carved into 128-byte "subpages", and the hardware supported Python-style (non-owned non-reentrant) locks directly on a per-subpage basis (Python's lock.acquire() and lock.release() were one machine instruction each!). Subpages were also the unit for cache coherency across processors. So use of _subpage in our system code, and in speed-obsessed app code, was ubiquitous. I guess the main thing KSR proved was that you can't stay in business designing custom hardware to execute Python's semantics directly <wink>.
I checked this in:

    long double dummy; /* force worst-case alignment */

[Guido, on

    #ifdef USE_CACHE_ALIGNED
    long aligner;
    #endif
]
The malloc 8-byte align argument doesn't apply, since this struct is used in an array.
I was composing email while asleep <wink>. Gotcha.
Jack, do you still want this? fighting-code-rot-ly y'rs - tim

----- Original Message ----- From: "Tim Peters" <tim.one@comcast.net>
Oh! I guess I should explicitly avoid _PyObject_GC_Malloc() unless I'm supporting GC, then. As you can see, there's a lot of basic stuff I still don't understand.
OK, I get it. I'm still not clear on what happens by default, but I was under the mistaken impression that some types get GC support "automatically" and thus that people would be subject to undesired alignment problems without explicitly choosing them.
Just for the record, I didn't twist your arm about this (only the ends of your moustache).
Makes sense. And I guess because this is 'C', hacking in the appropriate alignment if such a type ever arose wouldn't be that hard.
That explains everything, thank you (especially the loving relationship we have with our lusers)!
I guess the main thing KSR proved was that you can't stay in business designing custom hardware to execute Python's semantics directly <wink>.
/Please/ tell me you weren't trying to build a parallel Python machine <5.99wink>.

The malloc 8-byte align argument doesn't apply, since this struct is used in an array. Since the struct itself doesn't require alignment beyond 4 bytes, the array entries can be 12 bytes apart. So I don't think this is a nop -- I think it would waste 4 bytes per hash table entry on most machines.

This was added by Jack Jansen ages ago -- I think he did measure a speedup on an old Mac compiler, or he wouldn't have added it, and I bet there was a #define USE_CACHE_ALIGNED in his config.h then. But that's all history; I agree it should be deleted.

--Guido van Rossum (home page: http://www.python.org/~guido/)
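The size argument above can be checked with a small sketch. The two structs are hypothetical stand-ins for a hash table entry with and without the extra member, not copies of dictobject.h. On a 32-bit platform with 4-byte long and pointers they would come out at 12 and 16 bytes; on other platforms the numbers differ but the extra member still costs at least sizeof(long) per entry.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* size_t */

/* Hypothetical hash table entries, without and with the aligner member. */
struct entry_plain  { long hash; void *key; void *value; };
struct entry_padded { long hash; void *key; void *value; long aligner; };

/* Extra bytes each array element costs once the aligner is added. */
size_t wasted_bytes_per_entry(void)
{
    return sizeof(struct entry_padded) - sizeof(struct entry_plain);
}

/* Distance in bytes between consecutive entries of an array. */
size_t array_stride_plain(void)
{
    static struct entry_plain table[2];
    return (size_t)((char *)&table[1] - (char *)&table[0]);
}

/* The stride of an array is the element size, independent of what
   alignment malloc gives to the start of the array. */
void check_entry_layout(void)
{
    assert(array_stride_plain() == sizeof(struct entry_plain));
    assert(wasted_bytes_per_entry() >= sizeof(long));
}
```

This is why the allocator's 8-byte alignment says nothing about the entries after the first: only the element size determines where they fall relative to cache lines.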

On Thursday, February 28, 2002, at 07:57, Tim Peters wrote:
MacPython uses it. At the time it was put in, it caused a 15% increase in Pystones because dictionary entries were aligned in cache lines. But this was in the PPC 601 and 604 era, and I must say that I've never tested whether it made any difference on G3 and G4. Put in a bug report in my name, and one day I'll get around to testing whether it still makes a difference on current hardware and rip it out if it doesn't.

--
Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma Goldman

participants (4): David Abrahams, Guido van Rossum, Jack Jansen, Tim Peters