[Numpy-discussion] object array alignment issues

Fri Oct 16 06:07:10 EDT 2009

On Thursday 15 October 2009 19:00:04 Charles R Harris wrote:
> > So, how to fix this?
> >
> > One obvious workaround is for users to pass "align=True" to the dtype
> > constructor.  This works if the dtype descriptor is a dictionary or
> > comma-separated string.  Is there a reason it couldn't be made to work
> > with the string-of-tuples form that I'm missing?  It would be marginally
> > more convenient from my application, but that's just a finesse issue.
> >
> > However, perhaps we should try to fix the underlying alignment
> > problems?  Unfortunately, it's not clear to me how to resolve them
> > without at least some performance penalty.  You either do an alignment
> > check of the pointer, and then memcpy if unaligned, or just always use
> > memcpy.  Not sure which is faster, as memcpy may have a fast path
> > already. These are object arrays anyway, so there's plenty of overhead
> > already, and I don't think this would affect regular numerical arrays.

The answer is clear: avoid memcpy() if you can.  It is true that memcpy() 
performance has improved quite a lot in recent gcc versions (and it has been 
quite good in Windows versions for many years), but working with data in-place 
(i.e. avoiding a memory copy) is always faster, especially for large arrays 
that don't fit in the processor caches.

My own experiments show that, on an Intel Core2 processor, the typical 
speed-up from avoiding memcpy() is about 2x.  I have also read somewhere that 
both AMD and Intel are trying to make unaligned operations even faster in 
upcoming architectures (the goal being that there should be no speed 
difference between accessing aligned and unaligned data).

> I believe the memcpy approach is used for other unaligned parts of void
> types. There is an inherent performance penalty there, but I don't see how
> it can be avoided when using what are essentially packed structures. As to
> memcpy, its performance seems to depend on the compiler/compiler version,
> old versions of gcc had *horrible* implementations of memcpy. I believe the
> situation has since improved. However, I'm not sure we should be coding to
> compiler issues unless it is unavoidable or the gain is huge.

IMO, NumPy can be improved for unaligned data handling.  For example, Numexpr 
is using this small snippet:

from cpuinfo import cpu
if cpu.is_AMD() or cpu.is_Intel():
    is_cpu_amd_intel = True
else:
    is_cpu_amd_intel = False

for detecting AMD/Intel architectures and allowing the code to avoid memcpy() 
calls for the unaligned arrays.
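
To illustrate, here is a minimal sketch of how such a flag could be used from 
Python.  This is not Numexpr's actual code: the helper name prepare_operand() 
and its cpu_handles_unaligned argument are made up for this example, and it 
assumes that np.require() with the 'ALIGNED' requirement is an acceptable way 
to make the fallback copy:

```python
import numpy as np


def prepare_operand(arr, cpu_handles_unaligned):
    """Return an array that is safe to read element by element.

    `cpu_handles_unaligned` plays the role of `is_cpu_amd_intel` above
    (hypothetical parameter, for illustration only).
    """
    if arr.flags.aligned or cpu_handles_unaligned:
        # Work in place: no memory copy at all.
        return arr
    # Fallback for other architectures: np.require() copies the data
    # into a freshly allocated, aligned buffer.
    return np.require(arr, requirements=['ALIGNED'])


# A packed record (1-byte field followed by a double) puts 'x' at offset 1,
# so the field view is normally reported as unaligned.
records = np.zeros(4, dtype={'names': ['tag', 'x'], 'formats': ['u1', 'f8']})
records['x'] = [1.0, 2.0, 3.0, 4.0]
view = records['x']

in_place = prepare_operand(view, cpu_handles_unaligned=True)
copied = prepare_operand(view, cpu_handles_unaligned=False)
print(view.flags.aligned, in_place is view, copied.flags.aligned)
```

The point of the design is that the copy only happens when the array is 
unaligned *and* the CPU is not known to cope with unaligned access.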

The above code uses the excellent ``cpuinfo.py`` module from Pearu Peterson, 
which is distributed with NumPy, so it should not be too difficult to take 
advantage of this for avoiding unnecessary copies in this scenario.
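
Coming back to the align=True workaround quoted at the top, a small sketch of 
what it changes may be useful.  It uses the dictionary form of the descriptor 
mentioned in the quoted message; the field names 'flag' and 'obj' are just 
placeholders, and the exact offsets depend on the platform's pointer 
alignment:

```python
import numpy as np

# Hypothetical record: a 1-byte flag followed by a Python object pointer.
spec = {'names': ['flag', 'obj'], 'formats': ['u1', 'O']}

packed = np.dtype(spec)               # default: fields packed back to back
aligned = np.dtype(spec, align=True)  # pad fields as a C compiler would

# Alignment that the platform requires for an object pointer.
obj_alignment = np.dtype('O').alignment

# dtype.fields maps each name to a (dtype, byte offset) pair.
packed_offset = packed.fields['obj'][1]
aligned_offset = aligned.fields['obj'][1]

print("packed: ", packed_offset, packed.itemsize)
print("aligned:", aligned_offset, aligned.itemsize)
```

With the packed layout the object pointer starts right after the 1-byte flag, 
which is what makes the field unaligned; with align=True it is moved to a 
multiple of its alignment, at the cost of some padding per record.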

-- 
Francesc Alted