[Numpy-discussion] Aligning an array on Windows

24 May 2007

Hi,

Some time ago I sped up PyTables' version of numexpr so as to
accelerate operations on unaligned arrays (objects that appear quite
commonly when dealing with columns of recarrays, as PyTables does).

This improvement has proven to work correctly and flawlessly on
Linux machines (using GCC 4.x, on both 32-bit and 64-bit Linux boxes)
during several weeks of intensive testing.  Moreover, the speed-up
ranges from 40% on modern processors up to 70% on older ones,
so I'd like to keep it.

The surprise came today when I tried to compile the same code on a
Windows box (Win XP Service Pack 2) using MSVC 7.1, through the free (as
in beer) Toolkit 2003.  The compilation process went fine, but the
problem is that I'm getting crashes from time to time when running the
numexpr test suite.

After some in-depth investigation, I'm pretty sure that the problem is
in a specific part of the code that I modified for this improvement.
IMO, the affected code is in numexpr/interp_body.c and reads like:

    case OP_COPY_II: VEC_ARG1(memcpy(dest, x1, sizeof(int));
                              dest += sizeof(int); x1 += stride1);
    case OP_COPY_LL: VEC_ARG1(memcpy(dest, x1, sizeof(long long));
                              dest += sizeof(long long); x1 += stride1);
    case OP_COPY_FF: VEC_ARG1(memcpy(dest, x1, sizeof(double));
                              dest += sizeof(double); x1 += stride1);
    case OP_COPY_CC: VEC_ARG1(memcpy(dest, x1, sizeof(double)*2);
                              dest += sizeof(double)*2; x1 += stride1);

This might seem complicated, but it is not.  Each OP_COPY_XX case
copies the source (x1) to the destination (dest) for the int,
long long, double and complex data types (it is executed in a loop to
copy all the data in the array).  The code in the original numexpr
reads:

    case OP_COPY_BB: VEC_ARG1(b_dest = b1);
    case OP_COPY_II: VEC_ARG1(i_dest = i1);
    case OP_COPY_FF: VEC_ARG1(f_dest = f1);
    case OP_COPY_CC: VEC_ARG1(cr_dest = c1r;
                              ci_dest = c1i);

i.e. the copy is done through direct assignment.  This can be done
because, in the original numexpr, an array is always guaranteed to reach
this part of the code (the computing kernel) in aligned form.  But
in my code this is not guaranteed (the copy is made precisely for
alignment purposes), which is why I need to use memcpy/memmove
calls.
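
To make the idea concrete outside of the numexpr macros, here is a
minimal sketch of the same technique: gathering doubles that sit at
arbitrary (possibly unaligned) byte offsets into a contiguous, aligned
buffer with memcpy.  The helper name copy_strided_double and its
parameters are hypothetical, chosen only for illustration; this is not
the numexpr code itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of numexpr): copy n doubles from a
 * strided, possibly unaligned source into a contiguous destination.
 * src is a byte pointer so that stride can be any number of bytes. */
static void copy_strided_double(double *dest, const char *src,
                                size_t stride, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        /* memcpy makes no alignment assumption about src, unlike
         * dereferencing a (double *) cast of the same address. */
        memcpy(&dest[i], src, sizeof(double));
        src += stride;
    }
}
```

With packed records made of one char followed by one double, one would
call it as copy_strided_double(out, records + 1, 1 + sizeof(double), n),
so that every source double starts at an odd offset.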

The thing I don't see is why my version of the code can create problems
on Windows platforms and work perfectly on Linux ones.  I've tried to
use memmove instead of memcpy, but the problem persists.

I've had a look at how numpy makes an 'aligned' copy of an unaligned
array, and it seems to me that it uses memcpy/memmove (not sure when you
should use one or the other) just as I use it above.
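
For reference, the C standard separates the two calls by overlap only:
memcpy requires that source and destination do not overlap, while
memmove behaves correctly when they do.  Neither call places alignment
requirements on its arguments.  The small sketch below (the helper
shift_right is hypothetical, for illustration only) shows a case where
the regions overlap and memmove is therefore the one to use; when dest
and x1 point to separate buffers, as they appear to in the copy code
above, either call should be acceptable.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the first n bytes of buf to start at
 * offset shift.  Source and destination overlap whenever shift < n,
 * so memmove is required; memcpy would be undefined behaviour here.
 * The caller must ensure buf holds at least n + shift bytes. */
static void shift_right(char *buf, size_t n, size_t shift)
{
    memmove(buf + shift, buf, n);
}
```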

The problem might be somewhere else, but my tests reinforce my
suspicion that something is wrong with the copy code above (but
again, I can't see what).

Of course, we can get rid of this optimization, but it is a bit
depressing to have to give it up just because it doesn't work on
Windows :(

Thanks in advance for any hint you may provide!

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth