[Numpy-discussion] first recarray steps

Vincent Schut schut at sarvision.nl
Thu May 22 03:35:01 EDT 2008


Anne Archibald wrote:
> 2008/5/21 Vincent Schut <schut at sarvision.nl>:
>> Christopher Barker wrote:
>>> Also, if your image data is rgb, usually, that's a (width, height, 3)
>>> array: rgbrgbrgbrgb... in memory. If you have a (3, width, height)
>>> array, then that's rrrrrrr....gggggggg......bbbbbbbb. Some image libs
>>> may give you that, I'm not sure.
>> My data is. In fact, this is a simplification of my situation; I'm
>> processing satellite data, which usually has more (and other) bands than
>> just rgb. But the data is definitely in shape (bands, y, x).
> 
> You may find your life becomes easier if you transpose the data in
> memory. This can make a big difference to efficiency. Years ago I was
> working with enormous (by the standards of the day) MATLAB files on
> disk, storing complex data. The way (that version of) MATLAB
> represented complex data was the way you describe: matrix of real
> parts, matrix of imaginary parts. This meant that to draw a single
> pixel, the disk needed to seek twice... depending on what sort of
> operations you're doing, transposing your data so that each pixel is
> all in one place may improve cache coherency as well as making the use
> of record arrays possible.
> 
> Anne

Anne, thanks for the thoughts. In most cases, you'll probably be right.
In this case, however, it won't give me much (if any) speedup, and maybe
even a slowdown. Satellite images are often stored on disk in a
band-sequential manner. The library I use for IO is GDAL, which is a highly
optimized C library for reading/writing almost any kind of satellite
data type. It also features an internal caching mechanism. And it gives 
me my data as (bands, y, x).
I'm not reading single pixels anyway. The amounts of data I have to 
process (enormous, even by the standards of today ;-)) require me to do 
this in chunks, in parallel, even on different cores/CPUs/computers.
Every chunk usually is (chunkYSize, chunkXSize, allBands) with xsize and 
ysize being not so small (think from 64^2 to 1024^2) so that pretty much 
eliminates any performance issues regarding the data on disk. 
Furthermore, having to process on multiple computers forces me to have
my data on networked storage. The latency and transfer rate of the
network will probably cancel out any small speedup gained from my drive
having to do fewer seeks...
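
To make the chunking concrete, here is a minimal sketch of how such
windows could be generated. The function name iter_chunks and the image
and chunk sizes are made up for illustration; the actual reading through
GDAL and the distribution over cores or machines are left out.

```python
def iter_chunks(y_size, x_size, chunk_y, chunk_x):
    """Yield (y_off, x_off, rows, cols) windows that tile the image.

    Hypothetical helper: windows at the bottom and right edges are
    clipped so they never extend past the image.
    """
    for y_off in range(0, y_size, chunk_y):
        rows = min(chunk_y, y_size - y_off)
        for x_off in range(0, x_size, chunk_x):
            cols = min(chunk_x, x_size - x_off)
            yield y_off, x_off, rows, cols


# Example: a 2500 x 3000 pixel image processed in 1024 x 1024 chunks.
windows = list(iter_chunks(2500, 3000, 1024, 1024))
for window in windows:
    print(window)
```

Each window can then be handed to a separate worker, which reads all
bands for just that region.
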
Now for the recarray part, that would indeed ease my life a bit :)
However, having to transpose the data in memory on every read and write
does not sound very attractive. It will waste cycles and memory, and be
asking for bugs. I can live without recarrays, for sure. I only hoped
they might make my life a bit easier and my code a bit more readable,
without too much effort. Well, they won't, apparently... I'll just go on
like I did before this little exercise.
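
For reference, a minimal sketch of the trade-off, assuming a small int16
chunk in (bands, y, x) order. The sizes and the band names b1..b4 are
invented for the example. Moving the band axis to the end is only a
view, but a record array needs the bands of each pixel to be adjacent in
memory, and that is where the copy comes in.

```python
import numpy as np

bands, ysize, xsize = 4, 3, 5

# Band-sequential chunk, shaped (bands, y, x).
data = np.arange(bands * ysize * xsize, dtype=np.int16)
data = data.reshape(bands, ysize, xsize)

# Moving the band axis last only changes strides: no data is copied.
pixels = np.transpose(data, (1, 2, 0))  # shape (y, x, bands)
print(np.shares_memory(pixels, data))

# Making each pixel's bands adjacent in memory requires a real copy.
contiguous = np.ascontiguousarray(pixels)
print(np.shares_memory(contiguous, data))

# Hypothetical band names; one int16 field per band.
pixel_dtype = np.dtype([("b%d" % (i + 1), np.int16) for i in range(bands)])

# Reinterpret the last axis as one record per pixel, then drop that axis.
records = contiguous.view(pixel_dtype).reshape(ysize, xsize)
records = records.view(np.recarray)

# Field access by attribute: band 2 at row 1, column 2.
print(records.b2[1, 2], data[1, 1, 2])
```

So the readable records.b2 style access is available, but only after
the copy made by np.ascontiguousarray, on every read and again in the
other direction on every write.
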

Thanks all for the inputs.

Cheers,
Vincent.



