[Numpy-discussion] How to make "lazy" derived arrays in a recarray view of a memmap on large files

Kim Hansen slaunger at gmail.com
Fri Jan 16 06:02:09 EST 2009


Hi numpy forum

I need to efficiently handle some large (300 MB) record-like binary
files, where some data fields are smaller than a byte and thus cannot be
mapped directly to a field in a record dtype.

I would like to be able to access these derived arrays in a
memory-efficient manner, but I cannot figure out how to achieve this.

I would never operate on an entire derived array at once; rather, I
would iterate over some selected elements and do something with them -
operations which seem well suited for evaluation on demand.

I wrote a related post yesterday, which I have not received any
response to. I am now posting again with another, perhaps clearer,
example which I believe describes my problem spot on.

from numpy import *

# Python.exe memory use here: 8.14 MB
desc = dtype([("baconandeggs", "<u1"), ("spam", "<u1"), ("parrots", "<u1")])
index = memmap("g:/id-2008-10-25-17-ver4.idx", dtype=desc, mode="r").view(recarray)
# The index file is very large: it contains 292 MB of data
# Python.exe memory use: 8.16 MB, only 20 kB extra for the memmap viewed as a recarray

# The following eagerly evaluated operation takes a few secs, working on 3*10^8 elements
# How can I derive a new array in a lazy/on-demand/memmap-like manner?
index.bacon = index.baconandeggs >> 4
# Python.exe memory use: 595 MB! Not surprising, but how can I do better?

# Another derived array, which is also resource demanding
index.eggs = index.baconandeggs & 0x0F
# Python.exe memory use is now 731 MB!
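
For comparison, applying the same bit operations to just a slice only
materializes that slice, which is exactly the kind of on-demand
behaviour I am after (a small example, with made-up variable names):

# Operating on a slice of the memmapped field only reads and transforms
# those 1000 records, so memory use stays negligible
chunk = index.baconandeggs[1000000:1001000]
bacon_chunk = chunk >> 4    # high nibble
eggs_chunk = chunk & 0x0F   # low nibble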

What I'd like to do is implement a class, LazyBaconEggsSpamParrots,
which encapsulates the derived arrays, such that I could do:

besp = LazyBaconEggsSpamParrots("baconeggsspamparrots.idx")
for b in besp.bacon:  # Iterate lazily
    spam(b)
# Only derive the 1000 needed elements, not the entire array
dosomething(besp.bacon[1000000:1001000])

I envision the class would look something like this:

class LazyBaconEggsSpamParrots(object):

    def __init__(self, filename):
        desc = dtype([("baconandeggs", "<u1"),
                      ("spam", "<u1"),
                      ("parrots", "<u1")])
        self._data = memmap(filename, dtype=desc, mode='r').view(recarray)
        # Expose the one-to-one data directly
        self.spam = self._data.spam
        self.parrots = self._data.parrots
        # This would work but costs way too much memory:
        # self.bacon = self._data.baconandeggs >> 4
        # self.eggs = self._data.baconandeggs & 0x0F

    def __getattr__(self, attr_name):
        if attr_name == "bacon":
            pass  # return bacon in an on-demand manner, but how?
        elif attr_name == "eggs":
            pass  # return eggs in an on-demand manner, but how?
        else:
            # If the name is not a data attribute, treat it as a normal
            # non-existing attribute - raise AttributeError
            raise AttributeError(attr_name)

But how do I do the lazy part of it?
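
The closest I have come is a small wrapper along the lines below (just a
rough sketch with made-up names - I am not at all sure this is the numpy
way of doing it), where the bit operation is only applied to the
elements that are actually indexed or iterated over:

class _LazyField(object):
    """Sketch: apply func to the memmapped source array only on access."""

    def __init__(self, source, func):
        self._source = source  # e.g. the memmapped baconandeggs field
        self._func = func      # e.g. lambda a: a >> 4

    def __len__(self):
        return len(self._source)

    def __getitem__(self, key):
        # Only the requested element or slice is read from disk and transformed
        return self._func(self._source[key])

    def __iter__(self):
        # Iterate in chunks so only a small buffer is in memory at any time
        chunksize = 65536
        for start in range(0, len(self._source), chunksize):
            for value in self._func(self._source[start:start + chunksize]):
                yield value

and then in __getattr__ something like:

        if attr_name == "bacon":
            return _LazyField(self._data.baconandeggs, lambda a: a >> 4)
        elif attr_name == "eggs":
            return _LazyField(self._data.baconandeggs, lambda a: a & 0x0F)

This keeps memory use low, but such a wrapper only behaves array-like
for indexing and iteration, so I wonder whether there is a cleaner or
more idiomatic way to get this on-demand behaviour?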

-- Kim


