[Numpy-discussion] How to make "lazy" derived arrays in a recarray view of a memmap on large files
Kim Hansen
slaunger at gmail.com
Fri Jan 16 06:02:09 EST 2009
Hi numpy forum
I need to efficiently handle some large (300 MB) record-like binary
files, where some data fields are smaller than a byte and thus cannot be
mapped directly in a record dtype.
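To make the sub-byte layout concrete, here is a minimal sketch on a tiny in-memory array. The three byte values are made up for illustration; the split into a high nibble (bacon) and a low nibble (eggs) is the one used in the example further down.

```python
import numpy as np

# Hypothetical packed bytes: high 4 bits hold "bacon", low 4 bits hold "eggs".
packed = np.array([0x12, 0xAB, 0xF0], dtype=np.uint8)

bacon = packed >> 4    # keep the upper nibble of each byte
eggs = packed & 0x0F   # keep the lower nibble of each byte
```

A record dtype can only describe whole bytes, so both fields have to be derived from the same `<u1` column like this.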
I would like to be able to access these derived arrays in a
memory-efficient manner, but I cannot figure out how to achieve this.
My application would never operate on the entire derived array, but
rather iterate over some selected elements and do something with each
of them. Such operations seem well suited to being done on demand.
I wrote a related post yesterday, which I have not received any
response to. I am now posting again with another, perhaps clearer,
example which I believe describes my problem spot on:
from numpy import *
# Python.exe memory use here: 8.14 MB
desc = dtype([("baconandeggs", "<u1"), ("spam","<u1"), ("parrots","<u1")])
index = memmap("g:/id-2008-10-25-17-ver4.idx", dtype=desc,
               mode="r").view(recarray)
# The index file is very large, contains 292 MB of data
# Python.exe memory use: 8.16 MB, only 20 kB extra for memmap mapped to recarray
# The following instant operation takes a few secs working on 3*10^8 elements
# How can I derive a new array in a lazy/on-demand/memmap manner?
index.bacon = index.baconandeggs >> 4
# python.exe memory use: 595 MB! Not surprising but how to do better??
# Another derived array, which is resource demanding
index.eggs = index.baconandeggs & 0x0F
# python.exe memory usage is now 731 MB!
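For comparison, a minimal sketch of slicing the memmap first and deriving afterwards, so that only the selected window is ever shifted. The stand-in file, its contents, and the names `path`, `records`, `window` and `bacon_window` are assumptions made up so the snippet can run on its own; it is not a measurement on the real 292 MB index.

```python
import os
import tempfile

import numpy as np

desc = np.dtype([("baconandeggs", "<u1"), ("spam", "<u1"), ("parrots", "<u1")])

# Small stand-in file with made-up data so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.idx")
records = np.zeros(10000, dtype=desc)
records["baconandeggs"] = np.arange(10000) % 256
records.tofile(path)

index = np.memmap(path, dtype=desc, mode="r")

# Slicing a memmap field gives a view onto the file, not a copy.
window = index["baconandeggs"][5000:6000]

# Only this shift allocates new memory, and only for 1000 elements.
bacon_window = window >> 4
```

The point is just the order of operations: index first, then apply the bit arithmetic to the small result.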
What I'd like to do is implement a class, LazyBaconEggsSpamParrots,
which encapsulates the derived arrays such that I could do
besp = LazyBaconEggsSpamParrots("baconeggsspamparrots.idx")
for b in besp.bacon:  # Iterate lazily
    spam(b)
#Only derive the 1000 needed elements, don't do all 1000000
dosomething(besp.bacon[1000000:1001000])
I envision the class would look something like this:
class LazyBaconEggsSpamParrots(object):
    def __init__(self, filename):
        desc = dtype([("baconandeggs", "<u1"),
                      ("spam", "<u1"),
                      ("parrots", "<u1")])
        self._data = memmap(filename, dtype=desc, mode='r').view(recarray)
        # Expose the one-to-one data directly
        self.spam = self._data.spam
        self.parrots = self._data.parrots
        # This would work but costs way too much memory
        # self.bacon = self._data.baconandeggs >> 4
        # self.eggs = self._data.baconandeggs & 0x0F
    def __getattr__(self, attr_name):
        if attr_name == "bacon":
            # return bacon in an on-demand manner, but how?
            raise NotImplementedError  # placeholder for the missing lazy part
        elif attr_name == "eggs":
            # return eggs in an on-demand manner, but how?
            raise NotImplementedError  # placeholder for the missing lazy part
        else:
            # If the name is not a data attribute, treat it as a normal
            # non-existing attribute and raise AttributeError
            raise AttributeError(attr_name)
But how do I do the lazy part of it?
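One direction that might work, as a rough sketch only: wrap the packed column in a small proxy object that applies the bit arithmetic inside `__getitem__` and iterates chunk by chunk. The helper `LazyDerived`, its `chunk` parameter, the module-level `DESC` and the tiny stand-in file are all hypothetical names introduced for illustration; none of this is an existing numpy feature.

```python
import os
import tempfile

import numpy as np

DESC = np.dtype([("baconandeggs", "<u1"), ("spam", "<u1"), ("parrots", "<u1")])


class LazyDerived(object):
    """Hypothetical proxy: applies func to the source only when indexed."""

    def __init__(self, source, func, chunk=65536):
        self._source = source
        self._func = func
        self._chunk = chunk

    def __len__(self):
        return len(self._source)

    def __getitem__(self, key):
        # Index the (memmapped) source first, then derive just that selection.
        return self._func(self._source[key])

    def __iter__(self):
        # Derive one chunk at a time so memory use stays bounded.
        for start in range(0, len(self._source), self._chunk):
            for value in self._func(self._source[start:start + self._chunk]):
                yield value


class LazyBaconEggsSpamParrots(object):
    def __init__(self, filename):
        self._data = np.memmap(filename, dtype=DESC, mode="r")
        # One-to-one fields are plain views onto the file.
        self.spam = self._data["spam"]
        self.parrots = self._data["parrots"]
        # Derived fields are cheap proxies; nothing is computed here.
        packed = self._data["baconandeggs"]
        self.bacon = LazyDerived(packed, lambda a: a >> 4)
        self.eggs = LazyDerived(packed, lambda a: a & 0x0F)


# Small stand-in file with made-up data so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "baconeggsspamparrots.idx")
demo = np.zeros(5000, dtype=DESC)
demo["baconandeggs"] = np.arange(5000) % 256
demo.tofile(path)

besp = LazyBaconEggsSpamParrots(path)
some_bacon = besp.bacon[1000:1010]        # derives only these 10 elements
total = sum(int(b) for b in besp.bacon)   # iterates chunk by chunk
```

Because the proxies cost almost nothing to create, they can be set in `__init__` and `__getattr__` is not needed at all. The price is that `besp.bacon` is not a real ndarray, so whole-array operations would have to go through explicit slicing.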
-- Kim