Populating huge data structures from disk
Chris Mellon
arkanes at gmail.com
Tue Nov 6 14:04:57 EST 2007
On Nov 6, 2007 12:18 PM, Michael Bacarella <mbac at gpshopper.com> wrote:
>
> For various reasons I need to cache about 8GB of data from disk into core on
> application startup.
>
Are you sure? On PC hardware, at least, doing this doesn't guarantee
that accessing it is actually going to be any faster. Is just
mmap()ing the file a problem for some reason?
I assume you're on a 64 bit machine.
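For reference, a minimal sketch of the mmap approach; the filename is made
up and it assumes the 8GB is already laid out as one flat file of 8-byte
records:

import mmap

f = open('cache.bin', 'rb')                             # hypothetical data file
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)   # map the whole file
# Nothing is read up front; the OS pages data in lazily as you touch it.
record = m[0:8]                                         # one 8-byte record, on demand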
> Building this cache takes nearly 2 hours on modern hardware. I am surprised
> to discover that the bottleneck here is CPU.
>
> The reason this is surprising is because I expect something like this to be
> very fast:
>
> #!python
>
> import array
> a = array.array('L')
> f = open('/dev/zero','r')
> while True:
>     a.fromstring(f.read(8))
>
This just keeps appending to the same array, forever. Is this
really the code you meant to write? I don't know why you'd expect an
infinite loop to be "fast"...
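If the intent was just to load a file into an array, a loop that actually
terminates would look more like this (filename and chunk size are made up):

import array

a = array.array('L')
f = open('cache.bin', 'rb')          # hypothetical data file
while True:
    chunk = f.read(1024 * 1024)      # 1MB at a time instead of 8 bytes
    if not chunk:                    # empty string means end of file
        break
    a.fromstring(chunk)              # chunk size must be a multiple of a.itemsize
f.close()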
>
> Profiling this application shows all of the time is spent inside
> a.fromstring.
>
Obviously, because that's all that's inside your while True loop.
There's nothing else that it could spend time on.
> Little difference if I use list instead of array.
>
> Is there anything I could tell the Python runtime to help it run this
> pathologically slanted case faster?
>
This code executes in a couple of seconds for me (size reduced to fit in
my 32-bit memory space):
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> s = '\x00' * ((1024 **3)/2)
>>> len(s)
536870912
>>> a = array.array('L')
>>> a.fromstring(s)
>>>
You might also want to look at array.fromfile(), which reads items straight
from a file object without building an intermediate string.
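Something along these lines should do the whole load in one call (the
filename is made up; fromfile() raises EOFError if fewer items are available
than requested):

import array, os

f = open('cache.bin', 'rb')          # hypothetical data file
a = array.array('L')
# read every 'L'-sized item straight from the file object
a.fromfile(f, os.path.getsize('cache.bin') // a.itemsize)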