Populating huge data structures from disk

Chris Mellon arkanes at gmail.com
Tue Nov 6 23:43:38 CET 2007


On Nov 6, 2007 3:42 PM, Michael Bacarella <mbac at gpshopper.com> wrote:
>
> > Note that you're not doing the same thing at all. You're
> > pre-allocating the array in the C code, but not in Python (and I don't
> > think you can). Is there some reason you're growing an 8 gig array 8
> > bytes at a time?
> >
> > They spend about the same amount of time in system, but Python spends 4.7x
> > as much
> > CPU in userland as C does.
> >
> > Python has to grow the array. It's possible that this is tripping a
> > degenerate case in the gc behavior also (I don't know if array uses
> > PyObjects for its internal buffer), and if it is you'll see an
> > improvement by disabling GC.
>
> That does explain why it's consuming 4.7x as much CPU.
>
> > > x = lengthy_number_crunching()
> > > magic.save_mmap("/important-data")
> > >
> > > and in the application do...
> > >
> > > x = magic.mmap("/important-data")
> > > magic.mlock("/important-data")
> > >
> > > and once the mlock finishes bringing important-data into RAM, at
> > > the speed of your disk I/O subsystem, all accesses to x will be
> > > hits against RAM.
> >
> > You've basically described what mmap does, as far as I can tell. Have
> > you tried just mmapping the file?
>
> Yes, that would be why my fantasy functions have 'mmap' in their names.
>
> However, in C you can mmap arbitrarily complex data structures

Well, for certain limited values of "arbitrary", but okay. It's true
that you can cast pointers into the mmapped region into, say, pointers
to a struct of structs.

The Python equivalent would be to have your Python classes be wrappers
around access to this memory buffer, calculating the offset needed to
get any particular field. ctypes.Structure is probably a good starting
point if you want to implement this. Sadly, it looks like mmap.mmap()
doesn't expose the address of its buffer so you'll either need to use
the C mmap (via ctypes, probably) or use array.array to load the
bytes and use its buffer_info() to get the address to load your
structs from.
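
A rough sketch of the array.array route, to make the idea concrete.
The Point layout and the point_at() helper are made up for
illustration, and the array literal stands in for bytes you would
really load from disk. Treat it as a starting point under those
assumptions, not a finished design.

```python
import ctypes
from array import array


class Point(ctypes.Structure):
    # Hypothetical record layout: two C doubles, 16 bytes per record.
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]


# Stand-in for data loaded from disk: three records stored back to back.
raw = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# buffer_info() returns (memory address, number of items).
address, length = raw.buffer_info()


def point_at(index):
    """Return a Point that views record `index` in place (no copy)."""
    offset = index * ctypes.sizeof(Point)
    last_valid = length * raw.itemsize - ctypes.sizeof(Point)
    if not 0 <= offset <= last_valid:
        raise IndexError(index)
    return Point.from_address(address + offset)


p = point_at(1)
second = (p.x, p.y)
p.x = 30.0  # writes through to the underlying array
```

The structs are only views, so the array has to stay alive and must
not be resized while any of them are in use; a resize can move the
buffer and leave the old address dangling.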

> whereas
> in Python all you can mmap without transformations is an array or a string.

Read "array or string" as "stream of bytes" in this context.
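
To illustrate the "stream of bytes" view: a small sketch that maps a
file and decodes one fixed-size record straight out of the mapping
with the struct module. The record format (an int32 id followed by a
double) and the scratch file are assumptions made up for the example.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record format: little-endian int32 id followed by a double.
RECORD = struct.Struct("<id")

# Write a small sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(3):
        f.write(RECORD.pack(i, i * 1.5))

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Decode record 2 directly from the mapped bytes.
        record = RECORD.unpack_from(mm, 2 * RECORD.size)
    finally:
        mm.close()

os.remove(path)
```

Only the pages that are actually touched get read from disk, but each
access still pays for the unpack into Python objects.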

> I didn't say this earlier, but I do need to pull more than arrays
> and strings into RAM.  Not being able to pre-allocate storage is a big
> loser for this approach.
>

It is a little annoying that there's no way to pre-allocate an array.
It doesn't over-allocate, either, so building on a few bytes at a time
is pretty much worst case behavior.
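
For comparison, a sketch of the two loading patterns side by side:
appending one 8-byte item at a time versus asking array.fromfile()
for the whole item count in a single call. The scratch file and the
small N are placeholders for the example, and frombytes() is the
Python 3 spelling of the method.

```python
import os
import tempfile
from array import array

N = 1000  # hypothetical item count; the real case would be far larger

# Write N doubles to a scratch file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    array("d", range(N)).tofile(f)

# Slow pattern: grow the array one 8-byte item at a time.
slow = array("d")
with open(path, "rb") as f:
    while True:
        chunk = f.read(8)
        if not chunk:
            break
        slow.frombytes(chunk)

# Bulk pattern: work out the item count from the file size,
# then read everything in one call.
fast = array("d")
count = os.path.getsize(path) // fast.itemsize
with open(path, "rb") as f:
    fast.fromfile(f, count)

os.remove(path)
```

Both end up with the same contents; the difference is how many times
the array has to be extended along the way.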


