[Numpy-discussion] Loading a > GB file into array

Charles R Harris charlesr.harris at gmail.com
Fri Dec 21 11:11:51 EST 2007


On Dec 21, 2007 6:45 AM, David Cournapeau <david at ar.media.kyoto-u.ac.jp>
wrote:

> Hans Meine wrote:
> > On Friday, 21 December 2007 at 13:23:49, David Cournapeau wrote:
> >
> >>> Instead of saying "memmap is ALL about disk access" I would rather
> >>> say that "memmap is all about SMART disk access" -- what I mean
> >>> is that memmap should run as fast as a normal ndarray when it works on
> >>> the cached part of an array.  Maybe there is a way of telling memmap
> >>> when and what to cache, and when to sync that cache to the disk.
> >>> In other words, memmap should perform just like an in-physical-memory
> >>> array -- only that it once in a while saves/loads to/from the disk.
> >>> Or is this just wishful thinking?
> >>> Is there a way of "pre-loading" a given part into the cache
> >>> (physical memory), or of preventing disk writes at "bad times"?
> >>> How about doing the sync from a different thread ;-)
> >>>
> >> mmap uses the OS IO caches; that's kind of the point of using mmap
> >> (at least in this case). Instead of doing the caching yourself, the OS
> >> does it for you, and the OS is supposed to be smart about this :)
> >>
> >
> > AFAICS this is what Sebastian wanted to say, but as the OP indicated,
> > preloading, e.g. by reading the whole array once, did not work for him.
> > Thus, I understand Sebastian's question as "is it possible to help the
> > OS when it is not smart enough?"  Maybe something along the lines of
> > mlock, only not quite as aggressive.
> >
> I don't know exactly why it did not work, but it is not difficult to
> imagine why it could fail (when you read a 2 GB file, it may not be
> smart on average to put the whole file in the buffer, since everything
> else gets kicked out). It all depends on the situation, but many
> different things can influence this behaviour: the IO scheduler,
> how smart the VM is, the FS (on Linux, some filesystems are better than
> others for RT audio DSP, and some options are better left out), etc. On
> Linux, using the deadline IO scheduler can help, for example (it is the
> recommended scheduler for IO-intensive musical applications).
>
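[Editor's note: the page-cache behaviour being discussed can be sketched with numpy.memmap; the file, dtype, and sizes below are illustrative stand-ins for the multi-GB file, not anything from the thread.]

```python
import os
import tempfile

import numpy as np

# A small file standing in for the multi-GB data file (sizes are hypothetical).
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(1_000_000, dtype=np.float64).tofile(path)

# Map the file instead of reading it: pages are faulted in on first access,
# so only the slices actually touched compete for space in the OS page cache.
mm = np.memmap(path, dtype=np.float64, mode="r")
chunk = mm[:1024]          # touches only the first few pages of the file
print(float(chunk.sum()))  # sum of 0..1023
```

Accessing `mm[:1024]` faults in only those pages, which is why a memmap over cached data can run at in-memory speed while cold regions still cost a disk read.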
<snip>

>
> But if what you want is to reliably read "in real time" a
> big file which cannot fit in memory, then you need a design where
> something does the disk buffering the way you want (again, taking the
> example I am somewhat familiar with: in audio processing, you often have
> an IO thread which does the pre-caching and hands the data in mlock'ed
> buffers to another thread, the one which is RT).
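[Editor's note: the IO-thread pre-caching pattern David describes might be sketched with a bounded queue as below; the frame size and in-memory "file" are illustrative, and the mlock step is omitted since it needs platform-specific calls.]

```python
import io
import queue
import threading

import numpy as np

FRAME = 4096  # bytes per frame; illustrative size


def reader(f, out_q):
    """IO thread: read fixed-size frames ahead of the consumer."""
    while True:
        buf = f.read(FRAME)
        if not buf:
            out_q.put(None)  # sentinel: end of stream
            return
        out_q.put(np.frombuffer(buf, dtype=np.uint8))


src = io.BytesIO(bytes(range(256)) * 64)  # stand-in for the big file
q = queue.Queue(maxsize=8)                # bounded: caps pre-cached frames
t = threading.Thread(target=reader, args=(src, q), daemon=True)
t.start()

total = 0
while (frame := q.get()) is not None:
    total += int(frame.sum())             # the "real-time" consumer's work
print(total)
```

The bounded queue is what keeps the pre-cache from growing without limit: the IO thread blocks once `maxsize` frames are buffered ahead of the consumer.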


IIRC, Martin really wanted something like streaming IO broken up into
smaller frames, with previously cached results ideally discarded.
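[Editor's note: that streaming pattern, fixed-size frames with nothing retained between them, might look like the sketch below; the file, frame length, and running sum are illustrative.]

```python
import os
import tempfile

import numpy as np

# Stand-in for the big file (sizes are hypothetical).
path = os.path.join(tempfile.mkdtemp(), "big.bin")
np.arange(100_000, dtype=np.float64).tofile(path)

frame_len = 10_000  # elements per frame; tune to the available RAM
total = 0.0
with open(path, "rb") as f:
    while True:
        frame = np.fromfile(f, dtype=np.float64, count=frame_len)
        if frame.size == 0:
            break
        total += float(frame.sum())  # reduce, then let the frame be freed
print(total)
```

Each frame is reduced and dropped before the next read, so peak memory stays at one frame regardless of file size, and the OS is free to evict the already-consumed pages.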

Chuck

