Re: [Numpy-discussion] Loading a > GB file into array

Dec. 21, 2007


On Dec 21, 2007 6:45 AM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
...
Hans Meine wrote:
...
On Friday, 21 December 2007 13:23:49, David Cournapeau wrote:
...
...
Instead of saying "memmap is ALL about disk access" I would rather
say that "memmap is all about SMART disk access" -- what I mean
is that memmap should run as fast as a normal ndarray if it works on
the cached part of an array.  Maybe there is a way of telling memmap
when and what to cache, and when to sync that cache to the disk.
In other words, memmap should perform just like an in-physical-memory
array -- only that it once in a while saves/loads to/from the disk.
Or is this just wishful thinking?
Is there a way of "preloading" a given part into cache
(physical memory), or of preventing disk writes at "bad times"?
How about doing the sync from a different thread ;-)
mmap uses the OS IO caches; that's kind of the point of using mmap
(at least in this case). Instead of doing the caching yourself, the OS
does it for you, and OSes are supposed to be smart about this :)
AFAICS this is what Sebastian wanted to say, but as the OP indicated,
preloading e.g. by reading the whole array once did not work for him.
Thus, I understand Sebastian's questions as "is it possible to help the
OS when it is not smart enough?".  Maybe something along the lines of
mlock, only not quite as aggressive.
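
One hint of that kind is madvise(MADV_WILLNEED), which asks the kernel to
start reading a range of a mapping without pinning it the way mlock does.
The following is only a minimal sketch using Python's standard mmap module
(not numpy.memmap). The helper names advise_willneed and read_window are
hypothetical, mmap.madvise needs Python 3.8+ and a platform that provides
it, and the kernel is free to ignore the hint.

```python
import mmap


def advise_willneed(mm, start, length):
    """Hint that [start, start + length) will be read soon.

    Returns True if the hint was issued, False if it is unavailable
    or the OS rejected it. This is advice only, not a guarantee.
    """
    if not (hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED")):
        return False  # no madvise support on this platform/Python version
    # madvise wants a page-aligned start: round down and widen the range.
    page = mmap.PAGESIZE
    aligned = (start // page) * page
    try:
        mm.madvise(mmap.MADV_WILLNEED, aligned, length + (start - aligned))
    except OSError:
        return False
    return True


def read_window(path, start, length):
    """Map the file read-only, hint the window, then return its bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            advise_willneed(mm, start, length)
            return mm[start:start + length]
```

Whether this actually helps depends on the same factors mentioned in this
thread (VM behaviour, IO scheduler, file system), so it is worth measuring
rather than assuming.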
I don't know exactly why it did not work, but it is not difficult to
imagine why it could fail (when you read a 2 GB file, it may not be
smart on average to put the whole file in the buffer, since everything
else is kicked out). It all depends on the situation, but there are many
different things which can influence this behaviour: the IO scheduler,
how smart the VM is, the FS (on Linux, some FS are better than others
for real-time audio DSP, and some options are better left out), etc. On
Linux, using the deadline IO scheduler can help, for example (that's the
recommended scheduler for IO-intensive musical applications).
<snip>
...
But if what you want is to be able to reliably read "in real time" a
big file which cannot fit in memory, then you need a design where
something is doing the disk buffering as you want (again, taking the
example I am somewhat familiar with, in audio processing, you often have
an IO thread which does the pre-caching and passes the data in mlock'ed
buffers to another thread, the one which is RT).
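
A rough sketch of that IO-thread design in plain Python: a background
thread reads fixed-size frames ahead of the consumer into a bounded queue.
The name prefetch_frames and its parameters are hypothetical. It does not
mlock anything and Python threads are not real-time, so this only shows the
structure, not the guarantees of the audio case.

```python
import queue
import threading


def prefetch_frames(path, frame_size, depth=4):
    """Yield frames of up to `frame_size` bytes, read ahead by a thread."""
    frames = queue.Queue(maxsize=depth)  # at most `depth` frames buffered
    done = object()  # sentinel: reader has finished

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(frame_size)
                    if not chunk:
                        break
                    frames.put(chunk)  # blocks while the queue is full
        except Exception as exc:
            frames.put(exc)  # hand IO errors over to the consumer
        finally:
            frames.put(done)

    # Daemon thread: if the consumer stops early, the reader may stay
    # blocked on a full queue until the process exits.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        item = frames.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item  # the consumer can drop each frame once processed
```

Because the queue is bounded, memory use stays near depth * frame_size no
matter how large the file is, and frames already consumed are not kept.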
IIRC, Martin really wanted something like streaming IO broken up into
smaller frames with previously cached results ideally discarded.

Chuck

Charles R Harris