Loading a > GB file into array
I need to load a 1.3GB binary file entirely into a single numpy.uint8 array. I've been using numpy.fromfile(), but for files > 1.2GB on my win32 machine, I get a memory error. Actually, since I have several other python modules imported at the same time, including pygame, I get a "pygame parachute" and a segfault that dumps me out of python:

data = numpy.fromfile(f, numpy.uint8) # where f is the open file

1382400000 items requested but only 0 read
Fatal Python error: (pygame parachute)
Segmentation Fault

If I stick to just doing it at the interpreter with only numpy imported, I can open up files that are roughly 100MB bigger, but any more than that and I get a clean MemoryError. This machine has 2GB of RAM. I've tried setting the /3GB switch on winxp bootup, as well as all the registry suggestions at http://www.msfn.org/board/storage-process-command-t62001.html. No luck. I get the same error in (32bit) ubuntu for a sufficiently big file.

I find that if I load the file in two pieces into two arrays, say 1GB and 0.3GB respectively, I can avoid the memory error. So it seems that it's not that windows can't allocate the memory, just that it can't allocate enough contiguous memory. I'm OK with this, but for indexing convenience, I'd like to be able to treat the two arrays as if they were one. Specifically, this file is movie data, and the array I'd like to get out of this is of shape (nframes, height, width). Right now I'm getting two arrays that are something like (0.8*nframes, height, width) and (0.2*nframes, height, width). Later in the code, I only need to index over the 0th dimension, i.e. the frame index.

I'd like to access all the data using a single range of frame indices. Is there any way to combine these two arrays into what looks like a single array, without having to do any copying within memory? I've tried using numpy.concatenate(), but that gives me a MemoryError because, I presume, it's doing a copy. Would it be better to load the file one frame at a time, generating nframes arrays of shape (height, width), and sticking them consecutively in a python list?

I'm using numpy 1.0.4 (compiled from source tarball with Intel's MKL library) on python 2.5.1 in winxp.

Thanks for any advice,

Martin
On Nov 30, 2007 2:47 AM, Martin Spacek <numpy@mspacek.mm.st> wrote:
I need to load a 1.3GB binary file entirely into a single numpy.uint8 array. I've been using numpy.fromfile(), but for files > 1.2GB on my win32 machine, I get a memory error. Actually, since I have several other python modules imported at the same time, including pygame, I get a "pygame parachute" and a segfault that dumps me out of python:
data = numpy.fromfile(f, numpy.uint8) # where f is the open file
1382400000 items requested but only 0 read
Fatal Python error: (pygame parachute)
Segmentation Fault
You might try numpy.memmap -- others have had success with it for large files (32 bit should be able to handle a 1.3 GB file, AFAIK). See for example: http://www.thescripts.com/forum/thread654599.html

Kurt
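For reference, a minimal sketch of what that might look like for the movie file described above. The file name and the nframes/height/width values here are placeholders, not the actual numbers from this thread:

import numpy as np

# Placeholder values -- substitute the real file name and movie geometry.
fname = 'movie.bin'
nframes, height, width = 4000, 480, 720

# Map the whole file read-only; nothing is read from disk yet.
data = np.memmap(fname, dtype=np.uint8, mode='r')

# Reinterpret the flat byte map as (nframes, height, width) without copying.
frames = data.reshape(nframes, height, width)

frame0 = frames[0]  # touching a frame is what actually pulls its pages off disk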
If I stick to just doing it at the interpreter with only numpy imported, I can open up files that are roughly 100MB bigger, but any more than that and I get a clean MemoryError. This machine has 2GB of RAM. I've tried setting the /3GB switch on winxp bootup, as well as all the registry suggestions at http://www.msfn.org/board/storage-process-command-t62001.html. No luck. I get the same error in (32bit) ubuntu for a sufficiently big file.
I find that if I load the file in two pieces into two arrays, say 1GB and 0.3GB respectively, I can avoid the memory error. So it seems that it's not that windows can't allocate the memory, just that it can't allocate enough contiguous memory. I'm OK with this, but for indexing convenience, I'd like to be able to treat the two arrays as if they were one. Specifically, this file is movie data, and the array I'd like to get out of this is of shape (nframes, height, width). Right now I'm getting two arrays that are something like (0.8*nframes, height, width) and (0.2*nframes, height, width). Later in the code, I only need to index over the 0th dimension, i.e. the frame index.
I'd like to access all the data using a single range of frame indices. Is there any way to combine these two arrays into what looks like a single array, without having to do any copying within memory? I've tried using numpy.concatenate(), but that gives me a MemoryError because, I presume, it's doing a copy. Would it be better to load the file one frame at a time, generating nframes arrays of shape (height, width), and sticking them consecutively in a python list?
I'm using numpy 1.0.4 (compiled from source tarball with Intel's MKL library) on python 2.5.1 in winxp.
Thanks for any advice,
Martin
Kurt Smith wrote:
You might try numpy.memmap -- others have had success with it for large files (32 bit should be able to handle a 1.3 GB file, AFAIK).
Yeah, I looked into numpy.memmap. Two issues with that. I need to eliminate as much disk access as possible while my app is running. I'm displaying stimuli on a screen at 200Hz, so I have up to 5ms for each movie frame to load before it's too late and it drops a frame. I'm sort of faking a realtime OS on windows by setting the process priority really high. Disk access in the middle of that causes frames to drop. So I need to load the whole file into physical RAM, although it need not be contiguous. memmap doesn't do that, it loads on the fly as you index into the array, which drops frames, so that doesn't work for me.

The 2nd problem I had with memmap was that I was getting a WindowsError related to memory:
data = np.memmap(1.3GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\bin\Python25\Lib\site-packages\numpy\core\memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
WindowsError: [Error 8] Not enough storage is available to process this command

This was for the same 1.3GB file. This is different from previous memory errors I mentioned. I don't get this on ubuntu. I can memmap a file up to 2GB on ubuntu no problem, but any larger than that and I get this:
data = np.memmap(2.1GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
OverflowError: cannot fit 'long' into an index-sized integer

The OverflowError is on the bytes argument. If I try doing the mmap.mmap directly in Python, I get the same error. So I guess it's due to me running 32bit ubuntu.

Martin
Martin Spacek wrote:
Kurt Smith wrote:
You might try numpy.memmap -- others have had success with it for large files (32 bit should be able to handle a 1.3 GB file, AFAIK).
Yeah, I looked into numpy.memmap. Two issues with that. I need to eliminate as much disk access as possible while my app is running. I'm displaying stimuli on a screen at 200Hz, so I have up to 5ms for each movie frame to load before it's too late and it drops a frame. I'm sort of faking a realtime OS on windows by setting the process priority really high. Disk access in the middle of that causes frames to drop. So I need to load the whole file into physical RAM, although it need not be contiguous. memmap doesn't do that, it loads on the fly as you index into the array, which drops frames, so that doesn't work for me.

If you want to do it 'properly', it will be difficult, especially in python, especially on windows. This looks really similar to the problem of direct-to-disk recording, that is, recording audio signals from the soundcard onto the hard drive (think recording a concert). The proper design, at least on linux and mac os X, is to have several threads -- one for the IO, one for any computation that does not block on any condition, etc. -- and to use special OS facilities (FIFO scheduling, locking pages into physical RAM, etc.) as well as some special constructs (lock-free ring buffers). This design works relatively well for musical applications, where the data is of the same order of magnitude as what you are talking about, with the same kind of latency requirements (a few ms). This may be overkill for your application, though.
The 2nd problem I had with memmap was that I was getting a WindowsError related to memory:
data = np.memmap(1.3GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\bin\Python25\Lib\site-packages\numpy\core\memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
WindowsError: [Error 8] Not enough storage is available to process this command
This was for the same 1.3GB file. This is different from previous memory errors I mentioned. I don't get this on ubuntu. I can memmap a file up to 2GB on ubuntu no problem, but any larger than that and I get this:
data = np.memmap(2.1GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
OverflowError: cannot fit 'long' into an index-sized integer
The OverflowError is on the bytes argument. If I try doing the mmap.mmap directly in Python, I get the same error. So I guess it's due to me running 32bit ubuntu.
Yes. 32 bits means several things in this context: you have 32 bits for the virtual address space, but part of it is reserved for the kernel. This is configurable on linux, and there is a switch for it on windows. By default, on windows, it is split in half: 2 GB for the kernel, 2 GB for userspace. On linux, it depends on the distribution (on ubuntu, the default seems to be 3 GB for user space, 1 GB for kernel space, judging from the build config). I think part of the problem (the memory error) is related to this, at least on windows.

But the error you get above is easier to understand: an integer is 32 bits, but since it is signed, you cannot address more than 2^31 different locations with it. That's why, with the standard (ANSI C stdlib) functions related to files, you cannot access more than 2 GB; you need a special API for that. In your case, because your value cannot be encoded in a signed 32-bit integer, you get this error, I guess ("index-sized integer" means signed integer, I guess). But even if it succeeded, you would be caught by the above problem: if you only have 2 GB of user space for virtual addressing, I don't think you can do an mmap with a size larger than that, since the whole mapping is done at once (I am not so knowledgeable about OSes, though, so I may be totally wrong on this).

cheers,

David
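To put numbers on the signed 32-bit limit described above (the 2.1 GB value is the file size from the traceback, the 1.3 GB one is the original file):

# A signed 32-bit integer can address at most 2**31 - 1 bytes:
INT32_MAX = 2**31 - 1        # 2147483647, just under 2 GiB

assert 1.3e9 < INT32_MAX     # a 1.3 GB mapping still fits in a signed 32-bit offset
assert 2.1e9 > INT32_MAX     # a 2.1 GB mapping does not -> OverflowError on the bytes argument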
On Dec 1, 2007 12:09 AM, Martin Spacek <numpy@mspacek.mm.st> wrote:
Kurt Smith wrote:
You might try numpy.memmap -- others have had success with it for large files (32 bit should be able to handle a 1.3 GB file, AFAIK).
Yeah, I looked into numpy.memmap. Two issues with that. I need to eliminate as much disk access as possible while my app is running. I'm displaying stimuli on a screen at 200Hz, so I have up to 5ms for each movie frame to load before it's too late and it drops a frame. I'm sort of faking a realtime OS on windows by setting the process priority really high. Disk access in the middle of that causes frames to drop. So I need to load the whole file into physical RAM, although it need not be contiguous. memmap doesn't do that, it loads on the fly as you index into the array, which drops frames, so that doesn't work for me.
The 2nd problem I had with memmap was that I was getting a WindowsError related to memory:
data = np.memmap(1.3GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\bin\Python25\Lib\site-packages\numpy\core\memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
WindowsError: [Error 8] Not enough storage is available to process this command
This was for the same 1.3GB file. This is different from previous memory errors I mentioned. I don't get this on ubuntu. I can memmap a file up to 2GB on ubuntu no problem, but any larger than that and I get this:
data = np.memmap(2.1GBfname, dtype=np.uint8, mode='r')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/numpy/core/memmap.py", line 67, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc)
OverflowError: cannot fit 'long' into an index-sized integer
The OverflowError is on the bytes argument. If I try doing the mmap.mmap directly in Python, I get the same error. So I guess it's due to me running 32bit ubuntu.
Hi,

reading this thread I have two comments.

a) *Displaying* at 200Hz probably makes little sense, since humans would only see about a max. of 30Hz (aka video frame rate). Consequently you would want to separate your data frame rate, which (as I understand) you want to save to disk, and -- asynchronously -- "display as many frames as you can" (I have used pyOpenGL for this with great satisfaction).

b) To my knowledge, any OS -- Linux, Windows or OSX -- can allocate at most about 1GB of data, assuming you have a 32 bit machine. The actual numbers I measured varied from about 700MB to maybe 1.3GB. In other words, you would be right at the limit. (For 64bit, you would have to make sure ALL parts are 64bit, e.g. the python version must be >=2.5, python must have been compiled using a 64-bit compiler *and* the windows version (XP-64).) This holds true the same for physical ram allocation and for memmap allocation. My solution to this was to "wait" for the 64bit .... not tested yet ;-)

Cheers,
Sebastian Haase
Sebastian Haase wrote:
reading this thread I have two comments. a) *Displaying* at 200Hz probably makes little sense, since humans would only see about max. of 30Hz (aka video frame rate). Consequently you would want to separate your data frame rate, that (as I understand) you want to save data to disk and - asynchrounously - "display as many frames as you can" (I have used pyOpenGL for this with great satisfaction)
Hi Sebastian,

Although 30Hz looks pretty good, if you watch a 60fps movie, you can easily tell the difference. It's much smoother. Try recording AVIs on a point and shoot digital camera, if you have one that can do both 30fps and 60fps (like my fairly old Canon SD200). And that's just perception. We're doing neurophysiology, recording from neurons in the visual cortex, which can phase lock to CRT screen rasters up to 100Hz. This is an artifact we don't want to deal with, so we use a 200Hz monitor.

I need to be certain of exactly what's on the monitor on every refresh, i.e. every 5ms, so I run python (with Andrew Straw's package VisionEgg) as a "realtime" priority process in windows on a dual core computer, which lets me reliably update the video frame buffer in time for the next refresh, without having to worry about windows multitasking butting in and stealing CPU cycles for the next 15-20ms. Python runs on one core in "realtime", windows does its junk on the other core.

Right now, every 3rd video refresh (i.e. every 15ms, which is 66.7 Hz, close to the original 60fps the movie was recorded at) I update with a new movie frame. That update needs to happen in less than 5ms, every time. If there's any disk access involved during the update, it inevitably exceeds that time limit, so I have to have it all in RAM before playback begins. Having a second I/O thread running on the second core would be great though.

-- Martin
On Sun, Dec 02, 2007 at 05:22:49PM -0800, Martin Spacek wrote:
so I run python (with Andrew Straw's package VisionEgg) as a "realtime" priority process in windows on a dual core computer, which lets me reliably update the video frame buffer in time for the next refresh, without having to worry about windows multitasking butting in and stealing CPU cycles for the next 15-20ms.
Very interesting. Have you made measurements to see how many times you lost one of your cycles? I made this kind of measurement on Linux using the real-time clock with C and it was very interesting ( http://www.gael-varoquaux.info/computers/real-time ). I want to redo them with Python, as I expect to have similar results with Python. It would be interesting to see how Windows fits in the picture (I know nothing about Windows, so I really can't make measurements on Windows).

Cheers,

Gaël
Gael Varoquaux wrote:
Very interesting. Have you made measurements to see how many times you lost one of your cycles. I made these kind of measurements on Linux using the real-time clock with C and it was very interesting ( http://www.gael-varoquaux.info/computers/real-time ). I want to redo them with Python, as I except to have similar results with Python. It would be interesting to see how Windows fits in the picture (I know nothing about Windows, so I really can't make measurements on Windows).
Neat, thanks for that, I'll have a look. I'm very slowly transitioning my computing "life" over to linux, but I've been told by Andrew Straw (I think http://visionegg.org might have some details, see the mailing list) that it's harder to get close to a real-time OS with linux (while running high level stuff like opengl and python) than it is in windows. I hope that's changed, or is changing. I'd love to switch over to 64-bit linux. As far as windows is concerned, I'd like 32bit winxp to be my last iteration.

Martin
Martin Spacek wrote:
Gael Varoquaux wrote:
Very interesting. Have you made measurements to see how many times you lost one of your cycles. I made these kind of measurements on Linux using the real-time clock with C and it was very interesting ( http://www.gael-varoquaux.info/computers/real-time ). I want to redo them with Python, as I except to have similar results with Python. It would be interesting to see how Windows fits in the picture (I know nothing about Windows, so I really can't make measurements on Windows).
Neat, thanks for that, I'll have a look. I'm very slowly transitioning my computing "life" over to linux, but I've been told by Andrew Straw (I think http://visionegg.org might have some details, see the mailing list) that it's harder to get close to a real-time OS with linux (while running high level stuff like opengl and python) than it is in windows.

My impression is that it is more like the contrary; linux implements many posix facilities for more 'real-time' behaviour: it implements a FIFO scheduler, you have mlock facilities to avoid paging, etc., and of course, you can control your environment much more easily (one buggy driver can kill the whole thing as far as latency is concerned, for example). I did not find the info you are talking about on visionegg, though.
Now, for python, this is a different matter. If you need to do things in real time, setting a high priority is not enough, and python has several characteristics which make it less than suitable for real-time work (heavy usage of memory allocation, garbage collection, etc.). I guess that when you do things every few ms, with enough memory (every ms gives you millions of cycles on modern machines), you can hope that it is not too much of a problem (at least for memory allocation; I could not find in Andrew's slides whether he disabled the GC). But I doubt you can do much better. I wonder if python can be compiled with a real-time memory allocator, and even whether it makes sense at all (I am thinking about something like TLSF: http://rtportal.upv.es/rtmalloc/).
I hope that's changed, or is changing. I'd love to switch over to 64-bit linux. As far as windows is considered, I'd like 32bit winxp to be my last iteration.
With recent kernels, you can get really good latency if you do it right (around 1-2 ms worst case under high load, including high IO pressure). I know nothing about video programming, but I would guess that as far as the kernel is concerned, this does not change much. I have not tried them myself, but ubuntu studio has its own kernel with 'real-time' patches (voluntary preempt from Ingo, for example), and is available both for 32 and 64 bit architectures. One problem I can think of for video is if you need binary-only drivers: those are generally pretty bad as far as low latency is concerned (nvidia drivers always cause some kind of problems with low latency and 'real-time' kernels). http://ubuntustudio.org/

David
On Tue, Dec 04, 2007 at 02:13:53PM +0900, David Cournapeau wrote:
With recent kernels, you can get really good latency if you do it right (around 1-2 ms worst case under high load, including high IO pressure).
As you can see on my page, I indeed measured less than 1ms latency on Linux under load with a kernel more than a year old. These things have gotten much better recently and with a preemptible kernel you should be able to get 1ms easily. Going below 0.5ms without using a realtime OS (i.e. a realtime kernel, under linux) is really pushing it.

Cheers,

Gaël
Gael Varoquaux wrote:
On Tue, Dec 04, 2007 at 02:13:53PM +0900, David Cournapeau wrote:
With recent kernels, you can get really good latency if you do it right (around 1-2 ms worst case under high load, including high IO pressure).
As you can see on my page, I indeed measured less than 1ms latency on Linux under load with kernel more than a year old. These things have gotten much better recently and with a premptible kernel you should be able to get 1ms easily. Going below 0.5ms without using a realtime OS (ie a realtime kernel, under linux) is really pushing it.
Yes, 1ms has been possible for quite a long time; the problem was how to get there (kernel patches, special permissions, etc. -- many of those problems are now gone). I've read that you could get around 0.2 ms and even below (worst case) with the latest kernels + RT preempt (that is, you still use linux, and not rtlinux). Below 1 ms does not make much sense for audio applications, so I don't know much below this range :)

But I am really curious whether you can get those numbers with python, because of malloc, the gc and co. I mean, for example, 0.5 ms latency on a 1 GHz CPU means that you get something like 500,000 CPU cycles, and I can imagine a cycle of garbage collection taking that many cycles, without even considering pages of virtual memory which are swapped out (in this case, we are talking millions of cycles).

cheers,

David
On Dec 4, 2007 3:05 AM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Gael Varoquaux wrote:
On Tue, Dec 04, 2007 at 02:13:53PM +0900, David Cournapeau wrote:
With recent kernels, you can get really good latency if you do it right (around 1-2 ms worst case under high load, including high IO pressure).
As you can see on my page, I indeed measured less than 1ms latency on Linux under load with kernel more than a year old. These things have gotten much better recently and with a premptible kernel you should be able to get 1ms easily. Going below 0.5ms without using a realtime OS (ie a realtime kernel, under linux) is really pushing it.
Yes, 1ms is possible for quite a long time; the problem was how to get there (kernel patches, special permissions, etc... Many of those problems are now gone). I've read that you could get around 0.2 ms and even below (worst case) with the last kernels + RT preempt (that is you still use linux, and not rtlinux). Below 1 ms does not make much sense for audio applications, so I don't know much below this range :)
But I am really curious if you can get those numbers with python, because of malloc, the gc and co. I mean for example, 0.5 ms latency for a 1 Ghz CPU means that you get something like a 500 000 CPU cycles, and I can imagine a cycle of garbage collection taking that many cycles, without even considering pages of virtual memory which are swapped (in this case, we are talking millions of cycles).
If the garbage collector is causing a slowdown, it is possible to turn it off. Then you have to be careful to break cycles manually. Non cyclic garbage will get picked up by reference counting, so you can ignore that. Figuring out references in the context of numpy might be a little tricky given that views imply references, but it's probably not impossible. -tim
cheers,
David
--
tim.hochberg@ieee.org
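A minimal sketch of the approach Tim describes -- disabling cyclic garbage collection around the time-critical loop and collecting afterwards. The frame list and drawing function here are dummy placeholders, not code from this thread:

import gc
import numpy as np

def draw_frame(frame):
    """Placeholder for the real (VisionEgg/OpenGL) screen update."""
    pass

# Dummy stand-in for the in-memory movie frames.
frames = [np.zeros((480, 720), np.uint8) for _ in range(10)]

gc.disable()                  # suspend cyclic garbage collection during playback
try:
    for frame in frames:
        draw_frame(frame)     # the time-critical update runs without GC pauses
finally:
    gc.enable()
    gc.collect()              # clean up any cycles created while collection was off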
Hi all,

I haven't done any serious testing in the past couple years, but for this particular task -- drawing frames using OpenGL without ever skipping a video update -- it is my impression that as of a few Ubuntu releases ago (Edgy?) Windows still beat linux.

Just now, I have investigated on 2.6.22-14-generic x86_64 as packaged by Ubuntu 7.10, and I didn't skip a frame out of 1500 at 60 Hz. That's not much testing, but it is certainly better performance than I've seen in the recent past, so I'll certainly be doing some more testing soon. Oh, how I'd love to never be forced to use Windows again.

Leaving my computer displaying moving images overnight (and tomorrow at lab on a 200 Hz display),

Andrew

Gael Varoquaux wrote:
On Tue, Dec 04, 2007 at 02:13:53PM +0900, David Cournapeau wrote:
With recent kernels, you can get really good latency if you do it right (around 1-2 ms worst case under high load, including high IO pressure).
As you can see on my page, I indeed measured less than 1ms latency on Linux under load with kernel more than a year old. These things have gotten much better recently and with a premptible kernel you should be able to get 1ms easily. Going below 0.5ms without using a realtime OS (ie a realtime kernel, under linux) is really pushing it.
Cheers,
Gaël
Andrew Straw wrote:
Hi all,
I haven't done any serious testing in the past couple years, but for this particular task -- drawing frames using OpenGL without ever skipping a video update -- it is my impression that as of a few Ubuntu releases ago (Edgy?) Windows still beat linux.
The problem is that this is the kind of thing which is really distribution dependent (because of kernel patches).
Just now, I have investigated on 2.6.22-14-generic x86_64 as pacakged by Ubuntu 7.10, and I didn't skip a frame out of 1500 at 60 Hz. That's not much testing, but it is certainly better performance than I've seen in the recent past, so I'll certainly be doing some more testing soon. Oh, how I'd love to never be forced to use Windows again.
You should try the rt kernel: https://wiki.ubuntu.com/RealTime/Gutsy. This does make a huge difference. cheers, David
On Monday 03 December 2007, Martin Spacek wrote:
Sebastian Haase wrote:
reading this thread I have two comments. a) *Displaying* at 200Hz probably makes little sense, since humans would only see about max. of 30Hz (aka video frame rate). Consequently you would want to separate your data frame rate, that (as I understand) you want to save data to disk and - asynchrounously - "display as many frames as you can" (I have used pyOpenGL for this with great satisfaction)
Hi Sebastian,
Although 30Hz looks pretty good, if you watch a 60fps movie, you can easily tell the difference. It's much smoother. Try recording AVIs on a point and shoot digital camera, if you have one that can do both 30fps and 60fps (like my fairly old Canon SD200).
And that's just perception. We're doing neurophysiology, recording from neurons in the visual cortex, which can phase lock to CRT screen rasters up to 100Hz. This is an artifact we don't want to deal with, so we use a 200Hz monitor. I need to be certain of exactly what's on the monitor on every refresh, ie every 5ms, so I run python (with Andrew Straw's package VisionEgg) as a "realtime" priority process in windows on a dual core computer, which lets me reliably update the video frame buffer in time for the next refresh, without having to worry about windows multitasking butting in and stealing CPU cycles for the next 15-20ms. Python runs on one core in "realtime", windows does its junk on the other core. Right now, every 3rd video refresh (ie every 15ms, which is 66.7 Hz, close to the original 60fps the movie was recorded at) I update with a new movie frame. That update needs to happen in less than 5ms, every time. If there's any disk access involved during the update, it inevitably exceeds that time limit, so I have to have it all in RAM before playback begins. Having a second I/O thread running on the second core would be great though.
Perhaps something that can surely improve your timings is first performing a read of your data file(s) while throwing away the data as you read it. This serves only to load the file entirely (if you have enough memory, which seems to be your case) into the OS page cache. Then, the second time your code has to read the data, the OS only has to retrieve it from its cache (i.e. in memory) rather than from disk.

You can do this with whatever technique you want, but if you are after reading from a single container and memmap is giving you headaches on 32-bit platforms, you might try PyTables, because it allows 64-bit disk addressing transparently, even on 32-bit machines.

HTH,

--
Francesc Altet
Cárabos Coop. V. -- Enjoy Data
http://www.carabos.com/
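A sketch of the warm-up pass Francesc describes: read the file once in chunks and throw the data away, so that a later pass finds it in the OS page cache. The file name is a placeholder, and (as Martin notes below) this only helps if the whole file actually fits, and stays, in the cache:

def warm_page_cache(fname, chunksize=16 * 1024 * 1024):
    """Read the file sequentially and discard the data, so the OS page
    cache is populated before the time-critical pass."""
    f = open(fname, 'rb')
    try:
        while f.read(chunksize):  # read() returns '' at EOF, ending the loop
            pass
    finally:
        f.close()

warm_page_cache('movie.bin')      # placeholder file name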
Francesc Altet wrote:
Perhaps something that can surely improve your timings is first performing a read of your data file(s) while throwing the data as you are reading it. This serves only to load the file entirely (if you have memory enough, but this seems your case) in OS page cache. Then, the second time that your code has to read the data, the OS only have to retrieve it from its cache (i.e. in memory) rather than from disk.
I think I tried that, loading the whole file into memory, throwing it away, then trying to load on the fly from "disk" (which would now hopefully be done more optimally the 2nd time around) while displaying the movie, but I still got update times > 5ms. The file's just too big to get any improvement by sort of preloading this way.
You can do this with whatever technique you want, but if you are after reading from a single container and memmap is giving you headaches in 32-bit platforms, you might try PyTables because it allows 64-bit disk addressing transparently, even on 32-bit machines.
PyTables sounds interesting, I might take a look. Thanks. Martin
Sebastian Haase wrote:
b) To my knowledge, any OS Linux, Windows an OSX can max. allocate about 1GB of data - assuming you have a 32 bit machine. The actual numbers I measured varied from about 700MB to maybe 1.3GB. In other words, you would be right at the limit. (For 64bit, you would have to make sure ALL parts are 64bit, e.g. The python version must be >=2.5, python must have been compiled using a 64-bit compiler *and* the windows version (XP-64)) This holds true the same for physical ram allocation and for memmap allocation. My solution to this was to "wait" for the 64bit .... not tested yet ;-)
By the way, I installed 64-bit linux (ubuntu 7.10) on the same machine, and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun! -- Martin
On Dec 20, 2007 3:22 AM, Martin Spacek <numpy@mspacek.mm.st> wrote:
Sebastian Haase wrote:
b) To my knowledge, any OS Linux, Windows an OSX can max. allocate about 1GB of data - assuming you have a 32 bit machine. The actual numbers I measured varied from about 700MB to maybe 1.3GB. In other words, you would be right at the limit. (For 64bit, you would have to make sure ALL parts are 64bit, e.g. The python version must be >=2.5, python must have been compiled using a 64-bit compiler *and* the windows version (XP-64)) This holds true the same for physical ram allocation and for memmap allocation. My solution to this was to "wait" for the 64bit .... not tested yet ;-)
By the way, I installed 64-bit linux (ubuntu 7.10) on the same machine, and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun!
Thanks for the feedback ! Did you get the kind of speed you need and/or the speed you were hoping for ? -Sebastian
By the way, I installed 64-bit linux (ubuntu 7.10) on the same machine, and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun!
Thanks for the feedback ! Did you get the kind of speed you need and/or the speed you were hoping for ?
Nope. Like I wrote earlier, it seems there isn't time for disk access in my main loop, which is what memmap is all about. I resolved this by loading the whole file into memory as a python list of 2D arrays, instead of one huge contiguous 3D array. That got me an extra 100 to 200 MB of physical memory to work with (about 1.4GB out of 2GB total) on win32, which is all I needed. Martin
On Dec 21, 2007 12:11 AM, Martin Spacek <numpy@mspacek.mm.st> wrote:
By the way, I installed 64-bit linux (ubuntu 7.10) on the same machine, and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun!
Thanks for the feedback ! Did you get the kind of speed you need and/or the speed you were hoping for ?
Nope. Like I wrote earlier, it seems there isn't time for disk access in my main loop, which is what memmap is all about. I resolved this by loading the whole file into memory as a python list of 2D arrays, instead of one huge contiguous 3D array. That got me an extra 100 to 200 MB of physical memory to work with (about 1.4GB out of 2GB total) on win32, which is all I needed.
Instead of saying "memmap is ALL about disc access" I would rather like to say that "memap is all about SMART disk access" -- what I mean is that memmap should run as fast as a normal ndarray if it works on the cached part of an array. Maybe there is a way of telling memmap when and what to cache and when to sync that cache to the disk. In other words, memmap should perform just like a in-pysical-memory array -- only that it once-in-a-while saves/load to/from the disk. Or is this just wishful thinking ? Is there a way of "pre loading" a given part into cache (pysical-memory) or prevent disc writes at "bad times" ? How about doing the sync from a different thread ;-) -Sebastian
Sebastian Haase wrote:
On Dec 21, 2007 12:11 AM, Martin Spacek <numpy@mspacek.mm.st> wrote:
By the way, I installed 64-bit linux (ubuntu 7.10) on the same machine, and now numpy.memmap works like a charm. Slicing around a 15 GB file is fun!
Thanks for the feedback ! Did you get the kind of speed you need and/or the speed you were hoping for ?
Nope. Like I wrote earlier, it seems there isn't time for disk access in my main loop, which is what memmap is all about. I resolved this by loading the whole file into memory as a python list of 2D arrays, instead of one huge contiguous 3D array. That got me an extra 100 to 200 MB of physical memory to work with (about 1.4GB out of 2GB total) on win32, which is all I needed.
Instead of saying "memmap is ALL about disc access" I would rather like to say that "memap is all about SMART disk access" -- what I mean is that memmap should run as fast as a normal ndarray if it works on the cached part of an array. Maybe there is a way of telling memmap when and what to cache and when to sync that cache to the disk. In other words, memmap should perform just like a in-pysical-memory array -- only that it once-in-a-while saves/load to/from the disk. Or is this just wishful thinking ? Is there a way of "pre loading" a given part into cache (pysical-memory) or prevent disc writes at "bad times" ? How about doing the sync from a different thread ;-)
mmap is using the OS IO caches, that's kind of the point of using mmap (at least in this case). Instead of doing the caching yourself, the OS does it for you, and OSes are supposed to be smart about this :)

cheers,

David
On Friday, 21 December 2007, 13:23:49, David Cournapeau wrote:
Instead of saying "memmap is ALL about disc access" I would rather like to say that "memap is all about SMART disk access" -- what I mean is that memmap should run as fast as a normal ndarray if it works on the cached part of an array. Maybe there is a way of telling memmap when and what to cache and when to sync that cache to the disk. In other words, memmap should perform just like a in-pysical-memory array -- only that it once-in-a-while saves/load to/from the disk. Or is this just wishful thinking ? Is there a way of "pre loading" a given part into cache (pysical-memory) or prevent disc writes at "bad times" ? How about doing the sync from a different thread ;-)
mmap is using the OS IO caches, that's kind of the point of using mmap (at least in this case). Instead of doing the caching yourself, the OS does it for you, and OS are supposed to be smart about this :)
AFAICS this is what Sebastian wanted to say, but as the OP indicated, preloading e.g. by reading the whole array once did not work for him. Thus, I understand Sebastian's questions as "is it possible to help the OS when it is not smart enough?". Maybe something along the lines of mlock, only not quite as aggressive.

Ciao,
Hans
Hans Meine wrote:
On Friday, 21 December 2007, 13:23:49, David Cournapeau wrote:
Instead of saying "memmap is ALL about disc access" I would rather like to say that "memap is all about SMART disk access" -- what I mean is that memmap should run as fast as a normal ndarray if it works on the cached part of an array. Maybe there is a way of telling memmap when and what to cache and when to sync that cache to the disk. In other words, memmap should perform just like a in-pysical-memory array -- only that it once-in-a-while saves/load to/from the disk. Or is this just wishful thinking ? Is there a way of "pre loading" a given part into cache (pysical-memory) or prevent disc writes at "bad times" ? How about doing the sync from a different thread ;-)
mmap is using the OS IO caches, that's kind of the point of using mmap (at least in this case). Instead of doing the caching yourself, the OS does it for you, and OS are supposed to be smart about this :)
AFAICS this is what Sebastian wanted to say, but as the OP indicated, preloading e.g. by reading the whole array once did not work for him. Thus, I understand Sebastian's questions as "is it possible to help the OS when it is not smart enough?". Maybe something along the lines of mlock, only not quite as aggressive.
I don't know exactly why it did not work, but it is not difficult to imagine why it could fail: when you read a 2 GB file, it may not be smart on average to put the whole file in the page cache, since everything else gets kicked out. It all depends on the situation, but there are many different things which can influence this behaviour: the IO scheduler, how smart the VM is, the filesystem (on linux, some filesystems are better than others for RT audio dsp, and some options are better left out), etc. On Linux, using the deadline IO scheduler can help, for example (that's the recommended scheduler for IO-intensive musical applications).

But if what you want is to be able to reliably read "in real time" a big file which cannot fit in memory, then you need a design where something does the disk buffering the way you want it done (again, taking the example I am somewhat familiar with: in audio processing, you often have an IO thread which does the pre-caching and hands the data in mlock'ed buffers to another thread, the RT one).

cheers,

David
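A rough sketch of the two-thread design David describes, transposed to the movie-playback case: an I/O thread copies upcoming frames from the memmap into a bounded queue, and the display loop only ever touches data that is already in RAM. There is no mlock here, and every name is a placeholder rather than code from this thread:

import threading
import Queue                     # named 'queue' in Python 3
import numpy as np

def start_prefetcher(frames_on_disk, depth=64):
    """Pre-read frames into in-memory copies, at most `depth` frames ahead."""
    q = Queue.Queue(maxsize=depth)

    def reader():
        for i in xrange(len(frames_on_disk)):
            # np.array() copies the memmap'd frame into RAM; put() blocks
            # whenever the queue is full, so we never read too far ahead.
            q.put(np.array(frames_on_disk[i]))
        q.put(None)              # sentinel: no more frames

    t = threading.Thread(target=reader)
    t.setDaemon(True)
    t.start()
    return q

# Display loop sketch -- frames_on_disk would be e.g.
#   np.memmap(fname, dtype=np.uint8, mode='r').reshape(nframes, height, width)
#
# q = start_prefetcher(frames_on_disk)
# while True:
#     frame = q.get()
#     if frame is None:
#         break
#     draw_frame(frame)          # placeholder for the actual screen update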
On Dec 21, 2007 6:45 AM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Hans Meine wrote:
On Friday, 21 December 2007, 13:23:49, David Cournapeau wrote:
Instead of saying "memmap is ALL about disc access" I would rather like to say that "memap is all about SMART disk access" -- what I mean is that memmap should run as fast as a normal ndarray if it works on the cached part of an array. Maybe there is a way of telling memmap when and what to cache and when to sync that cache to the disk. In other words, memmap should perform just like a in-pysical-memory array -- only that it once-in-a-while saves/load to/from the disk. Or is this just wishful thinking ? Is there a way of "pre loading" a given part into cache (pysical-memory) or prevent disc writes at "bad times" ? How about doing the sync from a different thread ;-)
mmap is using the OS IO caches, that's kind of the point of using mmap (at least in this case). Instead of doing the caching yourself, the OS does it for you, and OS are supposed to be smart about this :)
AFAICS this is what Sebastian wanted to say, but as the OP indicated, preloading e.g. by reading the whole array once did not work for him. Thus, I understand Sebastian's questions as "is it possible to help the OS when it is not smart enough?". Maybe something along the lines of mlock, only not quite as aggressive.
I don't know exactly why it did not work, but it is not difficult to imagine why it could fail (when you read a 2 Gb file, it may not be smart on average to put the whole file in the buffer, since everything else is kicked out). It all depends on the situation, but there are many different things which can influence this behaviour: the IO scheduler, how smart the VM is, the FS (on linux, some FS are better than others for RT audio dsp, and some options are better left out), etc... On Linux, using the deadline IO scheduler can help, for example (that's the recommended scheduler for IO intensive musical applications).
<snip>
But if what you want is to reliable being able to read "in real time" a big file which cannot fit in memory, then you need a design where something is doing the disk buffering as you want (again, taking the example I am somewhat familiar with, in audio processing, you often have a IO thread which does the pre-caching, and put the data into mlock'ed buffers to another thread, the one which is RT).
IIRC, Martin really wanted something like streaming IO broken up into smaller frames with previously cached results ideally discarded. Chuck
On Saturday 01 December 2007, Martin Spacek wrote:
Kurt Smith wrote:
You might try numpy.memmap -- others have had success with it for large files (32 bit should be able to handle a 1.3 GB file, AFAIK).
Yeah, I looked into numpy.memmap. Two issues with that. I need to eliminate as much disk access as possible while my app is running. I'm displaying stimuli on a screen at 200Hz, so I have up to 5ms for each movie frame to load before it's too late and it drops a frame. I'm sort of faking a realtime OS on windows by setting the process priority really high. Disk access in the middle of that causes frames to drop. So I need to load the whole file into physical RAM, although it need not be contiguous. memmap doesn't do that, it loads on the fly as you index into the array, which drops frames, so that doesn't work for me.
Sounds as if using memmap and then copying each frame into a separate in-memory ndarray could help?

Ciao,
Hans
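In code, Hans's suggestion might look something like this (the file name and geometry are placeholders) -- essentially the list-of-frames approach Martin settles on at the end of the thread, with memmap doing the file reading:

import numpy as np

# Placeholder geometry; substitute the real values.
fname, nframes, height, width = 'movie.bin', 4000, 480, 720

mm = np.memmap(fname, dtype=np.uint8, mode='r').reshape(nframes, height, width)

# Copy each frame out of the mapping into an ordinary in-memory array.
# The per-frame copies need not be contiguous with each other, so this
# sidesteps the single huge allocation that fromfile() required.
frames = [np.array(mm[i]) for i in xrange(nframes)]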
Martin Spacek (2007-11-30 at 00:47:41 -0800) wrote::
[...] I find that if I load the file in two pieces into two arrays, say 1GB and 0.3GB respectively, I can avoid the memory error. So it seems that it's not that windows can't allocate the memory, just that it can't allocate enough contiguous memory. I'm OK with this, but for indexing convenience, I'd like to be able to treat the two arrays as if they were one. Specifically, this file is movie data, and the array I'd like to get out of this is of shape (nframes, height, width). [...]
Well, one thing you could do is dump your data into a PyTables_ ``CArray`` dataset, which you may afterwards access as if it was a NumPy array to get slices which are actually NumPy arrays. PyTables datasets have no problem in working with datasets exceeding memory size. For instance::

    h5f = tables.openFile('foo.h5', 'w')
    carray = h5f.createCArray(
        '/', 'bar',
        atom=tables.UInt8Atom(),
        shape=(TOTAL_NROWS, 3) )
    base = 0
    for array in your_list_of_partial_arrays:
        carray[base:base+len(array)] = array
        base += len(array)
    carray.flush()

    # Now you can access ``carray`` as a NumPy array.
    carray[42]    --> a (3,) uint8 NumPy array
    carray[10:20] --> a (10, 3) uint8 NumPy array
    carray[42,2]  --> a NumPy uint8 scalar, "width" for row 42

(You may use an ``EArray`` dataset if you want to enlarge it with new rows afterwards, or a ``Table`` if you want a different type for each field.)

.. _PyTables: http://www.pytables.org/

HTH,

Ivan Vilata i Balaguer
Cárabos Coop. V. -- Enjoy Data
http://www.carabos.com/
Well, one thing you could do is dump your data into a PyTables_ ``CArray`` dataset, which you may afterwards access as if its was a NumPy array to get slices which are actually NumPy arrays. PyTables datasets have no problem in working with datasets exceeding memory size. For instance::
I've recently started using PyTables for storing large datasets and I'd give it 10/10! Access is fast enough you can just access the data you need and leave the full array on disk. BC
Ivan Vilata i Balaguer (2007-11-30 at 19:19:38 +0100) wrote::
Well, one thing you could do is dump your data into a PyTables_ ``CArray`` dataset, which you may afterwards access as if its was a NumPy array to get slices which are actually NumPy arrays. PyTables datasets have no problem in working with datasets exceeding memory size. [...]
I've put together the simple script I've attached, which dumps a binary file into a PyTables ``CArray`` or loads it to measure the time taken to load each frame. I've run it on my laptop, which has a not very fast 4200 RPM laptop hard disk, and I've reached average times of 16 ms per frame, after dropping caches with::

    # sync && echo 1 > /proc/sys/vm/drop_caches

This I've done with the standard chunkshape and no compression. Your data may lend itself very well to bigger chunkshapes and compression, which should lower access times even further. Since (as David pointed out) 200 Hz may be a little exaggerated for the human eye, loading individual frames from disk may prove more than enough for your problem.

HTH,

Ivan Vilata i Balaguer
Cárabos Coop. V. -- Enjoy Data
http://www.carabos.com/
Martin Spacek wrote:
Would it be better to load the file one frame at a time, generating nframes arrays of shape (height, width), and sticking them consecutively in a python list?
I just tried this, and it works. Looks like it's all in physical RAM (no disk thrashing on the 2GB machine), *and* it's easy to index into. I guess I should have thought of this a while ago, since each entry in a python list can point to anywhere in memory. Here's roughly what the code looks like:

import numpy as np

f = file(fname, 'rb') # 1.3GB file
frames = [None] * nframes # init a list to hold all frames
for framei in xrange(nframes): # one frame at a time...
    frame = np.fromfile(f, np.uint8, count=framesize) # load next frame (framesize = height * width bytes)
    frame.shape = (height, width)
    frames[framei] = frame # save it in the list

-- Martin
participants (13)

- Andrew Straw
- Bryan Cole
- Charles R Harris
- David Cournapeau
- Francesc Altet
- Gael Varoquaux
- Hans Meine
- Ivan Vilata i Balaguer
- Kurt Smith
- Martin Spacek
- Sebastian Haase
- Sebastian Haase
- Timothy Hochberg