fast constructor for arrays from byte data (unpickling?)
Hi,
Arrays have a method called tostring() which generates the binary data. Is there an inverse function to that? That is, generating an array from the binary data string?
For large matrices, pickling/unpickling is a bit too much overhead (data stored as ASCII character strings instead of binary data strings). I know, I am talking about a factor of 4 here. But there is a big difference between a 1-minute loading time and a 4-minute loading time. I would imagine this is a very common problem/request for people working with large matrices. And I am sure hacking the C code to provide another fast constructor for arrays from binary strings won't be too hard. The question is: has anyone already tried it? Is it already there?
(For the kludge masters: one kludge is of course to store the binary data on disk, then use cStringIO to build the pickled file and then unpickle from the cStringIO. Speed is probably OK since the pickled file lives in RAM. But that's a kludge. :) )
regards,
Hung Jung
See fromstring()
>>> import Numeric
>>> x = Numeric.arange(10)
>>> s = x.tostring()
>>> y = Numeric.fromstring(s)
>>> y
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
On Tue, Aug 07, 2001 at 12:50:04PM -0700, Hung Jung Lu wrote:
Hi,
Arrays have a method called tostring() which generates the binary data. Is there an inverse function to that? That is, generating an array from the binary data string?
Numeric.fromstring
HTH.
--
Robert Kern
kern@caltech.edu
"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
That is, generating an array from the binary data string?
Numeric.fromstring
Ahhh! Now that seems so obvious. :) (Banging my head.)
thanks!
Hung Jung
Hung Jung Lu wrote:
Arrays have a method called tostring() which generates the binary data. Is there an inverse function to that? That is, generating an array from the binary data string?
As you suspected, there is such a thing, and it is called (surprise): fromstring(). It is a function rather than a method, as it is a constructor for an array. To get data from a file, you have to get it into a string first, so I use:
M = fromstring(file.read(), Float)
This makes a rank-1 array. You might have to reshape it. Note that you end up creating a Python string of the data first, and then a NumPy array from that. This doesn't really cost that much, but it can be an issue with huge data sets. I wish there was a fromfile() function. I may get around to writing it one day.
-Chris
--
Christopher Barker, Ph.D.
ChrisHBarker@home.net
http://members.home.net/barkerlohmann
Oil Spill Modeling
Water Resources Engineering
Coastal and Fluvial Hydrodynamics
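A minimal sketch of the read-then-convert pattern described in this message. Numeric is no longer maintained, so the sketch assumes its successor NumPy is installed and uses `numpy.frombuffer` and `tobytes()` where the thread uses `fromstring()` and `tostring()`. The file name, element type, and shape are invented for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: a 3x4 matrix of 64-bit floats.
original = np.arange(12, dtype=np.float64).reshape(3, 4)

# Write the raw bytes to disk (the binary form discussed in the thread).
path = os.path.join(tempfile.mkdtemp(), "matrix.dat")
with open(path, "wb") as f:
    f.write(original.tobytes())

# Read the whole file into a bytes object, then build the array from it.
with open(path, "rb") as f:
    raw = f.read()
flat = np.frombuffer(raw, dtype=np.float64)

# The raw bytes carry no shape information, so the result is rank 1
# and has to be reshaped by hand.
restored = flat.reshape(3, 4)
```

The element type and shape have to be stored or known separately, since the binary data alone does not record them.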
On Tue, Aug 07, 2001 at 01:26:57PM -0700, Chris Barker wrote: [snip]
Note that you end up creating a Python string of the data first, and then a NumPy array from that. This doesn't really cost that much, but it can be an issue with huge data sets. I wish there was a fromfile() function. I may get around to writing it one day.
With Python 2.0 or greater, one can use mmap'ed files as arguments to fromstring.
-- Robert Kern
Robert Kern wrote:
On Tue, Aug 07, 2001 at 01:26:57PM -0700, Chris Barker wrote:
I wish there was a fromfile() function.
With Python 2.0 or greater, one can use mmap'ed files as arguments to fromstring.
Can you give me an example of how to use it? I cannot get it to work at all; the docs are pretty sketchy. I can't even get a mmap'ed file to be created properly on Linux. I haven't tried Windows yet, but I'll need it to work there too!
Also, mmap does not seem to be supported on the Mac, which is where I am having memory problems (due to the Mac's lousy memory management). I'll ask about it on the MacPython list.
-Chris
On Tue, Aug 07, 2001 at 03:02:01PM -0700, Chris Barker wrote:
Robert Kern wrote:
On Tue, Aug 07, 2001 at 01:26:57PM -0700, Chris Barker wrote:
I wish there was a fromfile() function.
With Python 2.0 or greater, one can use mmap'ed files as arguments to fromstring.
Can you give me an example of how to use it? I cannot get it to work at all; the docs are pretty sketchy. I can't even get a mmap'ed file to be created properly on Linux. I haven't tried Windows yet, but I'll need it to work there too!
Yeah, it took me some fiddling, too, before I got it to work. The Windows call to mmap has slightly different semantics in that second argument, but I think that if you want to snarf the whole file, then what I do below should work as well. The Windows API lets you use 0 to default to the whole file, but that's not portable.
Python 2.0.1 (#0, Jul 3 2001, 12:36:30)
[GCC 2.95.4 20010629 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import Numeric
>>> import mmap
>>> f = open('earth.dat', 'r+')
>>> f.seek(0, 2)  # seek to EOF
>>> size = f.tell()
>>> f.seek(0)  # rewind; possibly not necessary
>>> m = mmap.mmap(f.fileno(), size)
>>> a = Numeric.fromstring(m)
>>> size
548240
>>> a.shape
(137060,)
>>> size / 4
137060
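For readers on a current Python, the same idea can be sketched with NumPy standing in for Numeric (an assumption, since the thread predates NumPy). In current Python versions a length of 0 maps the whole file on both Unix and Windows, so the seek/tell step is not needed there. The file contents and element type below are invented for the example.

```python
import mmap
import os
import tempfile

import numpy as np

# Hypothetical stand-in for 'earth.dat': 1000 32-bit integers.
path = os.path.join(tempfile.mkdtemp(), "earth.dat")
np.arange(1000, dtype=np.int32).tofile(path)

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = len(m)
    # frombuffer() wraps the mapping without copying it; copy() then
    # produces an independent array so the mapping can be closed.
    a = np.frombuffer(m, dtype=np.int32).copy()
    m.close()
```

Dropping the `copy()` would keep the array backed by the mapping, at the price of having to keep the mapping open for as long as the array is in use.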
Also, mmap does not seem to be supported on the Mac, which is where I am having memory problems (due to the Mac's lousy memory management). I'll ask about it on the MacPython list.
You're probably SOL, there, but I'll wish you good luck anyways.
if-wishes-were-mmap-implementations-ly yours,
-- Robert Kern
Robert Kern wrote:
Yeah, it took me some fiddling, too, before I got it to work. The Windows call to mmap has slightly different semantics in that second argument, but I think that if you want to snarf the whole file, then what I do below should work as well. The Windows API lets you use 0 to default to the whole file, but that's not portable.
[code snipped]
No, the "0" argument does not seem to work on Unix. It seems odd not to have a way to just snarf the whole file automatically, but so be it.
Thanks, that works, but now I am wondering: what I want is a fast and memory-efficient way to get the contents of a file into a NumPy array, and this sure doesn't look any better than:
a = fromstring(file.read())
thanks anyway,
-Chris
On Tue, Aug 07, 2001 at 04:33:30PM -0700, Chris Barker wrote: [snip]
Thanks, that works, but now I am wondering: what I want is a fast and memory-efficient way to get the contents of a file into a NumPy array, and this sure doesn't look any better than:
a = fromstring(file.read())
Depends on how large the file is. file.read() creates a temporary string the size of the file. That string isn't freed until fromstring() finishes and returns the array object. For a brief time, both the string and the array have duplicates of the same data taking up space in memory. I don't know the details of mmap, so it's certainly possible that the only way that fromstring knows how to access the data is to pull all of it into memory first, thus recreating the problem. Alas.
-- Robert Kern
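The duplication described in this message applies when the constructor copies the bytes it is given. As a hedged aside using NumPy (an assumption; the thread itself is about Numeric), `numpy.frombuffer` instead wraps the existing bytes without copying, which leaves the resulting array read-only. A small sketch with made-up data:

```python
import numpy as np

# Made-up payload standing in for the result of file.read().
raw = np.arange(5, dtype=np.float64).tobytes()

# A view over the bytes object: no second copy of the data is made,
# but the array does not own its memory and cannot be written to.
view = np.frombuffer(raw, dtype=np.float64)

# An explicit copy: this is the point where the data exists twice.
owned = np.array(view)

view_owns = bool(view.flags.owndata)
view_writeable = bool(view.flags.writeable)
owned_owns = bool(owned.flags.owndata)
owned_writeable = bool(owned.flags.writeable)
```

The trade-off is that the view keeps the original bytes object alive for as long as the array exists.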
Robert Kern wrote:
Thanks, that works, but now I am wondering: what I want is a fast and memory-efficient way to get the contents of a file into a NumPy array, and this sure doesn't look any better than:
a = fromstring(file.read())
Depends on how large the file is. file.read() creates a temporary string the size of the file. That string isn't freed until fromstring() finishes and returns the array object. For a brief time, both the string and the array have duplicates of the same data taking up space in memory.
Exactly. Also, is the memory used for the string guaranteed to be freed right away? I have no idea how Python internals work. Anyway, that's why I want a "fromfile()" method, like the one in the library array module.
I don't know the details of mmap, so it's certainly possible that the only way that fromstring knows how to access the data is to pull all of it into memory first, thus recreating the problem. Alas.
I don't understand mmap at all. From the name, it sounds like the entire contents of the file are mapped into memory, so the memory would get used as soon as you set it up. If anyone knows, I'd like to hear...
-Chris
On Wed, Aug 08, 2001 at 01:37:02PM -0700, Chris Barker wrote:
Robert Kern wrote:
Thanks, that works, but now I am wondering: what I want is a fast and memory-efficient way to get the contents of a file into a NumPy array, and this sure doesn't look any better than:
a = fromstring(file.read())
Depends on how large the file is. file.read() creates a temporary string the size of the file. That string isn't freed until fromstring() finishes and returns the array object. For a brief time, both the string and the array have duplicates of the same data taking up space in memory.
Exactly. Also, is the memory used for the string guaranteed to be freed right away? I have no idea how Python internals work.
Once fromstring returns, the string's refcount drops to 0 and it should be freed just about right away. I'm not sure, but I don't think the new gc will affect that.
Anyway, that's why I want a "fromfile()" method, like the one in the library array module.
Yes, I agree, it would be nice to have.
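For reference, the standard-library `array` module's `fromfile()` mentioned here reads a given number of items from an open file object. How much temporary memory it needs is an implementation detail, so the sketch below only illustrates the interface. The file name and item count are invented.

```python
import array
import os
import tempfile

# Hypothetical data file: 1000 C doubles written in native byte order.
path = os.path.join(tempfile.mkdtemp(), "values.dat")
values = array.array("d", (i * 0.5 for i in range(1000)))
with open(path, "wb") as f:
    values.tofile(f)

# fromfile() needs the item count up front, so derive it from the file size.
count = os.path.getsize(path) // values.itemsize

# Append the items read from the file to an initially empty array.
loaded = array.array("d")
with open(path, "rb") as f:
    loaded.fromfile(f, count)
```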
I don't know the details of mmap, so it's certainly possible that the only way that fromstring knows how to access the data is to pull all of it into memory first, thus recreating the problem. Alas.
I don't understand mmap at all. From the name, it sounds like the entire contents of the file are mapped into memory, so the memory would get used as soon as you set it up. If anyone knows, I'd like to hear...
Performing a very cursory, very non-scientific study, I created a 110M file, mmap'ed it, then made a string from some of it. I'm using 64MB of RAM, 128MB swap partition on a Linux 2.4.7 machine. According to top(1), the memory use didn't jump up until I made the string. Also, given that the call to mmap returned almost instantaneously, I'd say that mmap doesn't pull the whole file into memory when the object is created (one could just read the file for that). I don't know what test to perform to see whether the fromstring() constructor uses double memory, but my guess would be that memcpy won't pull in the whole file before copying. OTOH, the accessed portions of the mmap'ed file may be kept in memory. Does anyone know the details on mmap? I'm shooting in the dark, here. <reads Phil Austin's e-mail> Ooh, nice.
-- Robert Kern
participants (4)
- Chris Barker
- Hung Jung Lu
- Paul F. Dubois
- Robert Kern