Hello all,

I am finding that directly packing numpy arrays into binary using the tostring and fromstring methods does not provide a speed improvement over writing the same arrays to ASCII files. Obviously the resulting files are far smaller, but I was hoping for an improvement in the speed of writing as well. I did get that speed improvement using the struct module directly, or by using generic Python arrays.

Let me describe my methodological issue further, as it may directly relate to any solution you might have. My output file is heterogeneous: each line is an array of either integers or floats. The records serve as a sparse representation of a large matrix, and each record is made up of three entries:

1. row, n (both integers)
2. an array of integers of length n, representing columns
3. an array of floats of length n, representing values

Here "n" is not constant across the records, so many of the database structures I have looked at do not apply. Any suggestions would be greatly appreciated.

Mark Janikas
On Tue, Feb 13, 2007 at 11:42:35AM -0800, Mark Janikas wrote:
I am finding that directly packing numpy arrays into binary using the tostring and fromstring methods does not provide a speed improvement over writing the same arrays to ASCII files. [...]
Hi Mark,

Can you post a benchmark code snippet to demonstrate your results? Here, using 1.0.2.dev3545, I see:

```python
In [26]: x = N.random.random(100)

In [27]: timeit f = file('/tmp/blah.dat','w'); f.write(str(x))
100 loops, best of 3: 1.77 ms per loop

In [28]: timeit f = file('/tmp/blah','w'); x.tofile(f)
10000 loops, best of 3: 100 µs per loop
```

(I see the same results for heterogeneous arrays.)

Cheers
Stéfan
Good call Stefan, I decoupled the timing from the application (duh!) and got better results:

```python
from numpy import *
import numpy.random as RAND
import time as TIME

x = RAND.random(1000)
xl = x.tolist()

t1 = TIME.clock()
xStringOut = [ str(i) for i in xl ]
xStringOut = " ".join(xStringOut)
f = file('blah.dat','w')
f.write(xStringOut)
t2 = TIME.clock()
total = t2 - t1

t1 = TIME.clock()
f = file('blah.bwt','wb')
xBinaryOut = x.tostring()
f.write(xBinaryOut)
t2 = TIME.clock()
total1 = t2 - t1
```
```
total   0.00661
total1  0.00229
```
Printing x directly to a string took REALLY long: f.write(str(x)) = 0.0258

The problem, therefore, must be in the way I am appending values to the empty arrays. I am currently using the append function:

```python
myArray = append(myArray, newValue)
```

Or would it be faster to concat, or to use a list append and then convert? But to be more sure, I'll have to profile it. It seems a bit odd in that there are far fewer loops and conversions in my current implementation for the binary, yet it is still running slower.
On 2/13/07, Mark Janikas <mjanikas@esri.com> wrote:
The problem, therefore, must be in the way I am appending values to the empty arrays. I am currently using the append function:

```python
myArray = append(myArray, newValue)
```

Or would it be faster to concat, or to use a list append and then convert?
I am going to guess that a list would be faster for appending. Concat and, I suspect, append make new arrays for each use, rather like string concatenation in Python. A list, on the other hand, is no doubt optimized for adding new values. Another option might be using PyTables with extensible arrays. In any case, a bit of timing should show the way if the performance is that crucial to your application. Chuck
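A minimal sketch of that timing comparison (names are illustrative; exact numbers will vary by machine and numpy version):

```python
import time
import numpy as np

n = 100000

# Growing an ndarray with numpy's append: every call allocates a new
# array and copies the old contents, so the loop is O(n**2) overall.
t0 = time.time()
arr = np.zeros(0)
for i in range(n):
    arr = np.append(arr, 0.5 * i)
t_append = time.time() - t0

# Python lists over-allocate, so list.append is amortized O(1);
# convert to an ndarray once, at the end.
t0 = time.time()
lst = []
for i in range(n):
    lst.append(0.5 * i)
arr = np.array(lst)
t_list = time.time() - t0

print("np.append in a loop:     %.3f s" % t_append)
print("list.append + np.array:  %.3f s" % t_list)
```

The list version typically wins by an order of magnitude or more.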
Yup. It was faster to:

use lists for the append, then transform into an array, then transform into a binary string,

rather than

create empty arrays and use numpy's append, then transform into a binary string.

The last question on the output would then be to test the speed of using generic Python arrays (the array module), which have append methods as well. Then there would only be the binary string conversion, as opposed to list --> numpy array --> binary string.

Thanks to all for your input....

MJ
Mark Janikas wrote:
I am finding that directly packing numpy arrays into binary using the tostring and fromstring methods
For starters, use fromfile and tofile to save the overhead of creating an entire extra string. fromfile is a function, as it is an alternate constructor for arrays: numpy.fromfile(). ndarray.tofile() is an array method.

Enclosed is your test, including a test for tofile(). I needed to make the arrays much larger, and to use time.time() rather than time.clock(), to get enough time resolution to see anything; though if you really want to be accurate, you need to use the timeit module. My results:

```
Using lists     0.457561016083
Using tostring  0.00922703742981
Using tofile    0.00431108474731
```

Another note: where is the data coming from? There may be ways to optimize this whole process if we saw that.

-Chris

Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov

```python
#!/usr/bin/env python
from numpy import *
import numpy.random as RAND
import time as TIME

x = RAND.random(100000)
xl = x.tolist()

t1 = TIME.time()
xStringOut = [ str(i) for i in xl ]
xStringOut = " ".join(xStringOut)
f = file('blah.dat','w')
f.write(xStringOut)
t2 = TIME.time()
total = t2 - t1
print "Using lists", total

t1 = TIME.time()
f = file('blah.bwt','wb')
xBinaryOut = x.tostring()
f.write(xBinaryOut)
t2 = TIME.time()
total1 = t2 - t1
print "Using tostring", total1

t1 = TIME.time()
f = file('blah.bwt','wb')
x.tofile(f)
t2 = TIME.time()
total1 = t2 - t1
print "Using tofile", total1
```
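Reading the binary file back is just as direct. A minimal sketch, assuming the blah.bwt file written by the script above:

```python
import numpy as np

# The file holds raw float64 bytes and no metadata, so fromfile has to
# be told the dtype (and, optionally, how many items to read).
f = open('blah.bwt', 'rb')
y = np.fromfile(f, dtype=np.float64)
f.close()
print(y.shape)  # (100000,) for the array written above
```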
I don't think I can do that because I have heterogeneous rows of data, i.e. the columns in each row are different in length. Furthermore, when reading it back in, I want to read only part of the info at a time so I can save memory. In this case, I only want to have one record in memory at once.

Another issue has arisen from taking this routine cross-platform: if I write the file on Windows, I can't read it on Solaris. I assume big/little endianness is at hand here. I know that using the struct module I can pack using either one. Perhaps I will have to go back to the drawing board. I actually love these methods now because I get back out directly what I put in. Great kudos to the developers....

MJ
On Tue, Feb 13, 2007 at 03:44:37PM -0800, Mark Janikas wrote:
I don't think I can do that because I have heterogeneous rows of data, i.e. the columns in each row are different in length. Furthermore, when reading it back in, I want to read only part of the info at a time so I can save memory. In this case, I only want to have one record in memory at once.

Another issue has arisen from taking this routine cross-platform: if I write the file on Windows, I can't read it on Solaris. I assume big/little endianness is at hand here.
Indeed. You may want to take a look at npfile, the new IO module in scipy written by Matthew Brett (you don't have to install the whole of scipy to use it; just grab the file).

Cheers
Stéfan
Yes, but does the code have the same license as NumPy? As I work for a software company, where I help with the scripting interface, I must make sure everything I use is cited and has the appropriate license.

MJ
Mark Janikas wrote:
Yes, but does the code have the same license as NumPy? As I work for a software company, where I help with the scripting interface, I must make sure everything I use is cited and has the appropriate license.
The numpy and scipy licenses are the same except for details like the licensing party.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Tue, Feb 13, 2007 at 04:02:10PM -0800, Mark Janikas wrote:
Yes, but does the code have the same license as NumPy? As I work for a software company, where I help with the scripting interface, I must make sure everything I use is cited and has the appropriate license.
Yes, Scipy and Numpy are both released under BSD licenses.

Cheers
Stéfan
Mark Janikas wrote:
I don't think I can do that because I have heterogeneous rows of data, i.e. the columns in each row are different in length.
Like I said, show us your whole problem... But you don't have to write/read all the data at once with from/tofile() anyway. Each of your "rows" has to be in a separate array, as numpy arrays don't support "ragged" arrays, but each row can be written with tofile().
Furthermore, when reading it back in, I want to read only part of the info at a time so I can save memory. In this case, I only want to have one record in memory at once.
You can make multiple calls to fromfile(), though you'll have to know how long each record is.
Another issue has arisen from taking this routine cross-platform: if I write the file on Windows, I can't read it on Solaris. I assume big/little endianness is at hand here.
yup.
I know that using the struct module I can pack using either one.
So can numpy: see the byteswap method. And you can specify a particular endianness with a datatype when you read with fromfile():

```python
a = N.fromfile(DataFile, dtype=N.dtype("<d"), count=20)
```

reads 20 little-endian doubles from DataFile, regardless of the native endianness of the machine you're on.

-Chris

Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
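Putting those pieces together, a minimal round-trip sketch (filenames are illustrative; the explicit "<f8" dtype pins the file format to little-endian doubles on every platform):

```python
import numpy as np

weights = np.random.random(5)

# Write: cast to an explicitly little-endian double so the bytes on
# disk are identical no matter which platform wrote them.
f = open('weights.bin', 'wb')
weights.astype('<f8').tofile(f)
f.close()

# Read: declare the same explicit dtype; numpy transparently byteswaps
# when computing with the result on a big-endian host.
f = open('weights.bin', 'rb')
back = np.fromfile(f, dtype='<f8', count=5)
f.close()

assert np.allclose(weights, back)
```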
This is all very good info, especially the byteswap. I'll be testing it momentarily. As far as a detailed explanation of the problem....

In essence, I am applying sparse matrix multiplication. The matrix I am dealing with in the manner described is n x n. Generally, this matrix is 1-20% sparse. I use it in spatial data analysis, where the matrix W represents the spatial association between n observations. The operations I perform on it are generally related to the spatial lag of a variable, i.e. Wy, where y is an n x k matrix (usually k = 1). As k is generally small, the y vector and the result vector are represented by numpy arrays. I can have n x k x 2 pieces of info in memory (usually); what I can't have is n**2. So I store each row of W in a file as a record consisting of three parts:

1. row, nn (# of neighbors)
2. nhs, an (n x 1) vector of integers representing the columns where row[i] != 0
3. weights, an (n x 1) vector of floats corresponding to the indices in the previous entry

The first two parts of the record are known as a GAL, or geographic algorithm library. Since a lot of my W matrices have distance metrics associated with them, I added the third; someone else might term this an enhanced GAL. At any rate, this allows me to perform this operation on large datasets without running out of memory.
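A minimal sketch of a per-record writer/reader along those lines (all names are illustrative; it assumes each record is stored as row and nn as little-endian int32, followed by the nn int32 column indices and the nn float64 weights):

```python
import numpy as np

def write_record(f, row, cols, vals):
    # Header: row index and nn (# of neighbors), then the two vectors.
    np.array([row, len(cols)], dtype='<i4').tofile(f)
    np.asarray(cols, dtype='<i4').tofile(f)
    np.asarray(vals, dtype='<f8').tofile(f)

def read_record(f):
    # Pulls in exactly one record, so only one row of W is in memory.
    header = np.fromfile(f, dtype='<i4', count=2)
    if header.size < 2:
        return None  # end of file
    row, nn = int(header[0]), int(header[1])
    cols = np.fromfile(f, dtype='<i4', count=nn)
    vals = np.fromfile(f, dtype='<f8', count=nn)
    return row, cols, vals

def spatial_lag(fname, y):
    # Wy one sparse row at a time: (Wy)[i] = sum_j w_ij * y[j].
    Wy = np.zeros(len(y))
    f = open(fname, 'rb')
    while True:
        rec = read_record(f)
        if rec is None:
            break
        row, cols, vals = rec
        Wy[row] = np.dot(vals, y[cols])
    f.close()
    return Wy
```

The explicit "<" dtypes also keep the file readable on both Windows and Solaris, whatever the host byte order.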
Found a typo or two in my description: #2 and #3 are nn x 1 in shape.
participants (5)
- Charles R Harris
- Christopher Barker
- Mark Janikas
- Robert Kern
- Stefan van der Walt