Hi all,

I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:

In [1]: import numpy as np

In [2]: x = np.linspace(1, 10, 50000000)

In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s

I can inspect the files to verify that they contain the same data, and I can change the example, but this seems to always hold (I am running Arch Linux, but I've done the test on other machines too): for bigger arrays, the npz format seems to add an unbelievable amount of overhead.

Looking at Numpy's code, it looks like the real work is being done by Python's zipfile module, and I suspect that all the extra time is spent computing the crc32. Am I correct in my assumption (I am not familiar with zipfile's internals)? Or perhaps I am doing something really dumb and there is an easy way to speed things up?

Assuming that I am correct, my next question is: why compute the crc32 at all? I mean, I know that it is part of what defines a "zip file", but is it really necessary for a npz file to be a (compliant) zip file? If, for example, I open the resulting npz file with a hex editor and insert a bogus crc32, np.load will happily load the file anyway (Gnome's Archive Manager will do the same). To me this suggests that the fact that npz files are zip files is not that important.

Perhaps people think that the ability to browse arrays and extract individual ones like they would do with a regular zip file is really important, but reading the little documentation that I found, I got the impression that npz files are zip files just because this was the easiest way to have multiple arrays in the same file. But my main point is: it should be fairly simple to make npz files much more efficient with simple changes like not computing checksums (or using a different algorithm like adler32).

Let me know what you think about this. I've searched around the internet, and on places like Stackoverflow, it seems that the standard answer is: you are doing it wrong, forget Numpy's format and start using hdf5! Please do not give that answer. Like I said in the beginning, I am well aware of hdf5 and I use it in my "production code" (in C++). But I believe that there should be a lightweight alternative (right now, to use hdf5 I need to have installed the C library, the C++ wrappers, and the h5py library to play with the data using Python; that is a bit too heavy for my needs). I really like Numpy's format (if anything, it makes me feel better knowing that it is so easy to reverse engineer, while the hdf5 format is very complicated), but the (apparent) poor performance of npz files is a deal breaker.

Gilberto
Hi Gilberto,

* onefire <onefire.myself@gmail.com> [2014-04-16]:
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:
In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
If it is just serialization speed, you may want to look at Bloscpack:

https://github.com/Blosc/Bloscpack

which only has blosc/python-blosc and Numpy as dependencies. You can use it on Numpy arrays like so:

https://github.com/Blosc/Bloscpack#numpy

(that's instructions for master you are looking at). And it can certainly be faster than NPZ and sometimes faster than NPY -- depending of course on your system and the type of data -- and also more lightweight than HDF5. I wrote an article about it with some benchmarks, also vs NPY/NPZ, here:

https://github.com/euroscipy/euroscipy_proceedings/tree/master/papers/23_hae...

Since it is not yet officially published, you can find a compiled PDF draft I just made at:

http://fldmp.zetatech.org/haenel_bloscpack_euroscipy2013_ac25c19cb6.pdf

Perhaps it is interesting for you.
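For instance, a minimal round-trip along the lines of the linked instructions could look like this (a sketch, not authoritative: ``pack_ndarray_file``/``unpack_ndarray_file`` are the names I would expect in the version of that era, so double-check against the README for your install):

import numpy as np
import bloscpack as bp

x = np.linspace(1, 10, 50000000)

# compress and serialize the array to a Blosc-packed file ...
bp.pack_ndarray_file(x, 'x.blp')

# ... and read it back into a fresh array
y = bp.unpack_ndarray_file('x.blp')
assert (x == y).all()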
I can inspect the files to verify that they contain the same data, and I can change the example, but this seems to always hold (I am running Arch Linux, but I've done the test on other machines too): for bigger arrays, the npz format seems to add an unbelievable amount of overhead.
You mean time- or space-wise? In my experience NPZ is fairly slow but can yield some good compression ratios, depending on the LZ-complexity of the input data. In fact, AFAIK, NPZ uses the DEFLATE algorithm as implemented by ZLIB, which is fairly slow and not optimized for compression/decompression speed. FYI: if you really want ZLIB, Blosc also supports using it internally, which is nice.
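For reference, a quick sketch of the two npz entry points in NumPy itself (plain np.savez writes its members with ZIP_STORED, i.e. uncompressed; np.savez_compressed is the DEFLATE-via-zlib variant):

import numpy as np

x = np.linspace(1, 10, 50000000)

# ZIP_STORED: members are written without compression
np.savez("x_plain.npz", data=x)

# ZIP_DEFLATED: members are compressed with zlib's DEFLATE
np.savez_compressed("x_deflated.npz", data=x)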
Looking at Numpy's code, it looks like the real work is being done by Python's zipfile module, and I suspect that all the extra time is spent computing the crc32. Am I correct in my assumption (I am not familiar with zipfile's internals)? Or perhaps I am doing something really dumb and there is an easy way to speed things up?
I am guessing here, but a checksum *should* be fairly fast. I would guess it is at least in part due to use of DEFLATE.
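One way to test that guess is to time the checksum and the compression separately on the same buffer (a rough sketch; absolute numbers will of course vary by machine):

import time
import zlib

import numpy as np

x = np.linspace(1, 10, 50000000)
buf = x.tostring()  # ~400 MB of raw bytes

t0 = time.time()
zlib.crc32(buf)
print("crc32:   %.2f s" % (time.time() - t0))

t0 = time.time()
zlib.compress(buf)  # DEFLATE at the default level, as ZIP_DEFLATED uses
print("deflate: %.2f s" % (time.time() - t0))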
Assuming that I am correct, my next question is: why compute the crc32 at all? I mean, I know that it is part of what defines a "zip file", but is it really necessary for a npz file to be a (compliant) zip file? If, for example, I open the resulting npz file with a hex editor and insert a bogus crc32, np.load will happily load the file anyway (Gnome's Archive Manager will do the same). To me this suggests that the fact that npz files are zip files is not that important.
Well, the good news here is that Bloscpack supports adding checksums to secure the integrity of the compressed data. You can choose between many, including CRC32, ADLER32 and even sha512.
Perhaps people think that the ability to browse arrays and extract individual ones like they would do with a regular zip file is really important, but reading the little documentation that I found, I got the impression that npz files are zip files just because this was the easiest way to have multiple arrays in the same file. But my main point is: it should be fairly simple to make npz files much more efficient with simple changes like not computing checksums (or using a different algorithm like adler32).
Ah, so you want to store multiple arrays in a single file. I must disappoint you there, Bloscpack doesn't support that right now. Although it is in principle possible to achieve this.
Let me know what you think about this. I've searched around the internet, and on places like Stackoverflow, it seems that the standard answer is: you are doing it wrong, forget Numpy's format and start using hdf5! Please do not give that answer. Like I said in the beginning, I am well aware of hdf5 and I use it in my "production code" (in C++). But I believe that there should be a lightweight alternative (right now, to use hdf5 I need to have installed the C library, the C++ wrappers, and the h5py library to play with the data using Python; that is a bit too heavy for my needs). I really like Numpy's format (if anything, it makes me feel better knowing that it is so easy to reverse engineer, while the hdf5 format is very complicated), but the (apparent) poor performance of npz files is a deal breaker.
Well, I hope that Bloscpack is lightweight enough for you. As I said, the only dependency is blosc/python-blosc, which can be compiled using a C compiler (C++ if you want all the additional codecs) and the Python headers.

Hope it helps and let me know what you think!

V-
crc32 is extremely fast, and I think zip might use adler32 instead, which is even faster. OTOH compression is incredibly slow, unless you're using one of the 'just a little bit of compression' formats like blosc or lzo1. If your npz files are compressed then this is certainly the culprit.

The zip format supports storing files without compression. Maybe what you want is an option to use this with .npz?

-n

On 16 Apr 2014 20:26, "onefire" <onefire.myself@gmail.com> wrote:
Hi all,
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:

In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
I can inspect the files to verify that they contain the same data, and I can change the example, but this seems to always hold (I am running Arch Linux, but I've done the test on other machines too): for bigger arrays, the npz format seems to add an unbelievable amount of overhead.
Looking at Numpy's code, it looks like the real work is being done by Python's zipfile module, and I suspect that all the extra time is spent computing the crc32. Am I correct in my assumption (I am not familiar with zipfile's internals)? Or perhaps I am doing something really dumb and there is an easy way to speed things up?
Assuming that I am correct, my next question is: why compute the crc32 at all? I mean, I know that it is part of what defines a "zip file", but is it really necessary for a npz file to be a (compliant) zip file? If, for example, I open the resulting npz file with a hex editor and insert a bogus crc32, np.load will happily load the file anyway (Gnome's Archive Manager will do the same). To me this suggests that the fact that npz files are zip files is not that important.
Perhaps people think that the ability to browse arrays and extract individual ones like they would do with a regular zip file is really important, but reading the little documentation that I found, I got the impression that npz files are zip files just because this was the easiest way to have multiple arrays in the same file. But my main point is: it should be fairly simple to make npz files much more efficient with simple changes like not computing checksums (or using a different algorithm like adler32).
Let me know what you think about this. I've searched around the internet, and on places like Stackoverflow, it seems that the standard answer is: you are doing it wrong, forget Numpy's format and start using hdf5! Please do not give that answer. Like I said in the beginning, I am well aware of hdf5 and I use it in my "production code" (in C++). But I believe that there should be a lightweight alternative (right now, to use hdf5 I need to have installed the C library, the C++ wrappers, and the h5py library to play with the data using Python; that is a bit too heavy for my needs). I really like Numpy's format (if anything, it makes me feel better knowing that it is so easy to reverse engineer, while the hdf5 format is very complicated), but the (apparent) poor performance of npz files is a deal breaker.
Gilberto
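To make the zip-level point above concrete: storing a member uncompressed only adds the archive metadata (including the crc32) on top of the payload. A stdlib sketch, not NumPy's actual implementation:

import zipfile

# ZIP_STORED archives the member verbatim; zipfile still computes the
# crc32 of the data for the member's header.
with zipfile.ZipFile('arrays.zip', mode='w',
                     compression=zipfile.ZIP_STORED) as zipf:
    zipf.write('x.npy', arcname='data.npy')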
Valentin Haenel, Bloscpack definitely looks interesting but I need to take a careful look first. I will let you know if I like it. Thanks for the suggestion!

I think you and Nathaniel Smith misunderstood my questions (my fault, since I did not explain myself well!).

First, Numpy's savez will not do any compression by default. It will simply store the npy file normally. The documentation suggests so and I can open the resulting file to confirm it. Also, if you run the commands that I specified in my previous post, you can see that the resulting files have sizes 400000080 (x.npy) and 400000194 (x.npz). The npy header takes 80 bytes (it actually needs less than that, but it is padded to be divisible by 16). The npz file that saves the same array takes 114 extra bytes (for the zip file metadata), so the space overhead is pretty small.

What I cannot understand is why savez takes more than 10 times longer than saving the data to a npy file. The only reason that I could come up with was the computation of the crc32. BUT it might be more than this...

This afternoon I found out about this Julia package (https://github.com/fhs/NPZ.jl) to manipulate Numpy files. I did a few tests and it seems to work correctly. It becomes interesting when I do the npy-npz comparison using Julia. Here is the code that I used:

using NPZ

function write_npy(x)
    tic()
    npzwrite("data.npy", x)
    toc()
end

function write_npz(x)
    tic()
    npzwrite("data.npz", (ASCIIString => Any)["data" => x])
    toc()
end

x = linspace(1, 10, 50000000)
write_npy(x)  # this prints: elapsed time: 0.417742163 seconds
write_npz(x)  # this prints: elapsed time: 0.882226675 seconds

The Julia timings (tested with Julia 0.3) are closer to what I would expect. Notice that the time to save the npy file is very similar to the one that I got with Numpy's save function (see my previous post), but the "npz overhead" only adds half a second.

So now I think there are two things going on:

1) It is wasteful to compute the crc32. At a minimum I would like to either have the option to choose a different, faster checksum (like adler32) or to turn that off (I prefer the second option, because if I am worried about the integrity of the data, I will likely compute the sha512sum of the entire file anyway).

2) The Python implementation is inefficient (to be honest, I just found out about the Julia package and I cannot guarantee anything about its quality, but if I compute a crc32 from 0.5 GB of data from C code, it takes less than a second!). My guess is that the problem is in the zip module, but like I said before, I do not know the details of what it is doing.

Let me know what you think.

Gilberto

On Wed, Apr 16, 2014 at 5:03 PM, Nathaniel Smith <njs@pobox.com> wrote:
crc32 is extremely fast, and I think zip might use adler32 instead, which is even faster. OTOH compression is incredibly slow, unless you're using one of the 'just a little bit of compression' formats like blosc or lzo1. If your npz files are compressed then this is certainly the culprit.
The zip format supports storing files without compression. Maybe what you want is an option to use this with .npz?
-n

On 16 Apr 2014 20:26, "onefire" <onefire.myself@gmail.com> wrote:
Hi all,
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:

In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
I can inspect the files to verify that they contain the same data, and I can change the example, but this seems to always hold (I am running Arch Linux, but I've done the test on other machines too): for bigger arrays, the npz format seems to add an unbelievable amount of overhead.
Looking at Numpy's code, it looks like the real work is being done by Python's zipfile module, and I suspect that all the extra time is spent computing the crc32. Am I correct in my assumption (I am not familiar with zipfile's internals)? Or perhaps I am doing something really dumb and there is an easy way to speed things up?
Assuming that I am correct, my next question is: why compute the crc32 at all? I mean, I know that it is part of what defines a "zip file", but is it really necessary for a npz file to be a (compliant) zip file? If, for example, I open the resulting npz file with a hex editor and insert a bogus crc32, np.load will happily load the file anyway (Gnome's Archive Manager will do the same). To me this suggests that the fact that npz files are zip files is not that important.
Perhaps people think that the ability to browse arrays and extract individual ones like they would do with a regular zip file is really important, but reading the little documentation that I found, I got the impression that npz files are zip files just because this was the easiest way to have multiple arrays in the same file. But my main point is: it should be fairly simple to make npz files much more efficient with simple changes like not computing checksums (or using a different algorithm like adler32).
Let me know what you think about this. I've searched around the internet, and on places like Stackoverflow, it seems that the standard answer is: you are doing it wrong, forget Numpy's format and start using hdf5! Please do not give that answer. Like I said in the beginning, I am well aware of hdf5 and I use it in my "production code" (in C++). But I believe that there should be a lightweight alternative (right now, to use hdf5 I need to have installed the C library, the C++ wrappers, and the h5py library to play with the data using Python; that is a bit too heavy for my needs). I really like Numpy's format (if anything, it makes me feel better knowing that it is so easy to reverse engineer, while the hdf5 format is very complicated), but the (apparent) poor performance of npz files is a deal breaker.
Gilberto
On 17 Apr 2014 01:57, "onefire" <onefire.myself@gmail.com> wrote:
What I cannot understand is why savez takes more than 10 times longer than saving the data to a npy file. The only reason that I could come up with was the computation of the crc32.

We can all make guesses but the solution is just to profile it :-). %prun in ipython (and then if you need more granularity installing line_profiler is useful).

-n
Hi Nathaniel,

Thanks for the suggestion. I did profile the program before, just not using Python. But following your suggestion, I used %prun. Here's (part of) the output (when I use savez):

195503 function calls in 4.466 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    2.284    1.142    2.284    1.142 {method 'close' of '_io.BufferedWriter' objects}
        1    0.918    0.918    0.918    0.918 {built-in method remove}
    48841    0.568    0.000    0.568    0.000 {method 'write' of '_io.BufferedWriter' objects}
    48829    0.379    0.000    0.379    0.000 {built-in method crc32}
    48830    0.148    0.000    0.148    0.000 {method 'read' of '_io.BufferedReader' objects}
        1    0.090    0.090    0.993    0.993 zipfile.py:1315(write)
        1    0.072    0.072    0.072    0.072 {method 'tostring' of 'numpy.ndarray' objects}
    48848    0.005    0.000    0.005    0.000 {built-in method len}
        1    0.001    0.001    0.270    0.270 format.py:362(write_array)
        3    0.000    0.000    0.000    0.000 {built-in method open}
        1    0.000    0.000    4.466    4.466 npyio.py:560(_savez)
        2    0.000    0.000    0.000    0.000 zipfile.py:1459(close)
        1    0.000    0.000    4.466    4.466 {built-in method exec}

Here's the output when I use save to save to a npy file:

39 function calls in 0.266 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.196    0.049    0.196    0.049 {method 'write' of '_io.BufferedWriter' objects}
        1    0.069    0.069    0.069    0.069 {method 'tostring' of 'numpy.ndarray' objects}
        1    0.001    0.001    0.266    0.266 format.py:362(write_array)
        1    0.000    0.000    0.000    0.000 {built-in method open}
        1    0.000    0.000    0.266    0.266 npyio.py:406(save)
        1    0.000    0.000    0.000    0.000 format.py:261(write_array_header_1_0)
        1    0.000    0.000    0.000    0.000 {method 'close' of '_io.BufferedWriter' objects}
        1    0.000    0.000    0.266    0.266 {built-in method exec}
        1    0.000    0.000    0.000    0.000 format.py:154(magic)
        1    0.000    0.000    0.000    0.000 format.py:233(header_data_from_array_1_0)
        1    0.000    0.000    0.266    0.266 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:462(asanyarray)
        1    0.000    0.000    0.000    0.000 py3k.py:28(asbytes)

The calls to close and the built-in method remove seem to be responsible for the inefficiency of the Numpy implementation (compared to the Julia package that I mentioned before).

This was tested using Python 3.4 and Numpy 1.8.1. However, if I do the tests with Python 3.3.5 and Numpy 1.8.0, savez becomes much faster, so I think there is something wrong with the combination Python 3.4/Numpy 1.8.1.

Also, if I use Python 2.4 and Numpy 1.2 (from my school's cluster) I get that np.save takes about 3.5 seconds and np.savez takes about 7 seconds, so all these timings seem to be hugely dependent on the system/version (maybe this explains David Palao's results?). However, they all point out that a significant amount of time is spent computing the crc32. Notice that prun reports that it takes 0.379 seconds to compute the crc32 of an array that takes 0.2 seconds to save to a npy file. I believe this is too much! And it gets worse if you try to save bigger arrays.

On Thu, Apr 17, 2014 at 5:23 AM, Nathaniel Smith <njs@pobox.com> wrote:
On 17 Apr 2014 01:57, "onefire" <onefire.myself@gmail.com> wrote:
What I cannot understand is why savez takes more than 10 times longer than saving the data to a npy file. The only reason that I could come up with was the computation of the crc32.
We can all make guesses but the solution is just to profile it :-). %prun in ipython (and then if you need more granularity installing line_profiler is useful).
-n
On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage. It would be nice if we could add support for different compression modules like gzip or xz which allow streaming data directly into a file without an intermediate.
Hi,

* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.

best,

V-
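A minimal sketch of that workaround (Python 2.7, as in the thread; ``io.BytesIO`` would be the Python 3 counterpart, and the helper name here is made up for illustration):

import zipfile
from cStringIO import StringIO

import numpy as np
from numpy.lib import format

def savez_via_buffer(filename, **arrays):
    # Serialize each array into an in-memory buffer, then hand the
    # bytes to ZipFile.writestr -- no temporary file on disk.
    zipf = zipfile.ZipFile(filename, mode='w',
                           compression=zipfile.ZIP_STORED, allowZip64=True)
    try:
        for name, arr in arrays.items():
            sio = StringIO()
            format.write_array(sio, np.asanyarray(arr))
            zipf.writestr(name + '.npy', sio.getvalue())
    finally:
        zipf.close()

savez_via_buffer('x.npz', data=np.linspace(1, 10, 50000000))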
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
Hi,
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:

https://github.com/esc/numpy/compare/feature;npz_no_temp_file

Here are the timings, again using ``sync()`` from bloscpack (but it's just an ``os.system('sync')``, in case you want to run your own benchmarks):

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop

In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop

In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop

Not too bad, but still slower than plain NPY, memory copies would be my guess.

V-

PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
Hi,
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Here are the timings, again using ``sync()`` from bloscpack (but it's just an ``os.system('sync')``, in case you want to run your own benchmarks):
In [1]: import numpy as np
In [2]: import bloscpack.sysutil as bps
In [3]: x = np.linspace(1, 10, 50000000)
In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop

In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop

In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop
Not too bad, but still slower than plain NPY, memory copies would be my guess.
PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
Also, in case you were wondering, here is the profiler output:

In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)

943 function calls (917 primitive calls) in 1.139 seconds

   Ordered by: internal time
   List reduced from 99 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.386    0.386    0.386    0.386 {zlib.crc32}
        8    0.234    0.029    0.234    0.029 {method 'write' of 'file' objects}
       27    0.162    0.006    0.162    0.006 {method 'write' of 'cStringIO.StringO' objects}
        1    0.158    0.158    0.158    0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
        1    0.091    0.091    0.091    0.091 {method 'close' of 'file' objects}
       24    0.064    0.003    0.064    0.003 {method 'tobytes' of 'numpy.ndarray' objects}
        1    0.022    0.022    1.119    1.119 npyio.py:608(_savez_no_temp)
        1    0.019    0.019    1.139    1.139 <string>:1(<module>)
        1    0.002    0.002    0.227    0.227 format.py:362(write_array)
        1    0.001    0.001    0.001    0.001 zipfile.py:433(_GenerateCRCTable)

V-
Hi,

* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
Hi,
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Hi Nathaniel,
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Here are the timings, again using ``sync()`` from bloscpack (but it's just an ``os.system('sync')``, in case you want to run your own benchmarks):
In [1]: import numpy as np
In [2]: import bloscpack.sysutil as bps
In [3]: x = np.linspace(1, 10, 50000000)
In [4]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 1.93 s per loop

In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.88 s per loop

In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
1 loops, best of 3: 3.22 s per loop
Not too bad, but still slower than plain NPY, memory copies would be my guess.
PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
Also, in case you were wondering, here is the profiler output:
In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)

943 function calls (917 primitive calls) in 1.139 seconds

   Ordered by: internal time
   List reduced from 99 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.386    0.386    0.386    0.386 {zlib.crc32}
        8    0.234    0.029    0.234    0.029 {method 'write' of 'file' objects}
       27    0.162    0.006    0.162    0.006 {method 'write' of 'cStringIO.StringO' objects}
        1    0.158    0.158    0.158    0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
        1    0.091    0.091    0.091    0.091 {method 'close' of 'file' objects}
       24    0.064    0.003    0.064    0.003 {method 'tobytes' of 'numpy.ndarray' objects}
        1    0.022    0.022    1.119    1.119 npyio.py:608(_savez_no_temp)
        1    0.019    0.019    1.139    1.139 <string>:1(<module>)
        1    0.002    0.002    0.227    0.227 format.py:362(write_array)
        1    0.001    0.001    0.001    0.001 zipfile.py:433(_GenerateCRCTable)
And, to shed some more light on this, the kernprof (line-by-line) output (of a slightly modified version):

zsh» cat mp.py
import numpy as np
x = np.linspace(1, 10, 50000000)
np._savez_no_temp("x.npy", [x], {}, False)

zsh» ./kernprof.py -v -l mp.py
Wrote profile results to mp.py.lprof
Timer unit: 1e-06 s

File: numpy/lib/npyio.py
Function: _savez_no_temp at line 608
Total time: 1.16438 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   608                                           @profile
   609                                           def _savez_no_temp(file, args, kwds, compress):
   610                                               # Import is postponed to here since zipfile depends on gzip, an optional
   611                                               # component of the so-called standard library.
   612         1         5655   5655.0      0.5      import zipfile
   613
   614         1            6      6.0      0.0      from cStringIO import StringIO
   615
   616         1            2      2.0      0.0      if isinstance(file, basestring):
   617         1            2      2.0      0.0          if not file.endswith('.npz'):
   618         1            1      1.0      0.0              file = file + '.npz'
   619
   620         1            1      1.0      0.0      namedict = kwds
   621         2            4      2.0      0.0      for i, val in enumerate(args):
   622         1            6      6.0      0.0          key = 'arr_%d' % i
   623         1            1      1.0      0.0          if key in namedict.keys():
   624                                                       raise ValueError(
   625                                                           "Cannot use un-named variables and keyword %s" % key)
   626         1            1      1.0      0.0          namedict[key] = val
   627
   628         1            0      0.0      0.0      if compress:
   629                                                   compression = zipfile.ZIP_DEFLATED
   630                                               else:
   631         1            1      1.0      0.0          compression = zipfile.ZIP_STORED
   632
   633         1        42734  42734.0      3.7      zipf = zipfile_factory(file, mode="w", compression=compression)
   634                                               # reusable memory buffer
   635         1            5      5.0      0.0      sio = StringIO()
   636         2           10      5.0      0.0      for key, val in namedict.items():
   637         1            3      3.0      0.0          fname = key + '.npy'
   638         1            4      4.0      0.0          sio.seek(0)  # reset buffer
   639         1       219843 219843.0     18.9          format.write_array(sio, np.asanyarray(val))
   640         1       156962 156962.0     13.5          array_bytes = sio.getvalue(True)
   641         1       625162 625162.0     53.7          zipf.writestr(fname, array_bytes)
   642
   643         1       113977 113977.0      9.8      zipf.close()

So it would appear that >50% of the time is spent in zipfile.writestr.

V-
Hi,

* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc.) for inclusion?

V-
On 18.04.2014 18:29, Valentin Haenel wrote:
Hi,
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc..) for inclusion?
I wonder if it would be better to instead use a fifo to avoid the memory doubling. Windows probably hasn't got them (exposed via python) but one can slap a platform check in front.

Attached is a proof of concept without proper error handling (which is unfortunately the tricky part).
Sorry for the top-post, but should we add this as an issue on the github tracker? I'd like to revisit it this summer.

V-

* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-18]:
On 18.04.2014 18:29, Valentin Haenel wrote:
Hi,
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
Thanks for the suggestion. I did profile the program before, just not using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc..) for inclusion?
I wonder if it would be better to instead use a fifo to avoid the memory doubling. Windows probably hasn't got them (exposed via python) but one can slap a platform check in front. Attached is a proof of concept without proper error handling (which is unfortunately the tricky part).
From 472b4c0a44804b65d0774147010ec7a931a1c52d Mon Sep 17 00:00:00 2001
From: Julian Taylor <jtaylor.debian@googlemail.com>
Date: Thu, 17 Apr 2014 23:01:47 +0200
Subject: [PATCH] use a pipe for savez

---
 numpy/lib/npyio.py | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/numpy/lib/npyio.py b/numpy/lib/npyio.py
index 98b4b6e..baafa9d 100644
--- a/numpy/lib/npyio.py
+++ b/numpy/lib/npyio.py
@@ -585,22 +585,19 @@ def _savez(file, args, kwds, compress):
     zipf = zipfile_factory(file, mode="w", compression=compression)
 
     # Stage arrays in a temporary file on disk, before writing to zip.
-    fd, tmpfile = tempfile.mkstemp(suffix='-numpy.npy')
-    os.close(fd)
-    try:
+    import threading
+    with tempfile.TemporaryDirectory() as td:
+        fifoname = os.path.join(td, "fifo")
+        os.mkfifo(fifoname)
         for key, val in namedict.items():
             fname = key + '.npy'
-            fid = open(tmpfile, 'wb')
-            try:
-                format.write_array(fid, np.asanyarray(val))
-                fid.close()
-                fid = None
-                zipf.write(tmpfile, arcname=fname)
-            finally:
-                if fid:
-                    fid.close()
-    finally:
-        os.remove(tmpfile)
+            def mywrite(pipe, val):
+                with open(pipe, "wb") as wpipe:
+                    format.write_array(wpipe, np.asanyarray(val))
+            t = threading.Thread(target=mywrite, args=(fifoname, val))
+            t.start()
+            zipf.write(fifoname, arcname=fname)
+            t.join()
 
     zipf.close()
--
1.9.1
There is no os.mkfifo on Windows.

Sturla

Valentin Haenel <valentin@haenel.co> wrote:
sorry, for the top-post, but should we add this as an issue on the github tracker? I'd like to revisit it this summer.
V-
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-18]:
On 18.04.2014 18:29, Valentin Haenel wrote:
Hi,
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
* Julian Taylor <jtaylor.debian@googlemail.com> [2014-04-17]:
On 17.04.2014 21:30, onefire wrote:
> Thanks for the suggestion. I did profile the program before, just not
> using Python.
one problem of npz is that the zipfile module does not support streaming data in (or if it does now we aren't using it). So numpy writes the file uncompressed to disk and then zips it which is horrible for performance and disk usage.
As a workaround it may also be possible to write the temporary NPY files to cStringIO instances and then use ``ZipFile.writestr`` with the ``getvalue()`` of the cStringIO object. However, that approach may require some memory. In python 2.7, for each array: one copy inside the cStringIO instance and then another copy when calling getvalue on the cStringIO, I believe.
There is a proof-of-concept implementation here:
https://github.com/esc/numpy/compare/feature;npz_no_temp_file
Anybody interested in me fixing this up (unit tests, API, etc..) for inclusion?
I wonder if it would be better to instead use a fifo to avoid the memory doubling. Windows probably hasn't got them (exposed via python) but one can slap a platform check in front. Attached is a proof of concept without proper error handling (which is unfortunately the tricky part).
From 472b4c0a44804b65d0774147010ec7a931a1c52d Mon Sep 17 00:00:00 2001
From: Julian Taylor <jtaylor.debian@googlemail.com>
Date: Thu, 17 Apr 2014 23:01:47 +0200
Subject: [PATCH] use a pipe for savez

---
 numpy/lib/npyio.py | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/numpy/lib/npyio.py b/numpy/lib/npyio.py
index 98b4b6e..baafa9d 100644
--- a/numpy/lib/npyio.py
+++ b/numpy/lib/npyio.py
@@ -585,22 +585,19 @@ def _savez(file, args, kwds, compress):
     zipf = zipfile_factory(file, mode="w", compression=compression)
 
     # Stage arrays in a temporary file on disk, before writing to zip.
-    fd, tmpfile = tempfile.mkstemp(suffix='-numpy.npy')
-    os.close(fd)
-    try:
+    import threading
+    with tempfile.TemporaryDirectory() as td:
+        fifoname = os.path.join(td, "fifo")
+        os.mkfifo(fifoname)
         for key, val in namedict.items():
             fname = key + '.npy'
-            fid = open(tmpfile, 'wb')
-            try:
-                format.write_array(fid, np.asanyarray(val))
-                fid.close()
-                fid = None
-                zipf.write(tmpfile, arcname=fname)
-            finally:
-                if fid:
-                    fid.close()
-    finally:
-        os.remove(tmpfile)
+            def mywrite(pipe, val):
+                with open(pipe, "wb") as wpipe:
+                    format.write_array(wpipe, np.asanyarray(val))
+            t = threading.Thread(target=mywrite, args=(fifoname, val))
+            t.start()
+            zipf.write(fifoname, arcname=fname)
+            t.join()
 
     zipf.close()
--
1.9.1
2014-04-16 20:26 GMT+02:00 onefire <onefire.myself@gmail.com>:
Hi all,
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:

In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
Hi,

In my case (python-2.7.3, numpy-1.6.1):

In [23]: %time save("xx.npy", x)
CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
Wall time: 4.07 s

In [24]: %time savez("xx.npz", data = x)
CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
Wall time: 4.26 s

In my case I don't see the "unbelievable amount of overhead" of the npz thing.

Best
Hi again,

* David Palao <dpalao.python@gmail.com> [2014-04-17]:
2014-04-16 20:26 GMT+02:00 onefire <onefire.myself@gmail.com>:
Hi all,
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to go if one does not have a problem with the dependency).

I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:

In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms

In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
Hi,

In my case (python-2.7.3, numpy-1.6.1):

In [23]: %time save("xx.npy", x)
CPU times: user 0.00 s, sys: 0.23 s, total: 0.23 s
Wall time: 4.07 s

In [24]: %time savez("xx.npz", data = x)
CPU times: user 0.42 s, sys: 0.61 s, total: 1.02 s
Wall time: 4.26 s
In my case I don't see the "unbelievable amount of overhead" of the npz thing.
When profiling IO operations, there are many factors that can influence measurements. In my experience on Linux these may include: the filesystem cache, the cpu governor, the system load, the type of hard drive and how it is connected, any powersaving features (e.g. laptop-mode tools) and any cron-jobs that might be running (e.g. updating the locate DB).

So for example when measuring the time it takes to write something to disk on Linux, I always at least include a call to ``sync``, which will ensure that all kernel filesystem buffers will be written to disk. Even then, you may still have a lot of variability. As part of bloscpack.sysutil I have wrapped this to be available from Python (needs root though). So, to re-run the benchmarks, doing each one twice:

In [1]: import numpy as np

In [2]: import bloscpack.sysutil as bps

In [3]: x = np.linspace(1, 10, 50000000)

In [4]: %time np.save("x.npy", x)
CPU times: user 12 ms, sys: 356 ms, total: 368 ms
Wall time: 1.41 s

In [5]: %time np.save("x.npy", x)
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 811 ms

In [6]: %time np.savez("x.npz", data = x)
CPU times: user 540 ms, sys: 864 ms, total: 1.4 s
Wall time: 4.74 s

In [7]: %time np.savez("x.npz", data = x)
CPU times: user 580 ms, sys: 808 ms, total: 1.39 s
Wall time: 9.47 s

In [8]: bps.sync()

In [9]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 368 ms, total: 368 ms
Wall time: 2.2 s

In [10]: %time np.save("x.npy", x) ; bps.sync()
CPU times: user 0 ns, sys: 356 ms, total: 356 ms
Wall time: 2.16 s

In [11]: bps.sync()

In [12]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 564 ms, sys: 816 ms, total: 1.38 s
Wall time: 8.21 s

In [13]: %time np.savez("x.npz", x) ; bps.sync()
CPU times: user 588 ms, sys: 772 ms, total: 1.36 s
Wall time: 6.83 s

As you can see, even when using ``sync`` the values might vary, so in addition it might be worth using %timeit, which will at least run it three times and select the best one in its default setting:

In [14]: %timeit np.save("x.npy", x)
1 loops, best of 3: 2.4 s per loop

In [15]: %timeit np.savez("x.npz", x)
1 loops, best of 3: 7.1 s per loop

In [16]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 3.11 s per loop

In [17]: %timeit np.savez("x.npz", x) ; bps.sync()
1 loops, best of 3: 7.36 s per loop

So, anyway, given these readings, I would tend to support the claim that there is something slowing down writing when using plain NPZ w/o compression.

FYI: when reading, the kernel keeps files that were recently read in the filesystem buffers, and so when measuring reads, I tend to drop those caches using ``drop_caches()`` from bloscpack.sysutil (which wraps the linux proc fs).

best,

V-
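For reference, rough stand-ins for those two helpers, in case you don't want to pull in bloscpack just for benchmarking (a sketch; the real bloscpack.sysutil implementations may differ, and dropping caches needs root):

import os

def sync():
    # Flush dirty kernel filesystem buffers to disk (no root needed).
    os.system('sync')

def drop_caches():
    # Linux-specific: drop the page cache so subsequent reads actually
    # hit the disk again. Writing '3' to this proc file requires root.
    sync()
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')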
Hello,

* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
As part of bloscpack.sysutil I have wrapped this to be available from Python (needs root though). So, to re-run the benchmarks, doing each one twice:
Actually, I just realized that doing a ``sync`` doesn't require root.

my bad,

V-
Interesting! Using sync() as you suggested makes every write slower, and it decreases the time difference between save and savez, so maybe I was observing the 10 times difference because the file system buffers were being flushed immediately after a call to savez, but not right after a call to np.save.

I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. This would make it possible to, say, let the user choose the checksum algorithm or to turn that off. Or maybe the compression stuff makes this route too complicated to be worth the trouble? (after all, the zip format is not that hard to understand)

Gilberto

On Thu, Apr 17, 2014 at 6:45 PM, Valentin Haenel <valentin@haenel.co> wrote:
Hello,
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
As part of bloscpack.sysutil I have wrapped this to be available from Python (needs root though). So, to re-run the benchmarks, doing each one twice:
Actually, I just realized, that doing a ``sync`` doesn't require root.
my bad,
V-
I found this github issue (https://github.com/numpy/numpy/pull/3465) where someone mentions the idea of forking the zip library.

Gilberto

On Thu, Apr 17, 2014 at 8:09 PM, onefire <onefire.myself@gmail.com> wrote:
Interesting! Using sync() as you suggested makes every write slower, and it decreases the time difference between save and savez, so maybe I was observing the 10 times difference because the file system buffers were being flushed immediately after a call to savez, but not right after a call to np.save.
I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. This would make it possible to, say, let the user choose the checksum algorithm or to turn that off. Or maybe the compression stuff makes this route too complicated to be worth the trouble? (after all, the zip format is not that hard to understand)
Gilberto
On Thu, Apr 17, 2014 at 6:45 PM, Valentin Haenel <valentin@haenel.co> wrote:
Hello,
* Valentin Haenel <valentin@haenel.co> [2014-04-17]:
As part of bloscpack.sysutil I have wrapped this to be available from Python (needs root though). So, to re-run the benchmarks, doing each one twice:
Actually, I just realized, that doing a ``sync`` doesn't require root.
my bad,
V-
Hi Gilberto,

* onefire <onefire.myself@gmail.com> [2014-04-18]:
Interesting! Using sync() as you suggested makes every write slower, and it decreases the time difference between save and savez, so maybe I was observing the 10 times difference because the file system buffers were being flushed immediately after a call to savez, but not right after a call to np.save.
I am happy that you found my suggestion useful! Given that the current savez implementation first writes temporary arrays to disk and then copies them from their temporary location to the zipfile, one might argue that this is what causes the buffers to be flushed, since it does more IO than the save implementation. Then again, I don't really know the gory details of how the filesystem buffers behave and how they can be configured.

best,

V-
Hi again,

* onefire <onefire.myself@gmail.com> [2014-04-18]:
I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. This would make it possible to, say, let the user choose the checksum algorithm or to turn that off. Or maybe the compression stuff makes this route too complicated to be worth the trouble? (after all, the zip format is not that hard to understand)
Just to give you an idea of what my aforementioned Bloscpack library can do, in the case of linspace:

In [1]: import numpy as np

In [2]: import bloscpack as bp

In [3]: import bloscpack.sysutil as bps

In [4]: x = np.linspace(1, 10, 50000000)

In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop

In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop

In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop

In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop

In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root  49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy

However, this is a bit of a special case, since Blosc does extremely well -- both speed and size wise -- on the linspace data; your mileage may vary.

best,

V-
On 18/04/14 13:01, Valentin Haenel wrote:
Hi again,
* onefire <onefire.myself@gmail.com> [2014-04-18]:
I think your workaround might help, but a better solution would be to not use Python's zipfile module at all. This would make it possible to, say, let the user choose the checksum algorithm or to turn that off. Or maybe the compression stuff makes this route too complicated to be worth the trouble? (after all, the zip format is not that hard to understand)

Just to give you an idea of what my aforementioned Bloscpack library can do, in the case of linspace:
In [1]: import numpy as np
In [2]: import bloscpack as bp
In [3]: import bloscpack.sysutil as bps
In [4]: x = np.linspace(1, 10, 50000000)
In [5]: %timeit np.save("x.npy", x) ; bps.sync()
1 loops, best of 3: 2.12 s per loop

In [6]: %timeit bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
1 loops, best of 3: 627 ms per loop

In [7]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 1.92 s per loop

In [8]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x.blp') ; bps.sync()
3 loops, best of 3: 564 ms per loop

In [9]: ls -lah x.npy x.blp
-rw-r--r-- 1 root root  49M Apr 18 12:53 x.blp
-rw-r--r-- 1 root root 382M Apr 18 12:52 x.npy
However, this is a bit of a special case, since Blosc does extremely well -- both speed and size wise -- on the linspace data; your mileage may vary.
Exactly, and besides, Blosc can use different codecs inside it. Just for completeness, here is a small benchmark of what you can expect from them (my laptop does not have an SSD, so my figures are a bit slow compared with Valentin's):

In [50]: %timeit -n 3 -r 3 np.save("x.npy", x) ; bps.sync()
3 loops, best of 3: 5.7 s per loop

In [51]: cargs = bp.args.DEFAULT_BLOSC_ARGS

In [52]: cargs['cname'] = 'blosclz'

In [53]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-blosclz.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.12 s per loop

In [54]: cargs['cname'] = 'lz4'

In [55]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 985 ms per loop

In [56]: cargs['cname'] = 'lz4hc'

In [57]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-lz4hc.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.95 s per loop

In [58]: cargs['cname'] = 'snappy'

In [59]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-snappy.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 1.11 s per loop

In [60]: cargs['cname'] = 'zlib'

In [61]: %timeit -n 3 -r 3 bp.pack_ndarray_file(x, 'x-zlib.blp', blosc_args=cargs) ; bps.sync()
3 loops, best of 3: 3.12 s per loop

So all the codecs can make the storage go faster than a pure np.save(), most especially blosclz, lz4 and snappy. However, lz4hc and zlib achieve the best compression ratios:

In [62]: ls -lht x*.*
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:49 x-zlib.blp
-rw-r--r-- 1 faltet users  54M 18 abr 13:48 x-snappy.blp
-rw-r--r-- 1 faltet users 7,0M 18 abr 13:48 x-lz4hc.blp
-rw-r--r-- 1 faltet users  48M 18 abr 13:47 x-lz4.blp
-rw-r--r-- 1 faltet users  49M 18 abr 13:47 x-blosclz.blp
-rw-r--r-- 1 faltet users 382M 18 abr 13:42 x.npy

But again, we are talking about a specially nice compression case.

--
Francesc Alted
participants (7)

- David Palao
- Francesc Alted
- Julian Taylor
- Nathaniel Smith
- onefire
- Sturla Molden
- Valentin Haenel