Hi Gilberto,
* onefire
I have been playing with the idea of using Numpy's binary format as a lightweight alternative to HDF5 (which I believe is the "right" way to do it if one does not have a problem with the dependency).
I am pretty happy with the npy format, but the npz format seems to be broken as far as performance is concerned (or I am missing something obvious!). The following ipython session illustrates the issue:
In [1]: import numpy as np
In [2]: x = np.linspace(1, 10, 50000000)
In [3]: %time np.save("x.npy", x)
CPU times: user 40 ms, sys: 230 ms, total: 270 ms
Wall time: 488 ms
In [4]: %time np.savez("x.npz", data = x)
CPU times: user 657 ms, sys: 707 ms, total: 1.36 s
Wall time: 7.7 s
If it is just serialization speed you are after, you may want to look at Bloscpack:
https://github.com/Blosc/Bloscpack
It only has blosc/python-blosc and Numpy as dependencies. You can use it on Numpy arrays like so:
https://github.com/Blosc/Bloscpack#numpy
(Those are the instructions for master that you are looking at.) It can certainly be faster than NPZ and sometimes faster than NPY, depending of course on your system and the type of data, and it is also more lightweight than HDF5.
I wrote an article about it with some benchmarks, also vs NPY/NPZ, here:
https://github.com/euroscipy/euroscipy_proceedings/tree/master/papers/23_hae...
Since it is not yet officially published, you can find a compiled PDF draft I just made at:
http://fldmp.zetatech.org/haenel_bloscpack_euroscipy2013_ac25c19cb6.pdf
Perhaps it is interesting for you.
I can inspect the files to verify that they contain the same data, and I can change the example, but this seems to always hold (I am running Arch Linux, but I've done the test on other machines too): for bigger arrays, the npz format seems to add an unbelievable amount of overhead.
Do you mean time-wise or space-wise? In my experience NPZ is fairly slow but can yield some good compression ratios, depending on the LZ-complexity of the input data. In fact, AFAIK, NPZ uses the DEFLATE algorithm as implemented by ZLIB, which is fairly slow and not optimized for compression/decompression speed. FYI: if you really want ZLIB, Blosc also supports using it internally, which is nice.
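A quick way to check which compression method a given npz file actually uses is to open it with Python's zipfile module. The following is only a minimal sketch: the array size and the member name `data` are arbitrary choices, and an in-memory buffer stands in for a file on disk.

```python
import io
import zipfile

import numpy as np

# Small illustrative array; the size is an arbitrary choice.
x = np.linspace(1, 10, 100_000)

# Write the npz archive into memory instead of onto disk.
buf = io.BytesIO()
np.savez(buf, data=x)

# An npz file is a zip archive, so zipfile can list its members.
method_names = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflate"}
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    members = zf.infolist()
    for info in members:
        method = method_names.get(info.compress_type, str(info.compress_type))
        print(info.filename, method, info.file_size, info.compress_size, hex(info.CRC))
    # testzip() re-reads every member and checks its CRC32;
    # it returns None when all checksums match.
    bad_member = zf.testzip()

# Load the array back to confirm the round trip.
buf.seek(0)
with np.load(buf) as npz:
    loaded = npz["data"]
```

Replacing np.savez with np.savez_compressed in the sketch lets you compare what the two variants record for each member.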
Looking at Numpy's code, it looks like the real work is being done by Python's zipfile module, and I suspect that all the extra time is spent computing the crc32. Am I correct in my assumption (I am not familiar with zipfile's internals)? Or perhaps I am doing something really dumb and there is an easy way to speed things up?
I am guessing here, but a checksum *should* be fairly fast. I would guess it is at least in part due to use of DEFLATE.
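To test this guess rather than rely on it, one could time the checksum and the DEFLATE step separately on the raw bytes of an array. This is a rough sketch under stated assumptions: the helper `time_call` is hypothetical, the array is much smaller than the one in the session above, and zlib is called directly rather than through zipfile, so the numbers only indicate relative cost.

```python
import time
import zlib

import numpy as np


def time_call(func, *args):
    """Hypothetical helper: run func(*args), return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Smaller than the 50M-element array above, to keep the run short.
x = np.linspace(1, 10, 1_000_000)
raw = x.tobytes()

# CRC32 over the raw bytes, as a zip writer would compute it.
checksum, crc_seconds = time_call(zlib.crc32, raw)

# DEFLATE at zlib's default compression level.
compressed, deflate_seconds = time_call(zlib.compress, raw)

print(f"bytes:   {len(raw)}")
print(f"crc32:   {crc_seconds:.4f} s")
print(f"deflate: {deflate_seconds:.4f} s, ratio {len(raw) / len(compressed):.2f}")
```

Timings will vary by machine and by data, so a single run is an indication only.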
Assuming that I am correct, my next question is: why compute the crc32 at all? I mean, I know that it is part of what defines a "zip file", but is it really necessary for an npz file to be a (compliant) zip file? If, for example, I open the resulting npz file with a hex editor and insert a bogus crc32, np.load will happily load the file anyway (Gnome's Archive Manager will do the same). To me this suggests that the fact that npz files are zip files is not that important.
Well, the good news here is that Bloscpack supports adding checksums to secure the integrity of the compressed data. You can choose between many, including CRC32, ADLER32 and even sha512.
Perhaps people think that the ability to browse arrays and extract individual ones, like they would do with a regular zip file, is really important, but reading the little documentation that I found, I got the impression that npz files are zip files just because this was the easiest way to have multiple arrays in the same file. But my main point is: it should be fairly simple to make npz files much more efficient with simple changes like not computing checksums (or using a different algorithm like adler32).
Ah, so you want to store multiple arrays in a single file. I must disappoint you there, Bloscpack doesn't support that right now. Although it is in principle possible to achieve this.
Let me know what you think about this. I've searched around the internet, and on places like Stackoverflow, it seems that the standard answer is: you are doing it wrong, forget Numpy's format and start using hdf5! Please do not give that answer. Like I said in the beginning, I am well aware of hdf5 and I use it in my "production code" (in C++). But I believe that there should be a lightweight alternative (right now, to use hdf5 I need to have installed the C library, the C++ wrappers, and the h5py library to play with the data using Python, which is a bit too heavy for my needs). I really like Numpy's format (if anything, it makes me feel better knowing that it is so easy to reverse engineer it, while the hdf5 format is very complicated), but the (apparent) poor performance of npz files is a deal breaker.
Well, I hope that Bloscpack is lightweight enough for you. As I said, the only dependency is blosc/python-blosc, which can be compiled using a C compiler (C++ if you want all the additional codecs) and the Python headers. Hope it helps and let me know what you think!
V-