[Numpy-discussion] About the npz format

Valentin Haenel valentin at haenel.co
Thu Apr 17 17:18:09 EDT 2014


* Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> > Hi,
> > 
> > * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
> > > On 17.04.2014 21:30, onefire wrote:
> > > > Hi Nathaniel,
> > > > 
> > > > Thanks for the suggestion. I did profile the program before, just not
> > > > using Python.
> > > 
> > > one problem of npz is that the zipfile module does not support streaming
> > > data in (or if it does now we aren't using it).
> > > So numpy writes the file uncompressed to disk and then zips it which is
> > > horrible for performance and disk usage.
> > 
> > As a workaround, it may also be possible to write the temporary NPY files to
> > cStringIO instances and then use ``ZipFile.writestr`` with the
> > ``getvalue()`` of the cStringIO object. However, that approach may
> > require some memory. In Python 2.7, for each array: one copy inside the
> > cStringIO instance and then another copy when calling ``getvalue()`` on the
> > cStringIO object, I believe.
> 
> There is a proof-of-concept implementation here:
> 
> https://github.com/esc/numpy/compare/feature;npz_no_temp_file
> 
> Here are the timings, again using ``sync()`` from bloscpack (but it's
> just an ``os.system('sync')``, in case you want to run your own
> benchmarks):
> 
> In [1]: import numpy as np
> 
> In [2]: import bloscpack.sysutil as bps
> 
> In [3]: x = np.linspace(1, 10, 50000000)
> 
> In [4]: %timeit np.save("x.npy", x) ; bps.sync()
> 1 loops, best of 3: 1.93 s per loop
> 
> In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
> 1 loops, best of 3: 7.88 s per loop
> 
> In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
> 1 loops, best of 3: 3.22 s per loop
> 
> Not too bad, but still slower than plain NPY; memory copies would be my
> guess.

> PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master

Also, in case you were wondering, here is the profiler output:

In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)
         943 function calls (917 primitive calls) in 1.139 seconds

   Ordered by: internal time
   List reduced from 99 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.386    0.386    0.386    0.386 {zlib.crc32}
        8    0.234    0.029    0.234    0.029 {method 'write' of 'file' objects}
       27    0.162    0.006    0.162    0.006 {method 'write' of 'cStringIO.StringO' objects}
        1    0.158    0.158    0.158    0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
        1    0.091    0.091    0.091    0.091 {method 'close' of 'file' objects}
       24    0.064    0.003    0.064    0.003 {method 'tobytes' of 'numpy.ndarray' objects}
        1    0.022    0.022    1.119    1.119 npyio.py:608(_savez_no_temp)
        1    0.019    0.019    1.139    1.139 <string>:1(<module>)
        1    0.002    0.002    0.227    0.227 format.py:362(write_array)
        1    0.001    0.001    0.001    0.001 zipfile.py:433(_GenerateCRCTable)
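
For readers who want to try the idea without checking out the branch, here
is a minimal sketch of the in-memory approach discussed above. It is not the
code from the proof-of-concept branch: the function name ``savez_in_memory``
is hypothetical, and it assumes Python 3, so ``io.BytesIO`` stands in for
cStringIO. Each array is serialized to NPY in a buffer and then handed to
``ZipFile.writestr``, so no temporary file is written, at the cost of
holding the serialized array in memory.

```python
import io
import zipfile

import numpy as np


def savez_in_memory(file, arrays, compress=False):
    """Write a dict of arrays to an NPZ archive without temporary NPY files.

    Hypothetical helper for illustration only; `file` may be a path or a
    writable, seekable file-like object.
    """
    compression = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(file, mode="w", compression=compression,
                         allowZip64=True) as zf:
        for name, arr in arrays.items():
            # Serialize the array in NPY format into an in-memory buffer.
            buf = io.BytesIO()
            np.lib.format.write_array(buf, np.asanyarray(arr))
            # Hand the serialized bytes to the archive; getvalue() may
            # create a further copy of the data.
            zf.writestr(name + ".npy", buf.getvalue())
```

Usage would look like ``savez_in_memory("x.npz", {"arr_0": x})``, after which
``np.load("x.npz")["arr_0"]`` should give back the array. The CRC and the
buffer copies seen in the profile above would still be paid with this
sketch; it only avoids the round trip through a temporary file on disk.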

V-


