[Numpy-discussion] About the npz format
Valentin Haenel
valentin at haenel.co
Thu Apr 17 17:35:37 EDT 2014
Hi,
* Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> > * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> > > Hi,
> > >
> > > * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
> > > > On 17.04.2014 21:30, onefire wrote:
> > > > > Hi Nathaniel,
> > > > >
> > > > > Thanks for the suggestion. I did profile the program before, just not
> > > > > using Python.
> > > >
> > > > one problem of npz is that the zipfile module does not support streaming
> > > > data in (or if it does now we aren't using it).
> > > > So numpy writes the file uncompressed to disk and then zips it which is
> > > > horrible for performance and disk usage.
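[Editorial note: a minimal sketch of the two-step behaviour described above, not numpy's actual implementation; the function name `savez_via_tempfile` is made up for illustration. Written for Python 3.]

```python
import os
import tempfile
import zipfile

import numpy as np

def savez_via_tempfile(zip_path, **arrays):
    # Sketch of the two-step approach: each array is first written
    # uncompressed to a temporary .npy file on disk, then copied into
    # the zip archive -- so the data hits the disk twice.
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_STORED) as zipf:
        for name, arr in arrays.items():
            fd, tmp_path = tempfile.mkstemp(suffix=".npy")
            os.close(fd)
            try:
                np.save(tmp_path, arr)              # first write: temp file
                zipf.write(tmp_path, arcname=name + ".npy")  # second write: zip
            finally:
                os.remove(tmp_path)

savez_via_tempfile("x_temp.npz", x=np.linspace(1, 10, 1000))
```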
> > >
> > > As a workaround it may also be possible to write the temporary NPY
> > > files to cStringIO instances and then use ``ZipFile.writestr`` with the
> > > ``getvalue()`` of the cStringIO object. However, that approach may
> > > require extra memory. In Python 2.7, for each array: one copy inside
> > > the cStringIO instance and then another copy when calling getvalue()
> > > on it, I believe.
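[Editorial note: a sketch of the in-memory workaround described above, using Python 3's ``io.BytesIO`` in place of cStringIO; the function name `savez_in_memory` is made up for illustration.]

```python
import io
import zipfile

import numpy as np

def savez_in_memory(zip_path, **arrays):
    # Workaround sketch: serialize each array into an in-memory buffer
    # and hand the bytes to ZipFile.writestr, avoiding the temporary
    # .npy file on disk. Note getvalue() makes a second in-memory copy.
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_STORED) as zipf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            np.save(buf, arr)  # writes the NPY header + raw data to the buffer
            zipf.writestr(name + ".npy", buf.getvalue())

savez_in_memory("y_mem.npz", arr_0=np.arange(10))
```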
> >
> > There is a proof-of-concept implementation here:
> >
> > https://github.com/esc/numpy/compare/feature;npz_no_temp_file
> >
> > Here are the timings, again using ``sync()`` from bloscpack (but it's
> > just an ``os.system('sync')``, in case you want to run your own
> > benchmarks):
> >
> > In [1]: import numpy as np
> >
> > In [2]: import bloscpack.sysutil as bps
> >
> > In [3]: x = np.linspace(1, 10, 50000000)
> >
> > In [4]: %timeit np.save("x.npy", x) ; bps.sync()
> > 1 loops, best of 3: 1.93 s per loop
> >
> > In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
> > 1 loops, best of 3: 7.88 s per loop
> >
> > In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
> > 1 loops, best of 3: 3.22 s per loop
> >
> > Not too bad, but still slower than plain NPY; memory copies would be my
> > guess.
>
> > PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
>
> Also, in case you were wondering, here is the profiler output:
>
> In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)
> 943 function calls (917 primitive calls) in 1.139 seconds
>
> Ordered by: internal time
> List reduced from 99 to 10 due to restriction <10>
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 1 0.386 0.386 0.386 0.386 {zlib.crc32}
> 8 0.234 0.029 0.234 0.029 {method 'write' of 'file' objects}
> 27 0.162 0.006 0.162 0.006 {method 'write' of 'cStringIO.StringO' objects}
> 1 0.158 0.158 0.158 0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
> 1 0.091 0.091 0.091 0.091 {method 'close' of 'file' objects}
> 24 0.064 0.003 0.064 0.003 {method 'tobytes' of 'numpy.ndarray' objects}
> 1 0.022 0.022 1.119 1.119 npyio.py:608(_savez_no_temp)
> 1 0.019 0.019 1.139 1.139 <string>:1(<module>)
> 1 0.002 0.002 0.227 0.227 format.py:362(write_array)
> 1 0.001 0.001 0.001 0.001 zipfile.py:433(_GenerateCRCTable)
And, to shed some more light on this, here is the line-by-line profiler
(kernprof) output, for a slightly modified version:
zsh» cat mp.py
import numpy as np
x = np.linspace(1, 10, 50000000)
np._savez_no_temp("x.npy", [x], {}, False)
zsh» ./kernprof.py -v -l mp.py
Wrote profile results to mp.py.lprof
Timer unit: 1e-06 s
File: numpy/lib/npyio.py
Function: _savez_no_temp at line 608
Total time: 1.16438 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
608 @profile
609 def _savez_no_temp(file, args, kwds, compress):
610 # Import is postponed to here since zipfile depends on gzip, an optional
611 # component of the so-called standard library.
612 1 5655 5655.0 0.5 import zipfile
613
614 1 6 6.0 0.0 from cStringIO import StringIO
615
616 1 2 2.0 0.0 if isinstance(file, basestring):
617 1 2 2.0 0.0 if not file.endswith('.npz'):
618 1 1 1.0 0.0 file = file + '.npz'
619
620 1 1 1.0 0.0 namedict = kwds
621 2 4 2.0 0.0 for i, val in enumerate(args):
622 1 6 6.0 0.0 key = 'arr_%d' % i
623 1 1 1.0 0.0 if key in namedict.keys():
624 raise ValueError(
625 "Cannot use un-named variables and keyword %s" % key)
626 1 1 1.0 0.0 namedict[key] = val
627
628 1 0 0.0 0.0 if compress:
629 compression = zipfile.ZIP_DEFLATED
630 else:
631 1 1 1.0 0.0 compression = zipfile.ZIP_STORED
632
633 1 42734 42734.0 3.7 zipf = zipfile_factory(file, mode="w", compression=compression)
634 # reusable memory buffer
635 1 5 5.0 0.0 sio = StringIO()
636 2 10 5.0 0.0 for key, val in namedict.items():
637 1 3 3.0 0.0 fname = key + '.npy'
638 1 4 4.0 0.0 sio.seek(0) # reset buffer
639 1 219843 219843.0 18.9 format.write_array(sio, np.asanyarray(val))
640 1 156962 156962.0 13.5 array_bytes = sio.getvalue(True)
641 1 625162 625162.0 53.7 zipf.writestr(fname, array_bytes)
642
643 1 113977 113977.0 9.8 zipf.close()
So it would appear that >50% of the time is spent in
``zipfile.writestr``.
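[Editorial note: ``zipfile.writestr`` must compute a CRC-32 over the whole payload before storing it (visible as ``zlib.crc32`` at the top of the cProfile output above); a quick check of that cost in isolation, using a smaller array than the 50e6-element benchmark:]

```python
import time
import zlib

import numpy as np

# zipfile.writestr CRC-32s the entire payload before storing it;
# timing zlib.crc32 over the raw array bytes shows that fixed cost alone.
x = np.linspace(1, 10, 5000000)
payload = x.tobytes()

t0 = time.perf_counter()
checksum = zlib.crc32(payload)
elapsed = time.perf_counter() - t0
print("crc32 over %d MB took %.3f s" % (len(payload) // 2**20, elapsed))
```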
V-