[Numpy-discussion] About the npz format
Valentin Haenel
valentin at haenel.co
Thu Apr 17 17:35:37 EDT 2014
Hi,
* Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> > * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
> > > Hi,
> > >
> > > * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
> > > > On 17.04.2014 21:30, onefire wrote:
> > > > > Hi Nathaniel,
> > > > >
> > > > > Thanks for the suggestion. I did profile the program before, just not
> > > > > using Python.
> > > >
> > > > one problem of npz is that the zipfile module does not support streaming
> > > > data in (or if it does now we aren't using it).
> > > > So numpy writes the file uncompressed to disk and then zips it which is
> > > > horrible for performance and disk usage.
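[Editorial note: a minimal sketch of the two-step behaviour described above, not numpy's actual implementation; the function name `savez_via_tempfile` is made up for illustration. Written for Python 3.]

```python
import os
import tempfile
import zipfile

import numpy as np

def savez_via_tempfile(zip_path, **arrays):
    # Sketch of the two-step approach: each array is first written
    # uncompressed to a temporary .npy file on disk, then copied into
    # the zip archive -- so the data hits the disk twice.
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_STORED) as zipf:
        for name, arr in arrays.items():
            fd, tmp_path = tempfile.mkstemp(suffix=".npy")
            os.close(fd)
            try:
                np.save(tmp_path, arr)              # first write: temp file
                zipf.write(tmp_path, arcname=name + ".npy")  # second write: zip
            finally:
                os.remove(tmp_path)

savez_via_tempfile("x_temp.npz", x=np.linspace(1, 10, 1000))
```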
> > >
> > > As a workaround it may also be possible to write the temporary NPY
> > > files to cStringIO instances and then use ``ZipFile.writestr`` with the
> > > ``getvalue()`` of the cStringIO object. However, that approach may
> > > require extra memory. In Python 2.7, for each array: one copy inside
> > > the cStringIO instance and then another copy when calling getvalue()
> > > on it, I believe.
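[Editorial note: a sketch of the in-memory workaround described above, using Python 3's ``io.BytesIO`` in place of cStringIO; the function name `savez_in_memory` is made up for illustration.]

```python
import io
import zipfile

import numpy as np

def savez_in_memory(zip_path, **arrays):
    # Workaround sketch: serialize each array into an in-memory buffer
    # and hand the bytes to ZipFile.writestr, avoiding the temporary
    # .npy file on disk. Note getvalue() makes a second in-memory copy.
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_STORED) as zipf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            np.save(buf, arr)  # writes the NPY header + raw data to the buffer
            zipf.writestr(name + ".npy", buf.getvalue())

savez_in_memory("y_mem.npz", arr_0=np.arange(10))
```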
> >
> > There is a proof-of-concept implementation here:
> >
> > https://github.com/esc/numpy/compare/feature;npz_no_temp_file
> >
> > Here are the timings, again using ``sync()`` from bloscpack (but it's
> > just an ``os.system('sync')``, in case you want to run your own
> > benchmarks):
> >
> > In [1]: import numpy as np
> >
> > In [2]: import bloscpack.sysutil as bps
> >
> > In [3]: x = np.linspace(1, 10, 50000000)
> >
> > In [4]: %timeit np.save("x.npy", x) ; bps.sync()
> > 1 loops, best of 3: 1.93 s per loop
> >
> > In [5]: %timeit np.savez("x.npz", x) ; bps.sync()
> > 1 loops, best of 3: 7.88 s per loop
> >
> > In [6]: %timeit np._savez_no_temp("x.npy", [x], {}, False) ; bps.sync()
> > 1 loops, best of 3: 3.22 s per loop
> >
> > Not too bad, but still slower than plain NPY; memory copies would be my
> > guess.
>
> > PS: Running Python 2.7.6 :: Anaconda 1.9.2 (64-bit) and Numpy master
>
> Also, in case you were wondering, here is the profiler output:
>
> In [2]: %prun -l 10 np._savez_no_temp("x.npy", [x], {}, False)
> 943 function calls (917 primitive calls) in 1.139 seconds
>
> Ordered by: internal time
> List reduced from 99 to 10 due to restriction <10>
>
> ncalls tottime percall cumtime percall filename:lineno(function)
> 1 0.386 0.386 0.386 0.386 {zlib.crc32}
> 8 0.234 0.029 0.234 0.029 {method 'write' of 'file' objects}
> 27 0.162 0.006 0.162 0.006 {method 'write' of 'cStringIO.StringO' objects}
> 1 0.158 0.158 0.158 0.158 {method 'getvalue' of 'cStringIO.StringO' objects}
> 1 0.091 0.091 0.091 0.091 {method 'close' of 'file' objects}
> 24 0.064 0.003 0.064 0.003 {method 'tobytes' of 'numpy.ndarray' objects}
> 1 0.022 0.022 1.119 1.119 npyio.py:608(_savez_no_temp)
> 1 0.019 0.019 1.139 1.139 <string>:1(<module>)
> 1 0.002 0.002 0.227 0.227 format.py:362(write_array)
> 1 0.001 0.001 0.001 0.001 zipfile.py:433(_GenerateCRCTable)
And, to shed some more light on this, here is the line-by-line profiler
(kernprof) output, for a slightly modified version:
zsh» cat mp.py
import numpy as np
x = np.linspace(1, 10, 50000000)
np._savez_no_temp("x.npy", [x], {}, False)
zsh» ./kernprof.py -v -l mp.py
Wrote profile results to mp.py.lprof
Timer unit: 1e-06 s
File: numpy/lib/npyio.py
Function: _savez_no_temp at line 608
Total time: 1.16438 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
608 @profile
609 def _savez_no_temp(file, args, kwds, compress):
610 # Import is postponed to here since zipfile depends on gzip, an optional
611 # component of the so-called standard library.
612 1 5655 5655.0 0.5 import zipfile
613
614 1 6 6.0 0.0 from cStringIO import StringIO
615
616 1 2 2.0 0.0 if isinstance(file, basestring):
617 1 2 2.0 0.0 if not file.endswith('.npz'):
618 1 1 1.0 0.0 file = file + '.npz'
619
620 1 1 1.0 0.0 namedict = kwds
621 2 4 2.0 0.0 for i, val in enumerate(args):
622 1 6 6.0 0.0 key = 'arr_%d' % i
623 1 1 1.0 0.0 if key in namedict.keys():
624 raise ValueError(
625 "Cannot use un-named variables and keyword %s" % key)
626 1 1 1.0 0.0 namedict[key] = val
627
628 1 0 0.0 0.0 if compress:
629 compression = zipfile.ZIP_DEFLATED
630 else:
631 1 1 1.0 0.0 compression = zipfile.ZIP_STORED
632
633 1 42734 42734.0 3.7 zipf = zipfile_factory(file, mode="w", compression=compression)
634 # reusable memory buffer
635 1 5 5.0 0.0 sio = StringIO()
636 2 10 5.0 0.0 for key, val in namedict.items():
637 1 3 3.0 0.0 fname = key + '.npy'
638 1 4 4.0 0.0 sio.seek(0) # reset buffer
639 1 219843 219843.0 18.9 format.write_array(sio, np.asanyarray(val))
640 1 156962 156962.0 13.5 array_bytes = sio.getvalue(True)
641 1 625162 625162.0 53.7 zipf.writestr(fname, array_bytes)
642
643 1 113977 113977.0 9.8 zipf.close()
So it would appear that >50% of the time is spent in
``zipfile.writestr``.
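[Editorial note: ``zipfile.writestr`` must compute a CRC-32 over the whole payload before storing it (visible as ``zlib.crc32`` at the top of the cProfile output above); a quick check of that cost in isolation, using a smaller array than the 50e6-element benchmark:]

```python
import time
import zlib

import numpy as np

# zipfile.writestr CRC-32s the entire payload before storing it;
# timing zlib.crc32 over the raw array bytes shows that fixed cost alone.
x = np.linspace(1, 10, 5000000)
payload = x.tobytes()

t0 = time.perf_counter()
checksum = zlib.crc32(payload)
elapsed = time.perf_counter() - t0
print("crc32 over %d MB took %.3f s" % (len(payload) // 2**20, elapsed))
```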
V-