[Numpy-discussion] About the npz format

Julian Taylor jtaylor.debian at googlemail.com
Fri Apr 18 13:20:33 EDT 2014


On 18.04.2014 18:29, Valentin Haenel wrote:
> Hi,
> 
> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
>> * Valentin Haenel <valentin at haenel.co> [2014-04-17]:
>>> * Julian Taylor <jtaylor.debian at googlemail.com> [2014-04-17]:
>>>> On 17.04.2014 21:30, onefire wrote:
>>>>> Thanks for the suggestion. I did profile the program before, just not
>>>>> using Python.
>>>>
>>>> one problem of npz is that the zipfile module does not support streaming
>>>> data in (or if it does now we aren't using it).
>>>> So numpy writes the file uncompressed to disk and then zips it which is
>>>> horrible for performance and disk usage.
>>>
>>> As a workaround it may also be possible to write the temporary NPY files
>>> to cStringIO instances and then use ``ZipFile.writestr`` with the
>>> ``getvalue()`` of the cStringIO object. However, that approach may
>>> require some memory. In Python 2.7, for each array: one copy inside the
>>> cStringIO instance and then another copy when calling getvalue on the
>>> cStringIO, I believe.
>>
>> There is a proof-of-concept implementation here:
>>
>> https://github.com/esc/numpy/compare/feature;npz_no_temp_file
> 
> Anybody interested in me fixing this up (unit tests, API, etc..) for
> inclusion?
> 
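
For reference, the in-memory workaround quoted above can be sketched roughly
as follows. This is only an illustration, not the linked proof of concept:
``savez_in_memory`` is a hypothetical name, and ``io.BytesIO`` is used here
as a stand-in for cStringIO so the sketch runs on Python 3.

```python
import io
import zipfile

import numpy as np


def savez_in_memory(file, **arrays):
    """Hypothetical helper (not numpy API): build an .npz without temp files.

    Each array is serialized to an in-memory buffer first, so it is held in
    memory more than once while its zip member is being written.
    """
    with zipfile.ZipFile(file, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, arr in arrays.items():
            buf = io.BytesIO()
            # Write the NPY header and data into the buffer (first extra copy).
            np.lib.format.write_array(buf, np.asanyarray(arr))
            # getvalue() returns the contents as bytes (second extra copy).
            zf.writestr(name + ".npy", buf.getvalue())
```

A file written this way should be readable with ``np.load`` like any other
npz file; the cost is the extra memory per array mentioned above.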

I wonder if it would be better to use a fifo instead, to avoid the memory
doubling. Windows probably doesn't have them (exposed via Python), but one
can put a platform check in front.
Attached is a proof of concept without proper error handling (which is
unfortunately the tricky part).
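
A rough sketch of the pipe idea, for readers without the attachment. It is
not the attached patch: ``savez_via_pipe`` and ``_WriteOnly`` are
hypothetical names, an anonymous ``os.pipe`` with a producer thread stands in
for a fifo, and it assumes Python 3.6 or later, where ``ZipFile.open``
accepts ``mode="w"`` (the Python 2.7 zipfile discussed in this thread does
not). Error handling is minimal.

```python
import os
import shutil
import threading
import zipfile

import numpy as np


class _WriteOnly(object):
    """Expose only write(), so numpy uses its chunked write path rather
    than ndarray.tofile(), which may fail on a non-seekable pipe
    (an assumption about numpy internals)."""

    def __init__(self, raw):
        self._raw = raw

    def write(self, data):
        return self._raw.write(data)


def savez_via_pipe(file, **arrays):
    """Hypothetical helper (not numpy API): stream arrays into a zip."""
    with zipfile.ZipFile(file, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, arr in arrays.items():
            read_fd, write_fd = os.pipe()
            errors = []

            def produce(arr=arr, write_fd=write_fd, errors=errors):
                # Producer thread: serialize the array into the pipe.
                try:
                    with os.fdopen(write_fd, "wb") as sink:
                        np.lib.format.write_array(_WriteOnly(sink),
                                                  np.asanyarray(arr))
                except Exception as exc:
                    errors.append(exc)

            producer = threading.Thread(target=produce)
            producer.start()
            try:
                # Consumer: copy from the pipe into the zip member in
                # chunks, so only a small buffer is held at any time.
                with os.fdopen(read_fd, "rb") as source:
                    with zf.open(name + ".npy", mode="w",
                                 force_zip64=True) as member:
                        shutil.copyfileobj(source, member)
            finally:
                producer.join()
            if errors:
                # A failed producer leaves a truncated member behind.
                raise errors[0]
```

With a writable zip member available, the pipe is arguably redundant, since
``write_array`` could likely target the member directly; it is kept here only
to show the producer/consumer shape. Cleaning up a partially written archive
after a failure is left out.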
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-use-a-pipe-for-savez.patch
Type: text/x-diff
Size: 1652 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140418/0d79d36a/attachment.patch>

