On Fri, Feb 21, 2020 at 12:43 AM Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, Feb 20, 2020 at 02:19:13PM -0800, Stephan Hoyer wrote:
> > Strong +1 for an array.zeros() constructor, and/or a lower-level
> > array.empty() which doesn't pre-fill values.
>
> So it'd be a shorthand for something like this?
>
>     >>> array.array("i", bytes(64))
>     array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>
> It'd be convenient to specify the size as a number of array elements
> rather than bytes. But I'm not a heavy user of array.array() so I won't
> say either way as to whether this is needed.
Yes, exactly.
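For context, until such a constructor exists, a zero-filled array sized in elements rather than bytes can be spelled as a small wrapper. This is only a sketch; the name `zeros` here is hypothetical, mirroring the proposal:

```python
import array

def zeros(typecode, count):
    # Hypothetical helper mirroring the proposed array.zeros():
    # the size is given in elements, not bytes.
    itemsize = array.array(typecode).itemsize
    return array.array(typecode, bytes(count * itemsize))

a = zeros("i", 16)  # 16 zeroed ints, regardless of the platform's itemsize
```

Note that this wrapper still creates the intermediate bytes() object, so it doesn't address the double-allocation concern raised below.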
> > The main problem with array.array("i", bytes(64)) is that memory gets
> > allocated twice, first to create the bytes() object and then to make
> > the array(). This makes it unsuitable for high performance
> > applications.
>
> Got some actual measurements to demonstrate that initialising the array
> is a bottleneck? Especially for something as small as 64, it seems
> unlikely. If it were 64MB, that might be another story.
That's right, the real use-case is quickly deserializing large amounts of data (e.g., hundreds of MB) from a wire format into a form suitable for fast analysis with NumPy or pandas. Unfortunately I can't share an actual code example, but this is a pretty common scenario in the data-processing world, reminiscent of the use-cases for PEP 574 (https://www.python.org/dev/peps/pep-0574/). The concern is not just speed (which I agree is probably not hurt much by an extra copy) but also memory overhead: if the resulting array is 500 MB and deserialization can be done in a streaming fashion, I don't want to wastefully allocate another 500 MB just to do a memory copy.
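The streaming pattern described here can be done today with a single up-front allocation by reading into a preallocated bytearray and taking a typed memoryview over it, rather than going through an intermediate bytes() object. A minimal sketch (the helper name `read_ints` is hypothetical, and it assumes a binary file-like object supporting readinto()):

```python
import io
import struct

def read_ints(f, count, itemsize=4):
    # Hypothetical helper: stream `count` native-order 4-byte ints from a
    # binary file-like object, allocating the destination buffer only once.
    buf = bytearray(count * itemsize)   # the single allocation, up front
    view = memoryview(buf)
    pos = 0
    while pos < len(buf):
        n = f.readinto(view[pos:])      # fills buf in place, no extra copy
        if not n:
            raise EOFError("truncated stream")
        pos += n
    return view.cast("i")               # zero-copy typed view over buf

data = struct.pack("=4i", 1, 2, 3, 4)   # native byte order, matching cast("i")
ints = read_ints(io.BytesIO(data), 4)
```

The resulting memoryview can be handed to NumPy (e.g. numpy.frombuffer) without copying, though converting it to an array.array via frombytes() would still copy, which is exactly the gap an array.zeros()/array.empty() constructor would close.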