Specify number of items to allocate for array.array() constructor
At the moment, the array module of the standard library allows one to create arrays of different numeric types and to initialize them from an iterable (eg, another array). What's missing is the possibility to specify the final size of the array (number of items), especially for large arrays.

I'm thinking of suffix arrays (a text indexing data structure) for large texts, eg the human genome and its reverse complement (about 6 billion characters from the alphabet ACGT). The suffix array is a long int array of the same size (8 bytes per number, so it occupies about 48 GB memory). At the moment I am extending an array in chunks of several million items at a time, which is slow and not elegant. The function below also initializes each item in the array to a given value (0 by default).

Is there a reason why the array.array constructor does not allow one to simply specify the number of items that should be allocated? (I do not really care about the contents.) Would this be a worthwhile addition to / modification of the array module?

My suggestion is to modify array generation in such a way that you could pass an iterable (as now) as the second argument, but if you pass a single integer value, it should be treated as the number of items to allocate.

Here is my current workaround (which is slow):

def filled_array(typecode, n, value=0, bsize=(1<<22)):
    """Returns a new array with the given typecode
    (eg, "l" for long int, as in the array module)
    with n entries, initialized to the given value (default 0).
    """
    a = array.array(typecode, [value]*bsize)
    x = array.array(typecode)
    r = n
    while r >= bsize:
        x.extend(a)
        r -= bsize
    x.extend([value]*r)
    return x
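For what it's worth, the repetition operator on arrays already gives a one-allocation fill. A minimal sketch of that idiom as a faster workaround (this reuses the filled_array name from above for illustration; it is not a proposed API):

```python
import array

def filled_array(typecode, n, value=0):
    # Build a one-element array and repeat it n times; the result
    # is allocated in a single step rather than grown in chunks.
    return array.array(typecode, [value]) * n

a = filled_array("l", 1000)        # 1000 zero-valued long ints
b = filled_array("i", 5, value=7)  # 5 ints, all set to 7
```

Whether this beats the chunked extend() approach for multi-gigabyte arrays would need measuring, but it avoids the intermediate list and the repeated reallocation.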
On Wed, Jul 20, 2011 at 10:48:23PM +0200, Sven Rahmann wrote:
I'm thinking of suffix arrays (a text indexing data structure) for large texts, eg the human genome and its reverse complement (about 6 billion characters from the alphabet ACGT). The suffix array is a long int array of the same size (8 bytes per number, so it occupies about 48 GB memory).
I doubt array.array was designed to handle data of such size. Why not use bsddb or such?

Oleg.
--
Oleg Broytman  http://phdru.name/  phd@phdru.name
Programmers don't die, they just GOSUB without RETURN.
Hello,
Is there a reason why the array.array constructor does not allow one to simply specify the number of items that should be allocated? (I do not really care about the contents.) Would this be a worthwhile addition to / modification of the array module?
I think it would be. Rather than an additional constructor parameter, perhaps it could be a separate method (like we already have extend(), fromfile(), etc.). In the meantime, on 3.x you should be able to use the following trick, at least under modern Linux / Unix systems:
>>> f = open("/dev/zero", "rb")
>>> # I'm putting 2GB for the sake of example. Since the mmap is
>>> # created over /dev/zero, kernel optimizations should avoid any
>>> # physical RAM consumption (virtual memory aka address space
>>> # will still be used, of course).
>>> m = mmap.mmap(f.fileno(), 2*1024*1024*1024, access=mmap.ACCESS_READ)
>>> a = array.array("i")
>>> a.frombytes(m)
>>> len(a)
536870912
>>> a[0]
0
>>> a[-1]
0
>>> m.close(); f.close()   # This releases any virtual memory used by the mmap object
Regards Antoine.
On Jul 20, 2011, at 3:02 PM, Antoine Pitrou wrote:
Hello,
Is there a reason why the array.array constructor does not allow one to simply specify the number of items that should be allocated? (I do not really care about the contents.) Would this be a worthwhile addition to / modification of the array module?
I think it would be. Rather than an additional constructor parameter, perhaps it could be a separate method (like we already have extend(), fromfile(), etc.).
+1, Jython's had a custom function for this forever, originally from its old jarray module: jarray.zeros(length, typecode). I think its swapped arg order is due to jarray probably predating the array module.

--
Philip Jenvey
Philip Jenvey <pjenvey@...> writes:
+1, Jython's had a custom function for this forever, originally from its old jarray module: jarray.zeros(length, typecode). I think its swapped arg order is due to jarray probably predating the array module.
-- Philip Jenvey
array.zeroes(typecode, length) seems like a good API to me. It's also consistent with the name for the numpy method which does the same. I'll go ahead and file a bug for this and try to work up a patch. Alex
Am 22.07.2011 20:27, schrieb Alex Gaynor:
Philip Jenvey <pjenvey@...> writes:
+1, Jython's had a custom function for this forever, originally from its old jarray module: jarray.zeros(length, typecode). I think its swapped arg order is due to jarray probably predating the array module.
-- Philip Jenvey
array.zeroes(typecode, length) seems like a good API to me. It's also consistent with the name for the numpy method which does the same.
Would have to be array.zeros() then :) Georg
I suggest array.array(typecode, [0]) * N. It would be cool if it were a param to the constructor though (generating a zeroed array, I suppose), as Pythons whose GCs allocate pre-zeroed memory could optimize it nicely ;) Alex
I discovered that same trick. It would be nice to have that specifically indicated in the documentation until/unless a length argument is added to the constructor. It would be nice for the supported operators to be documented at all, actually. I didn't realize that array.array had multiplication operator support at all until I got around to dir-ing an instance.
On Thu, Feb 20, 2020 at 6:00 AM Steve Jorgensen <stevecjor@gmail.com> wrote:
It would be nice for the supported operators to be documented at all, actually.
It is a mutable sequence; is that not clear from the docs?

I don't think it should have a length argument, but maybe a "filled_with" constructor, like numpy's ones(), zeros(), and empty().

-CHB

I didn't realize that array.array had multiplication operator support at all until I got around to dir-ing an instance.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/P74LYD...
Code of Conduct: http://python.org/psf/codeofconduct/
--
Christopher Barker, PhD

Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
20.07.11 23:48, Sven Rahmann пише:
At the moment, the array module of the standard library allows one to create arrays of different numeric types and to initialize them from an iterable (eg, another array). What's missing is the possibility to specify the final size of the array (number of items), especially for large arrays.
array.array(typecode, [fillvalue]) * n
On Thu, Feb 20, 2020 at 7:34 AM Serhiy Storchaka <storchaka@gmail.com> wrote:
20.07.11 23:48, Sven Rahmann пише:
What's missing is the possibility to specify the final size of the array (number of items), especially for large arrays.
array.array(typecode, [fillvalue]) * n
Perhaps the OP wanted the internal array size initialized, but not used. Currently the internal array will automatically be reallocated to grow as needed, which could be a performance hit if you know it's going to grow large. But frankly, it would be a rare case where this would be noticeable. -CHB
Christopher Barker wrote: ...
Perhaps the OP wanted the internal array size initialized, but not used. Currently the internal array will automatically be reallocated to grow as needed. Which could be a performance hit if you know it’s going to grow large. But frankly, it would be a rare case where this would be noticeable. -CHB
Maybe uncommon, but I don't know about rare. Let's say you want to perform list-wise computations, making new lists with results of operations on existing lists (similar to numpy, but maybe trying to do something numpy is unsuitable for)? You would want to pre-allocate the new array to the size of the operand arrays.
On Thu, Feb 20, 2020 at 12:41 PM Steve Jorgensen <stevej@stevej.name> wrote:
Christopher Barker wrote: ...
Perhaps the OP wanted the internal array size initialized, but not used. Currently the internal array will automatically be reallocated to grow as needed. Which could be a performance hit if you know it’s going to grow large. But frankly, it would be a rare case where this would be noticeable. -CHB
Maybe uncommon, but I don't know about rare. Let's say you want to perform list-wise computations, making new lists with results of operations on existing lists (similar to numpy, but maybe trying to do something numpy is unsuitable for)? You would want to pre-allocate the new array to the size of the operand arrays.
Strong +1 for an array.zeros() constructor, and/or a lower level array.empty() which doesn't pre-fill values. A use case that came up for me recently is efficiently allocating and filling an object that satisfies the buffer protocol from C/C++ without requiring a NumPy dependency. As far as I can tell, there is no easy way to do this currently.
On Fri, Feb 21, 2020 at 8:52 AM Stephan Hoyer <shoyer@gmail.com> wrote:
On Thu, Feb 20, 2020 at 12:41 PM Steve Jorgensen <stevej@stevej.name> wrote:
Christopher Barker wrote: ...
Perhaps the OP wanted the internal array size initialized, but not used. Currently the internal array will automatically be reallocated to grow as needed. Which could be a performance hit if you know it’s going to grow large. But frankly, it would be a rare case where this would be noticeable. -CHB
Maybe uncommon, but I don't know about rare. Let's say you want to perform list-wise computations, making new lists with results of operations on existing lists (similar to numpy, but maybe trying to do something numpy is unsuitable for)? You would want to pre-allocate the new array to the size of the operand arrays.
Strong +1 for an array.zeros() constructor, and/or a lower level array.empty() which doesn't pre-fill values.
So it'd be a shorthand for something like this?
>>> array.array("i", bytes(64))
array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It'd be convenient to specify the size as a number of array elements rather than bytes. But I'm not a heavy user of array.array() so I won't say either way as to whether this is needed. ChrisA
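Sizing in elements rather than bytes is straightforward if the shorthand scales by itemsize. A hypothetical helper along those lines (the name zeros and its signature are assumptions here, not an existing array-module API):

```python
import array

def zeros(typecode, n):
    # Probe the per-element size for this typecode, then hand the
    # constructor a zero-filled buffer of exactly n elements' worth
    # of bytes; the array interprets it as raw machine values.
    itemsize = array.array(typecode).itemsize
    return array.array(typecode, bytes(n * itemsize))

a = zeros("i", 16)   # 16 elements, regardless of the platform's int size
```

A built-in version could of course skip the intermediate bytes object entirely.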
On Thu, Feb 20, 2020 at 2:11 PM Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Feb 21, 2020 at 8:52 AM Stephan Hoyer <shoyer@gmail.com> wrote:
On Thu, Feb 20, 2020 at 12:41 PM Steve Jorgensen <stevej@stevej.name> wrote:

Christopher Barker wrote: ...

Perhaps the OP wanted the internal array size initialized, but not used. Currently the internal array will automatically be reallocated to grow as needed. Which could be a performance hit if you know it's going to grow large. But frankly, it would be a rare case where this would be noticeable. -CHB

Maybe uncommon, but I don't know about rare. Let's say you want to perform list-wise computations, making new lists with results of operations on existing lists (similar to numpy, but maybe trying to do something numpy is unsuitable for)? You would want to pre-allocate the new array to the size of the operand arrays.
Strong +1 for an array.zeros() constructor, and/or a lower level array.empty() which doesn't pre-fill values.
So it'd be a shorthand for something like this?
>>> array.array("i", bytes(64))
array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It'd be convenient to specify the size as a number of array elements rather than bytes. But I'm not a heavy user of array.array() so I won't say either way as to whether this is needed.
Yes, exactly. The main problem with array.array("i", bytes(64)) is that memory gets allocated twice, first to create the bytes() object and then to make the array(). This makes it unsuitable for high performance applications.
On Thu, Feb 20, 2020 at 02:19:13PM -0800, Stephan Hoyer wrote:
Strong +1 for an array.zeros() constructor, and/or a lower level array.empty() which doesn't pre-fill values.
So it'd be a shorthand for something like this?
>>> array.array("i", bytes(64))
array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It'd be convenient to specify the size as a number of array elements rather than bytes. But I'm not a heavy user of array.array() so I won't say either way as to whether this is needed.
Yes, exactly.
The main problem with array.array("i", bytes(64)) is that memory gets allocated twice, first to create the bytes() object and then to make the array(). This makes it unsuitable for high performance applications.
Got some actual measurements to demonstrate that initialising the array is a bottleneck? Especially for something as small as 64, it seems unlikely. If it were 64MB, that might be another story.

What's wrong with `array.array("i", [0])*64` or equivalent?

On my machine, at least, constructing a bytes object first followed by an array is significantly faster than the alternative:

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', bytes(500000))"
100 loops, best of 5: 1.71 msec per loop

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', [0])*500000"
50 loops, best of 5: 7.48 msec per loop

That surprises me and I cannot explain it.

--
Steven
On Feb 21, 2020, at 00:46, Steven D'Aprano <steve@pearwood.info> wrote:
Got some actual measurements to demonstrate that initialising the array is a bottleneck? Especially for something as small as 64, it seems unlikely. If it were 64MB, that might be another story.
What's wrong with `array.array("i", [0])*64` or equivalent?
On my machine, at least, constructing a bytes object first followed by an array is significantly faster than the alternative:
[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', bytes(500000))"
100 loops, best of 5: 1.71 msec per loop

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', [0])*500000"
50 loops, best of 5: 7.48 msec per loop
That surprises me and I cannot explain it.
Without reading the code, I can guess. The first one does two 500K allocations and a 500K memcpy; the second only does one 500K allocation but does 150K separate 4-byte copies, and the added cost of that loop and of not moving as many bytes at a time as possible is higher than the savings of a 500K allocation.
21.02.20 10:36, Steven D'Aprano пише:
On my machine, at least, constructing a bytes object first followed by an array is significantly faster than the alternative:
[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', bytes(500000))"
100 loops, best of 5: 1.71 msec per loop

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', [0])*500000"
50 loops, best of 5: 7.48 msec per loop
That surprises me and I cannot explain it.
The second one allocates and copies 4 times more memory.
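To make the size difference concrete: the bytes initializer is consumed as a raw buffer, not as a sequence of elements, so (assuming the common 4-byte 'i'; itemsize is platform-dependent) bytes(500000) yields only 125,000 ints, while the multiplication builds 500,000:

```python
import array

a = array.array('i', bytes(500000))   # bytes consumed as a raw buffer
b = array.array('i', [0]) * 500000    # repetition counts elements

# a holds 500000 // a.itemsize elements (125000 with 4-byte ints);
# b holds 500000 elements, i.e. roughly 4x the memory of a.
```

So Steven's two timeit commands were not doing the same amount of work.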
On Fri, Feb 21, 2020 at 02:42:16PM +0200, Serhiy Storchaka wrote:
21.02.20 10:36, Steven D'Aprano пише:
On my machine, at least, constructing a bytes object first followed by an array is significantly faster than the alternative:
[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', bytes(500000))"
100 loops, best of 5: 1.71 msec per loop

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', [0])*500000"
50 loops, best of 5: 7.48 msec per loop
That surprises me and I cannot explain it.
The second one allocates and copies 4 times more memory.
I completely misunderstood what the first would do. I expected it to create an array of 500,000 zeroes, but it only created an array of 125,000.

That's nuts! The docstring says:

    Return a new array whose items are restricted by typecode, and
    initialized from the optional initializer value, which must be a
    list, string or iterable over elements of the appropriate type.

A bytes object is an iterable over integers, so I expected these two to be equivalent:

    array('i', bytes(500000))
    array('i', [0]*500000)

I never would have predicted that a bytes iterable and a list iterable behave differently. Oh, you have to read the documentation on the website:

https://docs.python.org/3/library/array.html#array.array

Okay, let's try that again:

[steve@ando cpython]$ ./python -m timeit -s "from array import array" "array('i', bytes(500000*4))"
20 loops, best of 5: 12.6 msec per loop

compared to 7.65 milliseconds for the version using multiplication. That makes more sense to me.

Okay, I'm starting to come around to giving array an alternate constructor:

    array.zeroes(typecode, size [, *, value=None])

If the keyword-only argument value is given and is not None, it is used as the initial value instead of zero.

--
Steven
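A pure-Python sketch of the semantics of that proposed constructor (the API is hypothetical; a real implementation would allocate once in C rather than going through a one-element array):

```python
import array

def zeroes(typecode, size, *, value=None):
    # Keyword-only value overrides the default fill of zero.
    fill = 0 if value is None else value
    return array.array(typecode, [fill]) * size

a = zeroes('i', 4)             # four zero ints
b = zeroes('d', 3, value=1.5)  # three doubles, all 1.5
```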
On Fri, Feb 21, 2020 at 12:43 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Thu, Feb 20, 2020 at 02:19:13PM -0800, Stephan Hoyer wrote:
Strong +1 for an array.zeros() constructor, and/or a lower level array.empty() which doesn't pre-fill values.
So it'd be a shorthand for something like this?
>>> array.array("i", bytes(64))
array('i', [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It'd be convenient to specify the size as a number of array elements rather than bytes. But I'm not a heavy user of array.array() so I won't say either way as to whether this is needed.
Yes, exactly.
The main problem with array.array("i", bytes(64)) is that memory gets allocated twice, first to create the bytes() object and then to make the array(). This makes it unsuitable for high performance applications.
Got some actual measurements to demonstrate that initialising the array is a bottleneck? Especially for something as small as 64, it seems unlikely. If it were 64MB, that might be another story.
That's right, the real use-case is quickly deserializing large amounts of data (e.g., 100s of MB) from a wire format into a form suitable for fast analysis with NumPy or pandas. Unfortunately I can't share an actual code example, but this is a pretty common scenario in the data processing world, e.g., reminiscent of the use-cases for PEP 574 ( https://www.python.org/dev/peps/pep-0574/). The concern is not just speed (which I agree is probably not impacted too poorly by an extra copy) but also memory overhead. If the resulting array is 500 MB and deserialization can be done in a streaming fashion, I don't want to wastefully allocate another 500 MB just to do a memory copy.
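One workaround available today for that streaming case is to preallocate a writable buffer once and fill a typed view over it, so no second copy is needed during deserialization. A sketch using memoryview.cast (sizes are illustrative only, and it assumes a 4-byte native 'i'):

```python
import array

n = 1000
buf = bytearray(n * 4)            # one allocation for n 4-byte ints
view = memoryview(buf).cast('i')  # typed view over the same memory, no copy

# Fill incrementally, e.g. as chunks arrive off the wire.
for i in range(n):
    view[i] = i

# The view itself satisfies the buffer protocol; if an array.array is
# ultimately required, this final conversion is the only extra copy:
a = array.array('i', buf)
```

For NumPy/pandas consumers the memoryview (or the bytearray) can usually be handed over directly via numpy.frombuffer, avoiding even that last copy.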
On Thu, Feb 20, 2020 at 12:43 PM Steve Jorgensen <stevej@stevej.name> wrote:
But frankly, it would be a rare case where this would be noticeable. -CHB
Maybe uncommon, but I don't know about rare. Let's say you want to perform list-wise computations, making new lists with results of operations on existing lists (similar to numpy, but maybe trying to do something numpy is unsuitable for)? You would want to pre-allocate the new array to the size of the operand arrays.
Not rare that you'd have a use case, but rare that the performance would be an issue. In past experiments, I've found the array re-allocation scheme is remarkably performant. On the other hand, all the methods suggested in this thread require at least a double allocation, which may not be noticeable in many applications, but it's also a fairly light lift to make a single constructor for a pre-allocated array. And as Stephan pointed out, it would help in some high performance situations. One thing to keep in mind is that array.array is useful for use from C/Cython, when you don't want the overhead of numpy. -CHB
participants (14)
- Alex Gaynor
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- Christopher Barker
- Georg Brandl
- Oleg Broytman
- Philip Jenvey
- Serhiy Storchaka
- Stephan Hoyer
- Steve Jorgensen
- Steve Jorgensen
- Steven D'Aprano
- Sven Rahmann