FeatureRequest: support for array construction from iterators
[This request/discussion refers to numpy issue #5863: https://github.com/numpy/numpy/pull/5863#issuecomment159738368]

Dear all,

As far as I can tell, the expected behaviour of np.array(...) would be that of np.array(list(...)), or something even nicer. I would therefore like to request generator/iterator support for np.array(...), to the extent that list(...) supports it.

A more detailed reasoning follows. In general it seems possible to identify iterators/generators as needed for this purpose:

- someone has already implemented this feature (see #5863 <https://github.com/numpy/numpy/pull/5863#issuecomment159738368>)
- there are ``types.GeneratorType`` and ``collections.abc.Iterator`` for an ``isinstance(...)`` check
- numpy can already distinguish them from all the other types that translate well into a numpy array

Given this, I think the general argument goes roughly as follows:

PROS (affects maybe 10% of numpy users or more):

- more intuitive overall behaviour: array(...) = array(list(...)), roughly
- python3 compatibility (see e.g. #5951 <https://github.com/numpy/numpy/issues/5951>)
- compatibility with the analogous ``__builtin__`` functions (see e.g. #5756 <https://github.com/numpy/numpy/issues/5756>)
- all of the above make numpy easier to use in an interactive style (e.g. ipython pylab), where computation time matters less than coding time

CONS (affects less than 0.1% of numpy users, I would guess):

- might break existing code

In total, at least for me at this stage, this speaks in favour of merging the already existing feature branch (see #5863 <https://github.com/numpy/numpy/pull/5863#issuecomment159738368>), or something similar, into numpy master.

Discussion, please!

cheers,
Stephan
numpy.fromiter is not numpy.array, and it does not behave like numpy.array(list(...)) either, because the dtype argument is mandatory.

Is there a reason why np.array(...) should not work on iterators? I have the feeling that such requests get (repeatedly) dismissed, but so far I haven't found a compelling argument for leaving this feature out (remember, it is already implemented in a branch).

Please let me know if you know of such an argument.

best,
Stephan

On 27 November 2015 at 14:18, Alan G Isaac <alan.isaac@gmail.com> wrote:
Constructing an array from an iterator is fundamentally different from constructing an array from an in-memory data structure like a list, because in the iterator case it's necessary to either use a single-pass algorithm or else create extra temporary buffers that cause much higher memory overhead. (Which is undesirable given that iterators are mostly used exactly in the case where one wants to reduce memory overhead.)

np.fromiter requires the dtype= argument because this is necessary if you want to construct the array in a single pass.

np.array(list(iter)) can avoid the dtype argument, because it creates that large memory buffer. IMO this is better than making np.array(iter) internally call list(iter) or equivalent, because the workaround (adding an explicit call to list()) is trivial, while also making it obvious to the user what the actual cost of their request is. (Explicit is better than implicit.)

In addition, the proposed API has a number of infelicities:

- We're generally trying to *reduce* the magic in functions like np.array (e.g. the discussions of having less magic for lists with mismatched numbers of elements, or non-list sequences)
- There's a strong convention in Python that when making a function like np.array generic, it should accept any iter*able* rather than any iter*ator*. But it would be super confusing if np.array({1: 2}) returned array([1]), or if array("foo") returned array(["f", "o", "o"]), so we don't actually want to handle all iterables the same. It's somewhat dubious even for iterators (e.g. someone might want to create an object array containing an iterator...)...

hope that helps,
-n

On Fri, Dec 11, 2015 at 2:27 PM, Stephan Sahm <Stephan.Sahm@gmx.de> wrote:
Yeah but that's not the only option:

    from itertools import chain

    def fromiter_awesome_edition(iterable):
        elem = next(iterable)
        dtype = whatever_numpy_does_to_infer_dtypes_from_lists(elem)
        return np.fromiter(chain([elem], iterable), dtype=dtype)

I think this would be a huge win for usability. Always getting tripped up by the dtype requirement. I can submit a PR if people like this pattern.

btw, I think np.array(['f', 'o', 'o']) would be exactly the expected result for np.array('foo'), but I guess that's just me.

Juan.

On Sat, Dec 12, 2015 at 10:12 AM, Nathaniel Smith <njs@pobox.com> wrote:
> from itertools import chain
>
> def fromiter_awesome_edition(iterable):
>     elem = next(iterable)
>     dtype = whatever_numpy_does_to_infer_dtypes_from_lists(elem)
>     return np.fromiter(chain([elem], iterable), dtype=dtype)
>
> I think this would be a huge win for usability. Always getting tripped up by the dtype requirement. I can submit a PR if people like this pattern.
This isn't the semantics of np.array, though -- np.array will look at the whole input and try to find a common dtype, so this can't be the implementation for np.array(iter). E.g. try np.array([1, 1.0])

I can see an argument for making the dtype= argument to fromiter optional, with a warning in the docs that it will guess based on the first element and that you should specify it if you don't want that. It seems potentially a bit error-prone (in the sense that it might make it easier to end up with code that works great when you test it but then breaks later when something unexpected happens), but maybe the usability outweighs that. I don't use fromiter myself so I don't have a strong opinion.
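The difference between the two behaviours can be sketched as follows. This is only an illustration: the helper name `fromiter_first_elem` is hypothetical and not part of NumPy, it assumes its argument is an iterator, and `np.asarray(first).dtype` is an assumed stand-in for whatever NumPy does to infer a dtype from a single element.

```python
from itertools import chain

import numpy as np


def fromiter_first_elem(iterator):
    """Hypothetical helper: guess the dtype from the first element only."""
    first = next(iterator)
    # Assumption: use the dtype NumPy would pick for this one value.
    dtype = np.asarray(first).dtype
    # Put the consumed element back in front and build the array in one pass.
    return np.fromiter(chain([first], iterator), dtype=dtype)


# np.array inspects the whole input and settles on a common dtype.
whole = np.array([1, 1.5])
print(whole.dtype)

# Guessing from the first element alone yields an integer dtype here.
guessed = fromiter_first_elem(iter([1, 2, 3]))
print(guessed.dtype)
```

With an input such as `[1, 1.5]`, the first-element guess would still be an integer dtype, so the later float could not be stored faithfully. That is the kind of surprise the explicit dtype= argument avoids.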
> btw, I think np.array(['f', 'o', 'o']) would be exactly the expected result for np.array('foo'), but I guess that's just me.
In general np.array(thing_that_can_go_inside_an_array) returns a zero-dimensional (scalar) array -- np.array(1), np.array(True), etc. all work like this, so I'd expect np.array("foo") to do the same.

-n

--
Nathaniel J. Smith -- http://vorpus.org
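A short sketch of the zero-dimensional behaviour described above, using only np.array itself. The variable names are illustrative.

```python
import numpy as np

scalar_int = np.array(1)
scalar_bool = np.array(True)
scalar_str = np.array("foo")

# Each of these is a zero-dimensional array: its shape is the empty tuple.
print(scalar_int.shape, scalar_bool.shape, scalar_str.shape)

# Splitting a string into characters has to be requested explicitly.
chars = np.array(list("foo"))
print(chars.shape)
```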
Hey Nathaniel,

Fascinating! Thanks for the primer! I didn't know that it would check dtype of values in the whole array. In that case, I would agree that it would be bad to infer it magically from just the first value, and this can be left to the users.

Thanks!

Juan.

On Sat, Dec 12, 2015 at 7:00 PM, Nathaniel Smith <njs@pobox.com> wrote:
Devil's advocate here: np.array() has become the de facto "constructor" for numpy arrays. Right now, passing it a generator results in what, IMHO, is a useless result:
    >>> np.array((i for i in range(10)))
    array(<generator object <genexpr> at 0x7f28b2beca00>, dtype=object)
Passing pretty much any dtype argument will cause that to fail:
    >>> np.array((i for i in range(10)), dtype=np.int_)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: long() argument must be a string or a number, not 'generator'
Therefore, I think it is not out of the realm of reason that passing a generator object and a dtype could then delegate the work under the hood to np.fromiter()? I would even go so far as to raise an error if one passes a generator without specifying dtype to np.array(). The point is to reduce the number of entry points for creating numpy arrays. By the way, any reason why this works?
    >>> np.array(xrange(10))
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
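One way to picture the delegation suggested above is a thin wrapper around the two existing entry points. This is a sketch only: `array_or_fromiter` is a hypothetical name, not NumPy API, and detecting generators via `types.GeneratorType` is just one assumed way to do the check.

```python
import types

import numpy as np


def array_or_fromiter(obj, dtype=None):
    """Hypothetical wrapper: send generators to np.fromiter, the rest to np.array."""
    if isinstance(obj, types.GeneratorType):
        if dtype is None:
            # Mirrors the suggestion to refuse generators without a dtype.
            raise TypeError("dtype is required when passing a generator")
        return np.fromiter(obj, dtype=dtype)
    return np.array(obj, dtype=dtype)


from_gen = array_or_fromiter((i for i in range(10)), dtype=np.int_)
from_list = array_or_fromiter([1, 2, 3])
print(from_gen)
print(from_list)
```

The wrapper keeps the cost explicit: a generator is consumed in a single pass only when the caller has named the dtype.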
> >>> np.array(xrange(10))
> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
It's not a generator. It's a true sequence that just happens to have a special implementation rather than being a generic container.
    >>> len(xrange(10))
    10
    >>> xrange(10)[5]
    5
--
Robert Kern
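The same distinction can be checked directly. The sketch below assumes Python 3, where `range` plays the role that `xrange` plays in the Python 2 session above. The variable names are illustrative.

```python
from collections.abc import Iterator, Sequence

import numpy as np

r = range(10)                 # Python 3 counterpart of Python 2's xrange
g = (i for i in range(10))    # a real generator, for comparison

# range supports len() and indexing, so it is a Sequence and not an Iterator.
print(isinstance(r, Sequence), isinstance(r, Iterator))

# A generator is an Iterator and not a Sequence.
print(isinstance(g, Sequence), isinstance(g, Iterator))

# np.array can therefore walk the range like any other sequence.
arr = np.array(r)
print(arr)
```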
Heh, never noticed that. Was it implemented more like a generator/iterator in older versions of Python?

Thanks,
Ben Root

On Mon, Dec 14, 2015 at 12:38 PM, Robert Kern <robert.kern@gmail.com> wrote:
> generator/iterator in older versions of Python?

No, it predates generators and iterators so it has always had to be implemented like that.

--
Robert Kern
I would like to push Benjamin Root's suggestion further:

"Therefore, I think it is not out of the realm of reason that passing a generator object and a dtype could then delegate the work under the hood to np.fromiter()? I would even go so far as to raise an error if one passes a generator without specifying dtype to np.array(). The point is to reduce the number of entry points for creating numpy arrays."

Would this be OK?

On Mon, Dec 14, 2015 at 6:50 PM Robert Kern <robert.kern@gmail.com> wrote:
Just to keep this from disappearing into a black hole: what about integrating fromiter into array? (See the post by Benjamin Root.)

For me personally, taking the first element to deduce the dtype would be a perfect default way to read generators. If one wants a different dtype, one could specify it as in the current fromiter method.

On 15 December 2015 at 08:08, Stephan Sahm <Stephan.Sahm@gmx.de> wrote:
Actually, while working on https://github.com/numpy/numpy/issues/7264 I realized that the memory efficiency (one-pass) argument is simply incorrect:

    import numpy as np

    class A:
        def __getitem__(self, i):
            print("A get item", i)
            return [np.int8(1), np.int8(2)][i]

        def __len__(self):
            return 2

    print(repr(np.array(A())))

This prints out

    A get item 0
    A get item 1
    A get item 2
    A get item 0
    A get item 1
    A get item 2
    A get item 0
    A get item 1
    A get item 2
    array([1, 2], dtype=int8)

i.e. the sequence is "turned into a concrete sequence" no less than 3 times.

Antony

2016-01-19 11:33 GMT-08:00 Stephan Sahm <Stephan.Sahm@gmx.de>:
