Suggestion: special-case np.array(range(...)) to be faster

Compare (on Python3 -- for Python2, read "xrange" instead of "range"):

    In [2]: %timeit np.array(range(1000000), np.int64)
    10 loops, best of 3: 156 ms per loop

    In [3]: %timeit np.arange(1000000, dtype=np.int64)
    1000 loops, best of 3: 853 µs per loop

Note that while iterating over a range is not very fast, it is still much better than the array creation:

    In [4]: from collections import deque

    In [5]: %timeit deque(range(1000000), 1)
    10 loops, best of 3: 25.5 ms per loop

On one hand, special cases are awful. On the other hand, the range builtin is probably important enough to deserve a special case to make this construction faster. Or not? I initially opened this as https://github.com/numpy/numpy/issues/7233 but it was suggested there that this should be discussed on the ML first. (The real issue which prompted this suggestion: I was building sparse matrices using scipy.sparse.csc_matrix with some indices specified using range, and that construction step turned out to take a significant portion of the time because of the calls to np.array.) Antony
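[Editor's note: for concreteness, here is a minimal pure-Python sketch of the kind of special case being proposed; the wrapper name array_with_range_fast_path is hypothetical, and an actual change would live in C inside np.array itself.]

    import numpy as np

    def array_with_range_fast_path(obj, dtype=None):
        # Hypothetical wrapper illustrating the proposal: a range object
        # carries start/stop/step, so it can be handed to np.arange,
        # which fills the buffer in C instead of iterating over Python ints.
        if isinstance(obj, range):
            return np.arange(obj.start, obj.stop, obj.step, dtype=dtype)
        return np.array(obj, dtype=dtype)

    # array_with_range_fast_path(range(1000000), np.int64) then runs at
    # np.arange speed rather than generic-sequence speed.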

On Sat, Feb 13, 2016 at 8:57 PM, Antony Lee <antony.lee@berkeley.edu> wrote:
IMO: I don't see a reason why this should be supported. There is np.arange after all for this use case, and np.fromiter. range and the other guys are iterators, and in several cases we can use larange = list(range(...)) as a shortcut to get a python list, for python 2/3 compatibility. I think this might be partially a learning effect in the python 2 to 3 transition. After using almost only python 3 for maybe a year, I don't think it's difficult to remember the differences when writing code that is py 2.7 and py 3.x compatible. It's just **another** thing to watch out for if milliseconds matter in your application. Josef

re: no reason why... This has nothing to do with Python2/Python3 (I personally stopped using Python2 at least 3 years ago.) Let me put it this way instead: if Python3's "range" (or Python2's "xrange") was not a builtin type but a type provided by numpy, I don't think it would be controversial at all to provide an `__array__` special method to efficiently convert it to an ndarray. It would be the same if `np.array` used a `functools.singledispatch` dispatcher rather than an `__array__` special method (which is obviously not possible for chronological reasons). re: iterable vs iterator: check for the presence of the __next__ special method (i.e. isinstance(x, Iterator), vs. isinstance(x, Iterable) and not isinstance(x, Iterator) for an iterable that is not an iterator). Antony 2016-02-13 18:48 GMT-08:00 <josef.pktd@gmail.com>:
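[Editor's note: a short sketch of that distinction using collections.abc; every Iterator is also an Iterable, so the order of the tests matters.]

    from collections.abc import Iterable, Iterator

    def classify(x):
        # Iterators define __next__ and are consumed by iteration.
        if isinstance(x, Iterator):
            return "iterator (single pass)"
        # Iterables without __next__ (range, list, ...) can be
        # traversed repeatedly.
        if isinstance(x, Iterable):
            return "reusable iterable"
        return "not iterable"

    print(classify(iter(range(3))))  # iterator (single pass)
    print(classify(range(3)))        # reusable iterable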

On Sun, Feb 14, 2016 at 3:21 AM, Antony Lee <antony.lee@berkeley.edu> wrote:
But numpy does provide arange. What's the reason to not use np.arange and use an iterator instead?
AFAIR and from spot checking the mailing list, in the past the argument was that it's too complicated to mix array/asarray creation with fromiter building of arrays. (I have no idea if array could cheaply delegate to fromiter.) Josef

On Sun, Feb 14, 2016 at 9:21 AM, Antony Lee <antony.lee@berkeley.edu> wrote:
I think it's good to do something about this, but it's not clear what the exact proposal is. I could imagine one or both of:

- special-case the range() object in array (and asarray/asanyarray?) such that array(range(N)) becomes as fast as arange(N).
- special-case all iterators, such that array(range(N)) becomes as fast as deque(range(N)).

Or yet something else? Ralf

On Sun, Feb 14, 2016 at 7:36 AM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
I think the latter wouldn't help much, as numpy would still need to determine dimensions and type. I assume that is one of the reasons sparse itself doesn't do that. Chuck

On Sun, Feb 14, 2016 at 10:36 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Not orders of magnitude, but this shows that there's something to optimize for iterators:

    In [1]: %timeit np.array(range(100000))
    100 loops, best of 3: 14.9 ms per loop

    In [2]: %timeit np.array(list(range(100000)))
    100 loops, best of 3: 9.68 ms per loop

Ralf

I wonder whether numpy is using the "old" iteration protocol (repeatedly calling x[i] for increasing i until IndexError is raised)? A quick timing shows that it is indeed slower. ... actually it's not even clear to me what qualifies as a sequence for `np.array`:

    class C:
        def __iter__(self):
            return iter(range(10))  # [0 ... 9] under the new iteration protocol
        def __getitem__(self, i):
            raise IndexError  # [] under the old iteration protocol

    np.array(C())
    ===> array(<__main__.C object at 0x7f3f21ffff28>, dtype=object)

So how can np.array(range(...)) even work? 2016-02-14 22:21 GMT-08:00 Ralf Gommers <ralf.gommers@gmail.com>:
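[Editor's note: for reference, a small sketch of the two iteration protocols with plain iter(); the old protocol only kicks in when __iter__ is absent.]

    class OldStyle:
        def __getitem__(self, i):
            # Old protocol: iteration calls self[0], self[1], ...
            # and stops at the first IndexError.
            if i < 3:
                return i
            raise IndexError

    class NewStyle:
        def __iter__(self):
            # New protocol: __iter__ takes precedence over __getitem__.
            return iter(range(3))

    print(list(OldStyle()))  # [0, 1, 2]
    print(list(NewStyle()))  # [0, 1, 2]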

On So, 2016-02-14 at 23:41 -0800, Antony Lee wrote:
Numpy currently uses PySequence_Fast, but it has to do a two-pass algorithm (finding dtype + dims), and the range gets converted to a list twice by these calls. That explains the speed advantage of converting to a list manually. - Sebastian
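[Editor's note: a rough pure-Python emulation of that two-pass structure; the real logic is in C and handles nesting and promotion, with np.result_type standing in here for the actual dtype discovery.]

    import numpy as np

    def two_pass_build(obj):
        # Pass 1: materialize the input to discover length and dtype.
        items = list(obj)  # a range gets converted to a list here
        dtype = np.result_type(*items) if items else np.float64
        # Pass 2: allocate the result, then walk the items again to fill it.
        out = np.empty(len(items), dtype=dtype)
        for i, v in enumerate(items):
            out[i] = v
        return out

    print(two_pass_build(range(5)))  # [0 1 2 3 4]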

On Sun, Feb 14, 2016 at 11:41 PM, Antony Lee <antony.lee@berkeley.edu> wrote:
Yeah, I'm pretty sure that np.array doesn't know anything about "iterable", just about "sequence" (calling x[i] for 0 <= i < x.__len__()). (See Sequence vs Iterable: https://docs.python.org/3/library/collections.abc.html) Personally I'd like it if we could eventually make it so np.array specifically looks for lists and only lists, because the way it has so many different fallbacks right now creates confusion about which objects are meant as elements. Compare:

    In [5]: np.array([(1, 2), (3, 4)]).shape
    Out[5]: (2, 2)

    In [6]: np.array([(1, 2), (3, 4)], dtype="i4,i4").shape
    Out[6]: (2,)

-n -- Nathaniel J. Smith -- https://vorpus.org

Indeed:

    In [1]: class C:
       ...:     def __getitem__(self, i):
       ...:         if i < 10:
       ...:             return i
       ...:         else:
       ...:             raise IndexError
       ...:     def __len__(self):
       ...:         return 10

    In [2]: np.array(C())
    Out[2]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

(omitting __len__ results in the creation of an object array, consistent with the fact that the sequence protocol requires __len__). Meanwhile, I found a new way to segfault numpy :-)

    In [3]: class C:
       ...:     def __getitem__(self, i):
       ...:         if i < 10:
       ...:             return i
       ...:         else:
       ...:             raise IndexError
       ...:     def __len__(self):
       ...:         return 42

    In [4]: np.array(C())
    Fatal Python error: Segmentation fault

(presumably because the reported __len__ of 42 disagrees with the 10 items the sequence actually yields). 2016-02-15 0:10 GMT-08:00 Nathaniel Smith <njs@pobox.com>:

Just an FYI: pandas implemented a RangeIndex in the upcoming 0.18.0, mainly for memory savings (see here <http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#range-index>), similar to how python range/xrange work. There are also substantial perf benefits, mainly with set operations (see here <https://github.com/pydata/pandas/blob/master/pandas/indexes/range.py#L274>), though I didn't officially benchmark these. Jeff On Mon, Feb 15, 2016 at 11:13 AM, Antony Lee <antony.lee@berkeley.edu> wrote:

On Mon, Feb 15, 2016 at 4:24 PM, Jeff Reback <jeffreback@gmail.com> wrote:
just an FYI.
pandas implemented a RangeIndex in upcoming 0.18.0, mainly for memory
savings,
Since it is a numpy-aware object (unlike the builtins), you can (and have, if I'm reading the code correctly) implement __array__() such that it does the performant thing and calls np.arange(). RangeIndex won't be adversely impacted by retaining the status quo. -- Robert Kern
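[Editor's note: a minimal sketch of that pattern with a toy range-like class; MyRange is hypothetical, not pandas' actual RangeIndex.]

    import numpy as np

    class MyRange:
        # A lazily-evaluated range that numpy knows how to convert.
        def __init__(self, start, stop, step=1):
            self.start, self.stop, self.step = start, stop, step

        def __array__(self, dtype=None):
            # np.array consults __array__ before its generic fallbacks,
            # so the conversion goes straight through np.arange.
            return np.arange(self.start, self.stop, self.step, dtype=dtype)

    print(np.array(MyRange(0, 5)))  # [0 1 2 3 4]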

On Sun, Feb 14, 2016 at 11:41 PM, Antony Lee <antony.lee@berkeley.edu> wrote:
So how can np.array(range(...)) even work?
range() (in py3) is not a generator, nor is it an iterator. It is a range object, which is lazily evaluated and satisfies both the iterable protocol and the sequence protocol (at least most of it):

    In [1]: r = range(10)

    In [2]: r[3]
    Out[2]: 3

    In [3]: len(r)
    Out[3]: 10

    In [4]: type(r)
    Out[4]: range

    In [9]: isinstance(r, collections.abc.Sequence)
    Out[9]: True

    In [10]: l = list()

    In [11]: isinstance(l, collections.abc.Sequence)
    Out[11]: True

    In [12]: isinstance(r, collections.abc.Iterable)
    Out[12]: True

I'm still totally confused as to why we'd need to special-case range when we have arange().

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov

Mostly so that no performance is lost when someone passes range(...) instead of np.arange(...). At least I had never realized that one is much faster than the other and always just passed range() as a convenience. Antony 2016-02-17 10:50 GMT-08:00 Chris Barker <chris.barker@noaa.gov>:

On Thu, Feb 18, 2016 at 1:15 PM, Antony Lee <antony.lee@berkeley.edu> wrote:
Thanks, I didn't know that the range r here doesn't get eaten by iterating through it, while r = (i for i in range(5)) is only good for a single pass. (tried on python 3.4) Josef
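[Editor's note: a quick sketch of the difference -- a range can be re-iterated, while a generator expression is exhausted after one pass.]

    r = range(3)
    print(list(r))  # [0, 1, 2]
    print(list(r))  # [0, 1, 2] -- range objects are reusable

    g = (i for i in range(3))
    print(list(g))  # [0, 1, 2]
    print(list(g))  # [] -- the generator is already exhausted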

On Thu, Feb 18, 2016 at 10:15 AM, Antony Lee <antony.lee@berkeley.edu> wrote:
Well, pretty much everything in numpy is faster if you use the numpy array version rather than plain python -- so this alone hardly seems worth the extra code. numpy's array() constructor can (and should) take an arbitrary iterable. It does make some sense that we might want to special-case iterators, as you don't want to loop through them too many times, which is what np.fromiter() is for. And _maybe_ it would be worth special-casing python lists, as you can access items faster, and they are really, really common (or has this already been done?), but special-casing range() is getting silly. And it might be hard to do. At the C level I suppose you could actually know what the parameters and state of the range object are and create an array directly from that -- but that's what arange is for... -CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
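[Editor's note: since np.fromiter keeps coming up as the single-pass tool for iterators, a short usage sketch; dtype is required, and passing count lets numpy preallocate instead of growing the buffer.]

    import numpy as np

    gen = (i * i for i in range(10))
    a = np.fromiter(gen, dtype=np.int64, count=10)
    print(a)  # [ 0  1  4  9 16 25 36 49 64 81]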

In a sense this discussion is really about making np.array(iterable) more efficient, so I restarted the discussion at https://mail.scipy.org/pipermail/numpy-discussion/2016-February/075059.html Antony 2016-02-18 14:21 GMT-08:00 Chris Barker <chris.barker@noaa.gov>:

participants (9)
- Antony Lee
- Charles R Harris
- Chris Barker
- Jeff Reback
- josef.pktd@gmail.com
- Nathaniel Smith
- Ralf Gommers
- Robert Kern
- Sebastian Berg