Histograms via indirect index arrays
Hey,

First of all, thanks for the new release. Here's another question regarding something I cannot quite understand. Suppose you want to update bins for a histogram; you might think you could do something like:

    g = zeros(4, Int)
    x = array([0.2, 0.2])
    idx = floor(x/0.1).astype(int)
    g[idx] += 1

Here idx becomes array([2, 2]). In this case, I would naively expect g to end up like

    array([0, 0, 2, 0])    (1)

but instead you get

    array([0, 0, 1, 0])    (2)

Is this intended? Being a plain novice, I would expect the indexing operation g[idx] += 1 to do something like

    for i in range(len(idx)):
        g[idx[i]] += 1

resulting in (1) and not (2).

// Mads
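For anyone reading this in the archive: later NumPy releases added `numpy.ufunc.at`, an unbuffered in-place indexed operation that performs exactly the accumulation Mads expects. A minimal sketch, assuming a NumPy new enough to have `np.add.at`:

```python
import numpy as np

g = np.zeros(4, dtype=int)
x = np.array([0.2, 0.2])
idx = np.floor(x / 0.1).astype(int)   # array([2, 2])

# np.add.at applies the addition once per index, repeats included,
# so both occurrences of index 2 contribute -- result (1), not (2).
np.add.at(g, idx, 1)
print(g)  # [0 0 2 0]
```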
I have a very much related problem: not only does the idea described by Mads Ipsen not work, but I could generally find no efficient way to do a "counting" of elements in an array, as is needed for a histogram. The "histogram" function contained in numpy uses a rather inefficient method, involving the sorting of the data array. What would instead be needed is a function that simply gives the count of occurrences of given values in a given array:
    [4,5,2,3,2,1,4].count([0,1,2,3,4,5])
    [0,1,2,1,1,2]
All the solutions that I found so far involve either sorting the data or writing a loop in Python, both of which are unacceptable for performance. Am I missing something obvious?

Mads Ipsen wrote:
Hey,
First of all, thanks for the new release.
Here's another question regarding something I cannot quite understand:
Suppose you want to update bins for a histogram, you might think you could do something like:
g = zeros(4, Int)
x = array([0.2, 0.2])
idx = floor(x/0.1).astype(int)
g[idx] += 1
Here idx becomes
array([2, 2])
In this case, I would naively expect g to end up like
array([0, 0, 2, 0]) (1)
but instead you get
array([0, 0, 1, 0]) (2)
Is this intended? Being a plain novice, I would expect the indexing operation g[idx] += 1 to do something like

    for i in range(len(idx)):
        g[idx[i]] += 1
resulting in (1) and not (2).
// Mads
------------------------------------------------------- This SF.Net email is sponsored by xPML, a groundbreaking scripting language that extends applications into web and mobile media. Attend the live webcast and join the prime developer group breaking into this new coding territory! http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642 _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Norbert Nemec wrote:
I have a very much related problem: Not only that the idea described by Mads Ipsen does not work, but I could generally find no efficient way to do a "counting" of elements in an array, as it is needed for a histogram.
This may be something we are lacking. It depends on what you mean by efficient, I suppose. The sorting algorithms are very fast, and the histogram routines that are scattered all over the place (including the histogram function in numpy) have been in use for a long time; you are the first person to complain of their efficiency. That isn't to say your complaint may not be valid; it's just that for most people the speed has been sufficient.
What would instead be needed is a function that simply gives the count of occurances of given values in a given array:
I presume you are talking about "integer" arrays, since equality comparison of floating-point values is usually not very helpful, so most histograms on floating-point values are given in terms of bins. Thus, the searchsorted function uses bins for its "counting" operation.
[4,5,2,3,2,1,4].count([0,1,2,3,4,5])
[0,1,2,1,1,2]
A count function for integer arrays could certainly be written using a C loop. But I would first just use histogram([4,5,2,3,2,1,4], [0,1,2,3,4,5]) and make sure that's really too slow before worrying about it too much.

Also, according to the above function, the right answer is:

    [0, 1, 2, 1, 2, 1]

Best,

-Travis
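Travis's corrected count can be reproduced today with `numpy.bincount`, which counts occurrences of each nonnegative integer in a single O(N) pass. A hedged sketch (the `minlength` keyword is a later addition to bincount):

```python
import numpy as np

data = np.array([4, 5, 2, 3, 2, 1, 4])

# One bin per integer 0..5; bincount does a single counting pass.
counts = np.bincount(data, minlength=6)
print(counts)  # [0 1 2 1 2 1] -- Travis's corrected answer
```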
Sorry, Travis, I missed your answer for a week...

I confess, my efficiency concerns about the histogram routine are, to some extent, of a theoretical nature. However, given the widespread use of the function and considering the simplicity of the solution, I think it is still worth thinking about:

The current histogram routine is based on a "sort" of the incoming data. Quicksort is usually O(N log N), worst case O(N**2). Doing the histogram by simple counting is an O(N) procedure. The absolute cost of the quicksort algorithm is low, but simple counting should be even lower, if done efficiently in C.

What probably weighs heavier, though, is the memory usage: sorting can either be done in place, destroying the original data, or it requires a copy. Counting could be done even on an iterator, without storing all the data.

Lacking a comparable implementation of the counting algorithm, I have not done any actual speed comparisons. Why nobody ever found this inefficiency, I cannot say. Probably nobody ever had a real comparison of how fast the histogram could be. Unless you are dealing with really huge data, the difference will not show up.

Anyway, as I said, a solution should be simple to code once it is decided how to do it. Simply recoding the histogram routine itself in C would be the minimal solution. Implementing a more flexible counting routine that can be used in other contexts as well would certainly be more desirable, but I have no clear vision of what that should look like.

Greetings,
Norbert

Travis Oliphant wrote:
Norbert Nemec wrote:
I have a very much related problem: Not only that the idea described by Mads Ipsen does not work, but I could generally find no efficient way to do a "counting" of elements in an array, as it is needed for a histogram.
This may be something we are lacking.
It depends on what you mean by efficient, I suppose. The sorting algorithms are very fast, and the histogram routines that are scattered all over the place (including the histogram function in numpy) have been in use for a long time; you are the first person to complain of their efficiency. That isn't to say your complaint may not be valid; it's just that for most people the speed has been sufficient.
What would instead be needed is a function that simply gives the count of occurances of given values in a given array:
I presume you are talking about "integer" arrays, since equality comparison of floating-point values is usually not very helpful, so most histograms on floating-point values are given in terms of bins. Thus, the searchsorted function uses bins for its "counting" operation.
[4,5,2,3,2,1,4].count([0,1,2,3,4,5])
[0,1,2,1,1,2]
A count function for integer arrays could certainly be written using a C loop. But I would first just use histogram([4,5,2,3,2,1,4], [0,1,2,3,4,5]) and make sure that's really too slow before worrying about it too much.
Also, according to the above function, the right answer is:
[0, 1, 2, 1, 2, 1]
Best,
-Travis
Norbert Nemec wrote:
Sorry, Travis, I missed your answer for a week...
I confess, my efficiency concerns about the histogram routine are, to some extent, of a theoretical nature. However, given the widespread use of the function and considering the simplicity of the solution, I think it is still worth thinking about:
The current histogram routine is based on a "sort" of the incoming data. Quicksort is usually O(N log N), worst case O(N**2). Doing the histogram by simply counting is a simple O(N) procedure. The absolute cost of the quicksort algorithm is low, but the simple counting should be even lower, if done efficiently in C.
I'm not so sure about this analysis: histogram allows arbitrarily spaced bins, so I think the actual comparison is O(N log N) versus O(N log M), where M is the number of bins. The number of bins will typically be much lower than N, it's true, but the log will tend to wash that out some. Take, for example, 1024 items into 32 bins. Then log N is 10 versus log M being 5. This difference is small enough that the differences in constant factors will probably dominate.
What probably weighs heavier though, is the memory usage: sorting can either be done in place, destroying the original data, or requires a copy.
I wouldn't worry about the memory usage itself so much. We all have oodles of RAM and virtual memory these days anyway. In addition, you'll probably end up copying discontiguous data anyway, which one ends up with rather frequently. However, the counting algorithm should be more cache-friendly than the sort-based one. I find that operations on matrices that don't fit in the cache are dominated by memory access time, not CPU time. On my machine that happens somewhere between 10k and 100k elements. For arrays with 100k or so elements, the counting algorithm might be a big win.
Counting could be done even on an iterator without storing all the data.
This is true. However, if you go through the iterator interface, it will be much slower than directly addressing the array memory. Such a thing might be cool to have somewhere (itertools or something), but probably isn't appropriate for numpy.
Lacking a comparable implementation of the counting algorithm, I have not done any actual speed comparisons. Why nobody ever found this inefficiency, I cannot say. Probably nobody ever had a real comparison of how fast the histogram could be. Unless you are dealing with really huge data, the difference will not show up.
This is the crux of the issue, I think. The implementation of histogram looks fine for what it does. For reasonably large arrays, the overhead of it being implemented in Python will be negligible. Without someone who actually needs something different, a new implementation is liable to wander off into the realm of pointless micro-optimization. So, my advice is to leave it alone till it actually causes you problems.
Anyway, as I said, a solution should be simple to code once it is decided how to do it. Simply recoding the histogram routine itself in C would be the minimal solution. Implementing a more flexible counting routine that can be used in other contexts as well would certainly be more desirable, but I have no clear vision what that should look like.
Well, if you were going to do this, I would code a drop-in replacement for the current histogram. No iterators (they belong in a different library). Variably spaced bins allowed, etc. It doesn't sound like we need a different histogram function, and the more overlapping functions there are, the less each individual one gets used and tested and the more likely that bugs and bitrot set in.

What I'd actually do is implement something that behaved like this:

    def histogram_core(data, bins):
        # I hope I grabbed the right stuff out of histogram
        # bins is an actual list of bin edges, as per histogram,
        # which could really use a docstring
        n = sort(data).searchsorted(bins)
        n = concatenate([n, [len(data)]])
        n = n[1:] - n[:-1]
        return n

except written in C and using a counting algorithm. Then histogram can be just a thin wrapper over histogram_core. Then test it with a bunch of different array sizes and bin counts and see how it does. My guess is that the sort-based algorithm will do better in some cases (arrays with 10k or fewer elements and lots of bins) while the counting algorithm does better in others (few bins, or 100k or more elements). But we'll see. Or maybe we won't.

Regards,

-tim
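Tim's draft runs as-is once the stray `a` is renamed to `data`. A runnable sketch (pure NumPy, not the C-with-counting version he actually proposes):

```python
import numpy as np

def histogram_core(data, bins):
    # bins is a list of bin edges; bin i counts values in
    # [bins[i], bins[i+1]), and the last bin is open-ended.
    n = np.sort(data).searchsorted(bins)
    n = np.concatenate([n, [len(data)]])
    return n[1:] - n[:-1]

data = np.array([0.5, 1.5, 1.7, 2.5, 3.9])
bins = np.array([0.0, 1.0, 2.0, 3.0])
print(histogram_core(data, bins))  # [1 2 1 1]
```

The searchsorted trick turns each bin edge into its insertion point in the sorted data; adjacent differences of those positions are exactly the per-bin counts.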
Greetings, Norbert
Travis Oliphant wrote:
Norbert Nemec wrote:
I have a very much related problem: Not only that the idea described by Mads Ipsen does not work, but I could generally find no efficient way to do a "counting" of elements in an array, as it is needed for a histogram.
This may be something we are lacking.
It depends on what you mean by efficient, I suppose. The sorting algorithms are very fast, and the histogram routines that are scattered all over the place (including the histogram function in numpy) have been in use for a long time; you are the first person to complain of their efficiency. That isn't to say your complaint may not be valid; it's just that for most people the speed has been sufficient.
What would instead be needed is a function that simply gives the count of occurances of given values in a given array:
I presume you are talking about "integer" arrays, since equality comparison of floating-point values is usually not very helpful, so most histograms on floating-point values are given in terms of bins. Thus, the searchsorted function uses bins for its "counting" operation.
[4,5,2,3,2,1,4].count([0,1,2,3,4,5])
[0,1,2,1,1,2]
A count function for integer arrays could certainly be written using a C loop. But I would first just use histogram([4,5,2,3,2,1,4], [0,1,2,3,4,5]) and make sure that's really too slow before worrying about it too much.
Also, according to the above function, the right answer is:
[0, 1, 2, 1, 2, 1]
Best,
-Travis
Mads Ipsen wrote:
Hey,
First of all, thanks for the new release.
Here's another question regarding something I cannot quite understand:
Suppose you want to update bins for a histogram, you might think you could do something like:
g = zeros(4, Int)
x = array([0.2, 0.2])
idx = floor(x/0.1).astype(int)
g[idx] += 1
Here idx becomes
array([2, 2])
In this case, I would naively expect g to end up like
array([0, 0, 2, 0]) (1)
but instead you get
array([0, 0, 1, 0]) (2)
Is this intended? Just being plain novice-like naive, I would expect the slice operation g[idx] += 1 to do something like
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior, though).

The issue is that the code g[idx] += 1 is equivalent in Python to

    g[idx] = g[idx] + 1

Then g[idx] returns array([0, 0]), as it should. This new copy of the array data then gets added to 1, resulting in array([1, 1]). This array is then set into element 2 of g twice, just as if you had done

    g[2] = 1
    g[2] = 1

The end effect is that you get a 1 in the result.

Perhaps a little counter to the way you were thinking of the problem, but very much how it must be, due to the way Python translates the statement g[idx] += 1 in this case.

There are, of course, other ways to do what you want to do.

-Travis
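Travis's expansion can be spelled out step by step. A sketch of what the statement effectively does when idx is an index array, in modern NumPy:

```python
import numpy as np

g = np.zeros(4, dtype=int)
idx = np.array([2, 2])

# What `g[idx] += 1` effectively expands to:
tmp = g[idx]    # fancy indexing returns a COPY: array([0, 0])
tmp = tmp + 1   # array([1, 1])
g[idx] = tmp    # writes element 2 twice; the last write wins
print(g)        # [0 0 1 0] -- result (2), not the naive (1)
```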
On Friday 17 March 2006 03:59, Travis Oliphant wrote:
Mads Ipsen wrote:
Hey,
First of all, thanks for the new release.
Here's another question regarding something I cannot quite understand:
Suppose you want to update bins for a histogram, you might think you could do something like:
g = zeros(4, Int)
x = array([0.2, 0.2])
idx = floor(x/0.1).astype(int)
g[idx] += 1
Here idx becomes
array([2, 2])
In this case, I would naively expect g to end up like
array([0, 0, 2, 0]) (1)
but instead you get
array([0, 0, 1, 0]) (2)
Is this intended? Just being plain novice-like naive, I would expect the slice operation g[idx] += 1 to do something like
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:

http://docs.python.org/ref/numeric-types.html

Quote: "These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self)."

What you describe is "lack of attempt," in which case the "fall back" behavior gets triggered.
Then g[idx] returns array([0,0]) as it should. This new copy of the array data then gets added to 1 resulting in array([1,1]). This array is then set into element 2 of g twice just as if you had done
g[2] = 1
g[2] = 1
The end effect is that you get a 1 in the result.
Perhaps a little counter to the way you were thinking of the problem, but very much how it must be due to the way Python translates the statement g[idx] += 1 in this case.
There are, of course, other ways to do what you want to do.
Histograms are just the trivial instantiation of the problem (the problem being unintuitive behavior, at least for Mads and me). If in-place arithmetic didn't make copies, it would lend itself nicely to sparse matrix computations: efficient, compact, and in Python. And there are also other histogram-like computations that do not use '+' but other operators.

Piotr
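On Piotr's last point: later NumPy versions generalize the unbuffered indexed update to any binary ufunc via `ufunc.at`, not just '+'. A hedged sketch of a "histogram-like" computation with max instead of addition, assuming a NumPy with `np.maximum.at`:

```python
import numpy as np

g = np.zeros(4)
idx = np.array([2, 2, 0])
vals = np.array([1.5, 3.0, 2.0])

# Each (index, value) pair is applied in turn, repeats included,
# so slot 2 ends up holding max(0, 1.5, 3.0).
np.maximum.at(g, idx, vals)
print(g)  # [2. 0. 3. 0.]
```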
-Travis
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax

    g += 1

Then there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition," i.e., at issue is how Python interprets

    g[idx] += 1

How does this get compiled to byte-code? There are two possibilities:

1) g[idx] creates a new object, which then has 1 added to it using in-place addition.

This would not produce the desired behavior, as g[idx] is a copy of the data when idx is a general indexing array, as it is in this case. So, you make a copy of those indexed elements, add 1 to them, and then do what with the result?

2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1.

This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior, because there is no __inplace_with_indexing_add__ operator to over-ride. There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.

Note, however, that if idx is slice syntax, then the operation is done without making copies, because, for example,

    g[1:10:2]

returns a "view" of the data (an array that shares the same memory as the original array). Thus,

    g[1:10:2] += 1

does an in-place add without data copying.

It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.

-Travis
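The call sequence Travis describes can be observed with a small logging class (`Logger` is a hypothetical stand-in, not part of numpy): Python calls `__getitem__`, then does the in-place add on the *retrieved object*, then calls `__setitem__`. There is indeed no combined "add at index" hook for the container to override.

```python
calls = []

class Logger:
    def __init__(self):
        self.data = [0, 0, 0, 0]
    def __getitem__(self, idx):
        calls.append('__getitem__')
        return self.data[idx]
    def __setitem__(self, idx, value):
        calls.append('__setitem__')
        self.data[idx] = value

g = Logger()
g[2] += 1   # compiles to roughly: tmp = g[2]; tmp += 1; g[2] = tmp

print(calls)   # ['__getitem__', '__setitem__']
print(g.data)  # [0, 0, 1, 0]
```

Note that the `+= 1` happens on the int returned by `__getitem__`, so Logger itself only ever sees a read followed by a write.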
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax
g += 1
Then, there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition".
i.e. at issue is
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior, as g[idx] is a copy of the data when idx is a general indexing array, as it is in this case. So, you make a copy of those indexed elements, add 1 to them, and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I'm seeing happen on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

    __getitem__(self, idx)
    __iadd__(1)
    __setitem__(result of __iadd__)

By design, numpy returns views from __getitem__. In this case, it would be a view into 'self' and 'idx', so __iadd__ would just use 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything, since 'self' and 'value' will be the same.

Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Note, however, if idx is slice syntax, then the operation is done without making copies because, for example,
g[1:10:2]
returns a "view" of the data (an array that shares the same memory as the original array). Thus,

g[1:10:2] += 1

does an in-place add without data copying.
It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.
-Travis
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax
g += 1
Then, there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition".
i.e. at issue is
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior, as g[idx] is a copy of the data when idx is a general indexing array, as it is in this case. So, you make a copy of those indexed elements, add 1 to them, and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I'm seeing happen on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

    __getitem__(self, idx)
    __iadd__(1)
    __setitem__(result of __iadd__)
By design numpy returns views from __getitem__
You are missing that idx is not a slice index; rather, it is an index array (or list). In this case g[idx] does *not* return a view; it returns a copy. From this everything else follows.

Conceivably, g[idx] could return a pseudo-array object like flat does, but I suspect that might have bad performance characteristics.

-tim
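Tim's view-vs-copy distinction is easy to check directly; a sketch using modern NumPy's `.base` attribute (which is None when an array owns its data):

```python
import numpy as np

g = np.arange(8)

s = g[1:7:2]     # slice -> view sharing g's memory
s += 100         # in-place add is visible through g
print(g)         # [  0 101   2 103   4 105   6   7]

f = g[np.array([0, 2])]   # index array -> copy
f += 100                  # g is unaffected
print(g[0], g[2])         # 0 2
print(s.base is g, f.base is None)  # True True
```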
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Note, however, if idx is slice syntax, then the operation is done without making copies because, for example,
g[1:10:2]
returns a "view" of the data (an array that shares the same memory as the original array). Thus,

g[1:10:2] += 1

does an in-place add without data copying.
It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.
-Travis
On Friday 17 March 2006 14:43, Tim Hochberg wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax
g += 1
Then, there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition".
i.e. at issue is
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior, as g[idx] is a copy of the data when idx is a general indexing array, as it is in this case. So, you make a copy of those indexed elements, add 1 to them, and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I'm seeing happen on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

    __getitem__(self, idx)
    __iadd__(1)
    __setitem__(result of __iadd__)
By design numpy returns views from __getitem__
You are missing that idx is not a slice index; rather, it is an index array (or list). In this case g[idx] does *not* return a view; it returns a copy. From this everything else follows.
You are missing that indexing with an array doesn't have to make a copy. From this everything else follows. In other words, it's a design decision that slices give views and indirection with arrays (I don't care for lists) gives a copy. I question things on a conceptual level: both objects (the array and the indexing array) are from numpy. What code would break if the operation didn't return a copy?
Conceivably g[idx] could return a psuedo-array object like flat does, but I suspect that might have bad performance characteristics.
-tim
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Note, however, if idx is slice syntax, then the operation is done without making copies because, for example,
g[1:10:2]
returns a "view" of the data (an array that shares the same memory as the original array). Thus,

g[1:10:2] += 1

does an in-place add without data copying.
It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.
-Travis
Piotr Luszczek wrote:
On Friday 17 March 2006 14:43, Tim Hochberg wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).
The issue is that the code g[idx] +=1 is equivalent in Python to
g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax
g += 1
Then, there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition".
i.e. at issue is
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior as g[idx] is a copy of the data when idx is a general indexing array as it is in this case. So, you make a copy of those indices, add 1 to them and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I see happening on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

__getitem__(self, idx)
__iadd__(1)
__setitem__(result of __iadd__)
By design numpy returns views from __getitem__
You are missing that idx is not a slice index. Rather, it is an index array (or list). In this case g[idx] does *not* return a view, it returns a copy. From this everything else follows.
You are missing that indexing with array doesn't have to make a copy.
From this everything else follows. In other words, it's a design decision that slices give views and indirection with arrays (I don't care for lists) gives a copy. I question things on a conceptual level: both objects (the array and the indexing array) are from numpy. What code would break if the operation didn't return a copy?
Conceivably g[idx] could return a pseudo-array object like flat does, but I suspect that might have bad performance characteristics.
No I'm not. As I said, conceivably it could be done without making a copy. However, that would have performance implications; some good and some bad. It's almost certain that some code would break too, although I don't think that's much of a concern if someone got this in before numpy 1.0. We'll probably never know how well or ill this performs unless someone steps up and tries to implement this. I'm not interested. Are you? -tim
-tim
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Note, however, if idx is slice syntax, then the operation is done without making copies because, for example,
g[1:10:2]
returns a "view" of the data (an array that shares the same memory) as the original array. Thus,
g[1:10:2] += 1
does an inplace add without data-copying.
It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.
-Travis
On Friday 17 March 2006 16:49, Tim Hochberg wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 14:43, Tim Hochberg wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
Piotr Luszczek wrote:
Yes, this is intended (sort of --- this particular example isn't the reason for the behavior though).

The issue is that the code g[idx] += 1 is equivalent in Python to

g[idx] = g[idx] + 1
This is not what I read at:
http://docs.python.org/ref/numeric-types.html
Quote:
These methods should attempt to do the operation in-place (modifying self) and return the result (which could be, but does not have to be, self).
What you describe is "lack of attempt" in which case the "fall back" behavior gets triggered.
The problem is that this explanation is very clear when we are talking about the syntax
g += 1
Then, there is a method of g that can be over-ridden to get the desired behavior. Now, what you are talking about is "indexing followed by in-place addition".
i.e. at issue is
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior as g[idx] is a copy of the data when idx is a general indexing array as it is in this case. So, you make a copy of those indices, add 1 to them and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I see happening on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

__getitem__(self, idx)
__iadd__(1)
__setitem__(result of __iadd__)
By design numpy returns views from __getitem__
You are missing that idx is not a slice index. Rather, it is an index array (or list). In this case g[idx] does *not* return a view, it returns a copy. From this everything else follows.
You are missing that indexing with array doesn't have to make a copy.
No I'm not.
From this everything else follows. In other words, it's a design decision that slices give views and indirection with arrays (I don't care for lists) gives a copy. I question things on a conceptual level: both objects (the array and the indexing array) are from numpy. What code would break if the operation didn't return a copy?
Conceivably g[idx] could return a pseudo-array object like flat does, but I suspect that might have bad performance characteristics.
As I said, conceivably it could be done without making a copy. However, that would have performance implications; some good and some bad. It's almost certain that some code would break too, although I don't think that's much of a concern if someone got this in before numpy 1.0. We'll probably never know how well or ill this performs unless someone steps up and tries to implement this. I'm not interested. Are you?
Yes. Very much so. To me it's a very compact and intuitive way of encoding many things I do. Histogram is trivial but useful. Sparse matrix calculations are more important to me. I've raised this issue at least once before and didn't get much response. I think it's better this time around. But ultimately I agree with Robert Kern that I need to show some code and timing data to make people pay attention.
-tim
-tim
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Note, however, if idx is slice syntax, then the operation is done without making copies because, for example,
g[1:10:2]
returns a "view" of the data (an array that shares the same memory) as the original array. Thus,
g[1:10:2] += 1
does an inplace add without data-copying.
It may be possible to do what you want using zero-strided arrays, but I'll leave that for a more motivated contributor.
-Travis
I just figured I'd add a couple of thoughts outside that needlessly contentious thread.

In theory I'm all for view semantics for an array indexed by an array (I'm sure we have a good name for that, but it's escaping me). Indexing in numpy can be confusing enough without some indexing operations returning views and others copies. This is orthogonal to any issues of performance.

In practice, I'm a bit skeptical. The result would need to be some sort of pseudo array object (similar to array.flat). Operations on this object would necessarily have worse performance than operations on a normal array due to the added level of indirection. In some circumstances it would also hold onto a lot of memory that might otherwise be freed, since it holds a reference to the data for both the original array and the index array. How much of an effect this would have on the typical user is hard to say; any effects would certainly depend a lot on usage patterns.

Here's some cases and guesses as to how I think things would work out (idx is an array):

a[idx] += 1
# Improved performance and, more importantly, a result consistent with other indexing ops.

c = a[idx] + b[idx]
# Probably neutral: creating the pseudo arrays should be fast, but then they need to be copied anyway in the add.

c, d = a[idx], b[idx]
del a, b
e = c + d
f = 2*c + d
# Probably bad: need to copy the pseudo array into a contiguous buffer multiple times. Also holds extra memory.

I'd like to see someone try it, but it's not high enough on my priority list for me to dive into it (I'm blowing all my spare cycles and then some on numexpr).

-tim
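For the curious, here is a minimal sketch of the pseudo-array Tim describes. The class name IndexedView is hypothetical, and the write-back leans on the modern np.add.at (which did not exist at the time of this thread), so this illustrates the idea rather than anything that was actually implemented.

```python
import numpy as np

class IndexedView(object):
    """Hypothetical sketch of a view onto a[idx]: it holds references
    to the base array and the index array instead of copying the
    selected elements, so in-place ops write through to the base."""

    def __init__(self, base, idx):
        self.base = base
        # Copy the indices so later mutation of idx can't silently
        # change what this view refers to (Rick's concern below).
        self.idx = np.array(idx)

    def __iadd__(self, other):
        # np.add.at is unbuffered, so repeated indices each contribute.
        np.add.at(self.base, self.idx, other)
        return self

    def copy(self):
        # Materialize into a plain array -- the extra indirection
        # (and copying) Tim mentions.
        return self.base[self.idx]

g = np.zeros(4, dtype=int)
v = IndexedView(g, [2, 2])
v += 1                            # writes through to g, per occurrence
assert list(g) == [0, 0, 2, 0]
assert list(v.copy()) == [2, 2]
```

Note that making g[idx] itself return such an object would also require it to behave like a real array everywhere else, which is exactly the integration problem raised later in the thread.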
On Fri, 17 Mar 2006, Tim Hochberg wrote:
In theory I'm all for view semantics for an array indexed by an array (I'm sure we have a good name for that, but it's escaping me). Indexing in numpy can be confusing enough without some indexing operations returning views and others copies. This is orthogonal to any issues of performance.
In practice, I'm a bit skeptical. The result would need to be some sort of pseudo array object (similar to array.flat). Operations on this object would necessarily have worse performance than operations on a normal array due to the added level of indirection. In some circumstances it would also hold onto a lot of memory that might otherwise be freed, since it holds a reference to the data for both the original array and the index array.
Actually I think it is worse than that -- it seems to me that it actually has to make a *copy* of the index array. I don't think that we would want to keep only a reference to the index array, since if it changed then the view could respond by changing in very unexpected ways. That sounds like a nightmare side-effect to me. That's what has always made me think that this is not a good idea, even if the bookkeeping of carrying around an unevaluated array+indices could be worked out efficiently. In my applications I sometimes use very large index arrays, and I don't want to have to copy them unnecessarily. Generally I much prefer instant evaluation as in the current implementation, since that uses the minimum of memory. For what it's worth, IDL behaves exactly like the current numpy: a[idx] += 1 increments each element by 1 regardless of how many times a particular index is included in idx.
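As a side note on the counting problem that started this sub-thread: later NumPy releases provide np.bincount, which counts occurrences of non-negative integers in a single pass, with no sort and no Python loop. (Again a later addition, not part of the API being discussed here.)

```python
import numpy as np

data = np.array([4, 5, 2, 3, 2, 1, 4])

# One pass over the data; counts[v] is the number of times the
# value v occurs, for v = 0 .. data.max().
counts = np.bincount(data)
assert list(counts) == [0, 1, 2, 1, 2, 1]
```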
Rick White wrote:
On Fri, 17 Mar 2006, Tim Hochberg wrote:
In theory I'm all for view semantics for an array indexed by an array (I'm sure we have a good name for that, but it's escaping me). Indexing in numpy can be confusing enough without some indexing operations returning views and others copies. This is orthogonal to any issues of performance.
In practice, I'm a bit skeptical. The result would need to be some sort of pseudo array object (similar to array.flat). Operations on this object would necessarily have worse performance than operations on a normal array due to the added level of indirection. In some circumstances it would also hold onto a lot of memory that might otherwise be freed, since it holds a reference to the data for both the original array and the index array.
Actually I think it is worse than that -- it seems to me that it actually has to make a *copy* of the index array. I don't think that we would want to keep only a reference to the index array, since if it changed then the view could respond by changing in very unexpected ways. That sounds like a nightmare side-effect to me.
Yeah, that would be bad. Conceivably (there's that word again) one could implement copy-on-write for the index arrays, but that would be another can of worms. Anyway, I agree that you would have to at least fake a copy somehow.
That's what has always made me think that this is not a good idea, even if the bookkeeping of carrying around an unevaluated array+indices could be worked out efficiently. In my applications I sometimes use very large index arrays, and I don't want to have to copy them unnecessarily. Generally I much prefer instant evaluation as in the current implementation, since that uses the minimum of memory.
For what it's worth, IDL behaves exactly like the current numpy: a[idx] += 1 increments each element by 1 regardless of how many times a particular index is included in idx.
I'm not much concerned with that case myself. The case that bothers me (more in the abstract than in reality, since I don't use index arrays much) is the following:
idx1 = slice(0, None, 2)
idx2 = [0, 2, 4, 6, 8]
a = arange(10)
b = arange(10)
ai = a[idx1]
bi = b[idx2]

ai
array([0, 2, 4, 6, 8])
bi
array([0, 2, 4, 6, 8])

ai[1] = bi[1] = 99

a
array([ 0,  1, 99,  3,  4,  5,  6,  7,  8,  9])
b
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
The fact that 'a' and 'b' diverge here is a wart, although not one I'm sure it's worth doing anything about. -tim
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
In [161]: c = compile('g[idx] += 1', '<str>', 'single')

In [162]: import dis

In [163]: dis.dis(c)
  1           0 LOAD_NAME                0 (g)
              3 LOAD_NAME                1 (idx)
              6 DUP_TOPX                 2
              9 BINARY_SUBSCR
             10 LOAD_CONST               0 (1)
             13 INPLACE_ADD
             14 ROT_THREE
             15 STORE_SUBSCR
             16 LOAD_CONST               1 (None)
             19 RETURN_VALUE
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior as g[idx] is a copy of the data when idx is a general indexing array as it is in this case. So, you make a copy of those indices, add 1 to them and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I see happening on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

__getitem__(self, idx)
__iadd__(1)
__setitem__(result of __iadd__)
By design numpy returns views from __getitem__
Only for slices.

In [132]: a = arange(10)

In [133]: idx = [2,2,3]

In [134]: a[idx]
Out[134]: array([2, 2, 3])

In [135]: b = a[idx]

In [136]: b[-1] = 100

In [137]: a
Out[137]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
No, value is the result of __iadd__ on the temporary array. 'g[idx] += 1' expands to:

tmp = g.__getitem__(idx)
val = tmp.__iadd__(1)
g.__setitem__(idx, val)

Given these class definitions:

class A(object):
    def __getitem__(self, idx):
        print 'A.__getitem__(%r)' % idx
        return B()
    def __setitem__(self, idx, value):
        print 'A.__setitem__(%r, %r)' % (idx, value)

class B(object):
    def __iadd__(self, x):
        print 'B.__iadd__(%r)' % x
        return self
    def __repr__(self):
        return 'B()'

In [153]: a = A()

In [154]: a[[0, 2, 2, 1]] += 1
A.__getitem__([0, 2, 2, 1])
B.__iadd__(1)
A.__setitem__([0, 2, 2, 1], B())
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Aye, there's the rub. -- Robert Kern robert.kern@gmail.com "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Friday 17 March 2006 14:58, Robert Kern wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 13:29, Travis Oliphant wrote:
how does python interpret
g[idx] += 1
How does this get compiled to byte-code?
In [161]: c = compile('g[idx] += 1', '<str>', 'single')
In [162]: import dis
In [163]: dis.dis(c)
  1           0 LOAD_NAME                0 (g)
              3 LOAD_NAME                1 (idx)
              6 DUP_TOPX                 2
              9 BINARY_SUBSCR
             10 LOAD_CONST               0 (1)
             13 INPLACE_ADD
             14 ROT_THREE
             15 STORE_SUBSCR
             16 LOAD_CONST               1 (None)
             19 RETURN_VALUE
This proves my point.
There are two possibilities:
1) g[idx] creates a new object which then has 1 added to it using in-place addition.
This would not produce the desired behavior as g[idx] is a copy of the data when idx is a general indexing array as it is in this case. So, you make a copy of those indices, add 1 to them and then do what with the result?
2) g[idx] += 1 gets converted to g[idx] = g[idx] + 1
This appears to be effectively what Python actually does. Notice that there is no way for us to control this behavior because there is no __inplace_with_indexing_add__ operator to over-ride.
There is no such single operation to over-ride for the object. In other words, I don't see any way for us to even alter the object to get the behavior you want from that syntax. We can, of course, add a function or method to do that, but we would have to extend Python to get the behavior you want here.
Hardly. At least from what I see happening on a small example, 'g[idx] += 1' becomes ('g' and 'idx' are generic objects):

__getitem__(self, idx)
__iadd__(1)
__setitem__(result of __iadd__)
By design numpy returns views from __getitem__
Only for slices.
In [132]: a = arange(10)
In [133]: idx = [2,2,3]
In [134]: a[idx]
Out[134]: array([2, 2, 3])
In [135]: b = a[idx]
In [136]: b[-1] = 100
In [137]: a
Out[137]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Your example uses lists as indices. This is not interesting. I'm talking solely about arrays indexing other arrays. To me it is a special and very important case.
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
No, value is the result of __iadd__ on the temporary array.
'g[idx] += 1' expands to:
tmp = g.__getitem__(idx)
val = tmp.__iadd__(1)
g.__setitem__(idx, val)
You're missing the point. 'tmp' can be of a very specific type so that 'g.__setitem__' doesn't have to do anything: the 'add 1' was done by '__iadd__'.
Given these class definitions:
class A(object):
    def __getitem__(self, idx):
        print 'A.__getitem__(%r)' % idx
        return B()
    def __setitem__(self, idx, value):
        print 'A.__setitem__(%r, %r)' % (idx, value)

class B(object):
    def __iadd__(self, x):
        print 'B.__iadd__(%r)' % x
        return self
    def __repr__(self):
        return 'B()'
In [153]: a = A()
In [154]: a[[0, 2, 2, 1]] += 1
A.__getitem__([0, 2, 2, 1])
B.__iadd__(1)
A.__setitem__([0, 2, 2, 1], B())
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Aye, there's the rub.
Show me a code that breaks.
Piotr Luszczek wrote:
On Friday 17 March 2006 14:58, Robert Kern wrote:
Piotr Luszczek wrote:
By design numpy returns views from __getitem__
Only for slices.
In [132]: a = arange(10)
In [133]: idx = [2,2,3]
In [134]: a[idx]
Out[134]: array([2, 2, 3])
In [135]: b = a[idx]
In [136]: b[-1] = 100
In [137]: a
Out[137]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Your example uses lists as indices. This is not interesting. I'm talking solely about arrays indexing other arrays. To me it is a special and very important case.
The result is exactly the same.

In [164]: a = arange(10)

In [165]: idx = array([2,2,3])

In [166]: b = a[idx]

In [167]: b[-1] = 100

In [168]: a
Out[168]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
No, value is the result of __iadd__ on the temporary array.
'g[idx] += 1' expands to:
tmp = g.__getitem__(idx)
val = tmp.__iadd__(1)
g.__setitem__(idx, val)
You're missing the point. 'tmp' can be of a very specific type so that 'g.__setitem__' doesn't have to do anything: the 'add 1' was done by '__iadd__'.
No, I got your point just fine; I was correcting a detail. You would have to reimplement __getitem__ to return a new kind of object that represents a non-uniformly-strided array. If you want to get anywhere, go implement that object and come back. When we have something concrete to look at instead of vague assertions, then we can start tackling the issues of integrating it into the core such that 'g[idx] += 1' works like you want it to. For example, index arrays are used in more places than in-place addition. Your new type needs to be usable in all of those places since __getitem__, __iadd__ and __setitem__ don't know that they are being called in that order and in that fashion.
Given these class definitions:
class A(object):
    def __getitem__(self, idx):
        print 'A.__getitem__(%r)' % idx
        return B()
    def __setitem__(self, idx, value):
        print 'A.__setitem__(%r, %r)' % (idx, value)

class B(object):
    def __iadd__(self, x):
        print 'B.__iadd__(%r)' % x
        return self
    def __repr__(self):
        return 'B()'
In [153]: a = A()
In [154]: a[[0, 2, 2, 1]] += 1
A.__getitem__([0, 2, 2, 1])
B.__iadd__(1)
A.__setitem__([0, 2, 2, 1], B())
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Aye, there's the rub.
Show me a code that breaks.
<shrug> Show us some code that works. I'm not interested in implementing your feature request. You are. There's plenty of work that you can do that doesn't depend on anyone else agreeing with you, so you can stop arguing and start coding. -- Robert Kern robert.kern@gmail.com "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Friday 17 March 2006 16:04, Robert Kern wrote:
Piotr Luszczek wrote:
On Friday 17 March 2006 14:58, Robert Kern wrote:
Piotr Luszczek wrote:
By design numpy returns views from __getitem__
Only for slices.
In [132]: a = arange(10)
In [133]: idx = [2,2,3]
In [134]: a[idx]
Out[134]: array([2, 2, 3])
In [135]: b = a[idx]
In [136]: b[-1] = 100
In [137]: a
Out[137]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Your example uses lists as indices. This is not interesting. I'm talking solely about arrays indexing other arrays. To me it is a special and very important case.
The result is exactly the same.
In [164]: a = arange(10)
In [165]: idx = array([2,2,3])
In [166]: b = a[idx]
In [167]: b[-1] = 100
In [168]: a
Out[168]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In this case, it would be view into 'self' and 'idx' so the __iadd__ would just use the 'idx' directly rather than a copy. Finally, __setitem__ doesn't do anything since 'self' and 'value' will be the same.
No, value is the result of __iadd__ on the temporary array.
'g[idx] += 1' expands to:
tmp = g.__getitem__(idx)
val = tmp.__iadd__(1)
g.__setitem__(idx, val)
You're missing the point. 'tmp' can be of a very specific type so that 'g.__setitem__' doesn't have to do anything: the 'add 1' was done by '__iadd__'.
No, I got your point just fine; I was correcting a detail.
You would have to reimplement __getitem__ to return a new kind of object that represents a non-uniformly-strided array. If you want to get anywhere, go implement that object and come back. When we have something concrete to look at instead of vague assertions, then we can start tackling the issues of integrating it into the core such that 'g[idx] += 1' works like you want it to. For example, index arrays are used in more places than in-place addition. Your new type needs to be usable in all of those places since __getitem__, __iadd__ and __setitem__ don't know that they are being called in that order and in that fashion.
This is a tough requirement but perfectly reasonable. So when my day job lets me off the hook I'll give it a try.
Given these class definitions:
class A(object):
    def __getitem__(self, idx):
        print 'A.__getitem__(%r)' % idx
        return B()
    def __setitem__(self, idx, value):
        print 'A.__setitem__(%r, %r)' % (idx, value)

class B(object):
    def __iadd__(self, x):
        print 'B.__iadd__(%r)' % x
        return self
    def __repr__(self):
        return 'B()'
In [153]: a = A()
In [154]: a[[0, 2, 2, 1]] += 1
A.__getitem__([0, 2, 2, 1])
B.__iadd__(1)
A.__setitem__([0, 2, 2, 1], B())
Of course, this is just a quick draft. I don't know how it would work in practice and in other cases.
Aye, there's the rub.
Show me a code that breaks.
<shrug> Show us some code that works. I'm not interested in implementing your feature request. You are. There's plenty of work that you can do that doesn't depend on anyone else agreeing with you, so you can stop arguing and start coding.
Arguing is good to a point and I think you're right that it's time to stop.
Piotr Luszczek wrote:
<shrug> Show us some code that works. I'm not interested in implementing your feature request. You are. There's plenty of work that you can do that doesn't depend on anyone else agreeing with you, so you can stop arguing and start coding.
Arguing is good to a point and I think you're right that it's time to stop.
All true. However, I want to thank Piotr for illustrating an important but easy-to-overlook point: "advanced" indexing produces copies while "simple" indexing produces views. The reason behind this, as Tim and Rick later illustrate, is that no one quite knows how to produce "view" semantics for an arbitrarily indexed array.

I did actually think about this for a while, because I don't completely like the different behaviors, which can produce warts like the one that Tim showed us. I realized that it would be possible to extend our memory model of an array to include an index array in the array structure itself that would indicate how to advance through memory (basically a pointer or its equivalent for each array element). Rick pointed out that this would have to be a "copy" of the indices and so would be too slow. My reason for not going there was that it seemed to add too much complication to all the code for my tastes at the time.

But one could still sub-class the ndarray and implement such a thing. I think this is the best place to explore such experimental ideas.

Thanks again for the discussion. There are issues like this that it is good to get out into the open.

-Travis
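The copy-versus-view split Travis summarizes can be checked directly. The sketch below uses modern numpy and the .base attribute (which records the array a view shares memory with); it is an illustration for later readers, not code from the thread.

```python
import numpy as np

a = np.arange(10)

# Simple (slice) indexing returns a view: the result shares memory
# with 'a' and records it as its base.
s = a[1:7:2]
assert s.base is a

# Advanced (index-array) indexing returns a copy: writing into the
# result leaves 'a' untouched.
f = a[np.array([1, 3, 5])]
f[:] = 99
assert list(a) == list(range(10))
```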
Travis Oliphant wrote:
Piotr Luszczek wrote:
The reason behind this, as Tim and Rick later illustrate, is that no one quite knows how to produce "view" semantics for an arbitrarily indexed array. I did actually think about this for a while, because I don't completely like the different behaviors, which can produce warts like the one that Tim showed us. I realized that it would be possible to extend our memory model of an array to include an index array in the array structure itself that would indicate how to advance through memory (basically a pointer or its equivalent for each array element). Rick pointed out that this would have to be a "copy" of the indices and so would be too slow. My reason for not going there was that it seemed to add too much complication to all the code for my tastes at the time.
FWIW, I implemented a pure-python ndarray, independent of numpy but with a similar interface. pyarray preserves views for arbitrary indexings by carrying around an index array as you suggest. The reason for mentioning it here is that looking at it might help someone think about the desired semantics of views.

pyarray: http://svn.brownspencer.com/pyarray/trunk/

Michael
participants (9)
-
Mads Ipsen
-
Michael Spencer
-
Norbert Nemec
-
Piotr Luszczek
-
Rick White
-
Robert Kern
-
Tim Hochberg
-
Travis Oliphant
-
Travis Oliphant