[Numpy-discussion] sparse array data
wkerzendorf at gmail.com
Sat May 5 23:23:59 EDT 2012
Thanks for all the suggestions. They seem already relatively involved, but I'll have a look at the table implementation. That seems to be the easiest of them all.
On 2012-05-03, at 9:39 AM, Charles R Harris wrote:
> On Thu, May 3, 2012 at 3:41 AM, Nathaniel Smith <njs at pobox.com> wrote:
> On Thu, May 3, 2012 at 4:44 AM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> > On Wed, May 2, 2012 at 3:20 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> This coordinate format is also what's used by the MATLAB Tensor
> >> Toolbox. They have a paper justifying this choice and describing some
> >> tricks for how to work with them:
> >> http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1
> >> (Spoiler: you use a lot of sort operations. Conveniently, timsort
> >> appears to be perfectly adapted for their algorithmic requirements.)
> > Timsort probably isn't the best choice here, it is optimized for python
> > lists of python objects where there is at least one level of indirection and
> > compares are expensive, even more expensive for compound objects. If the
> > coordinates are stored in numpy structured arrays an ordinary sort is likely
> > to be faster. Lexsort might even be faster as it could access aligned
> > integer data and not have to move lists of indexes around.
> To be clear, I don't mean Timsort-the-implementation, I mean
> Timsort-the-algorithm (which is now also the default sorting algorithm
> in Java). That said, it may well be optimized for expensive compares
> and something like a radix sort would be even better.
> Java uses Timsort for sorting object arrays (pointers) and dual pivot quicksort for sorting arrays of native types, ints and such. Timsort is very good for almost sorted arrays and the mixed algorithms are becoming popular, i.e., introsort and recent updates to the dual pivot sort that also look for runs. One of the reasons compares can be expensive for arrays of pointers to objects is that the objects can be located all over memory, which blows the cache.
> There are also a few mods to the numpy quicksort that might speed things up a bit more for common cases where there are a lot of repeated elements.
> In these sparse tensor algorithms, we often need to sort by one
> coordinate axis, and then sort by another (a "corner turn"). The
> reason Timsort seems appealing is that (1) it goes faster than O(n log
> n) when there is existing structure in the data being sorted, (2)
> because it's a stable sort, sorting on one axis then sorting on
> another will leave lots of that structure behind to be exploited
> later. So one can expect it to hit its happy case relatively often.
> Yes, that's why we have mergesort. An optimistic version making some use of runs might make it faster. We do have object arrays and no type specialized sort for them, so bringing Timsort in for those could be useful.
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the NumPy-Discussion