[Numpy-discussion] sparse array data

Charles R Harris charlesr.harris at gmail.com
Thu May 3 09:39:45 EDT 2012

On Thu, May 3, 2012 at 3:41 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Thu, May 3, 2012 at 4:44 AM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Wed, May 2, 2012 at 3:20 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> This coordinate format is also what's used by the MATLAB Tensor
> >> Toolbox. They have a paper justifying this choice and describing some
> >> tricks for how to work with them:
> >>  http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1
> >> (Spoiler: you use a lot of sort operations. Conveniently, timsort
> >> appears to be perfectly adapted for their algorithmic requirements.)
> >>
> >
> > Timsort probably isn't the best choice here, it is optimized for python
> > lists of python objects where there is at least one level of indirection
> and
> > compares are expensive, even more expensive for compound objects. If the
> > coordinates are stored in numpy structured arrays an ordinary sort is
> likely
> > to be faster. Lexsort might even be faster as it could access aligned
> > integer data and not have to move lists of indexes around.
> To be clear, I don't mean Timsort-the-implementation, I mean
> Timsort-the-algorithm (which is now also the default sorting algorithm
> in Java). That said, it may well be optimized for expensive compares
> and something like a radix sort would be even better.
Java uses Timsort for sorting object arrays (pointers) and dual pivot
quicksort for sorting arrays of native types, ints and such. Timsort is
very good for almost sorted arrays and the mixed algorithms are becoming
popular, i.e., introsort and recent updates to the dual pivot sort that
also look for runs. One of the reasons compares can be expensive for arrays
of pointers to objects is that the objects can be located all over memory,
which blows the cache.

There are also a few mods to the numpy quicksort that might speed things up
a bit more for common cases where there are a lot of repeated elements.

In these sparse tensor algorithms, we often need to sort by one
> coordinate axis, and then sort by another (a "corner turn"). The
> reason Timsort seems appealing is that (1) it goes faster than O(n log
> n) when there is existing structure in the data being sorted, (2)
> because it's a stable sort, sorting on one axis then sorting on
> another will leave lots of that structure behind to be exploited
> later. So one can expect it to hit its happy case relatively often.
Yes, that's why we have mergesort. An optimistic version making some use of
runs might make it faster. We do have object arrays and no type specialized
sort for them, so bringing Timsort in for those could be useful.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120503/33402ea1/attachment.html>

More information about the NumPy-Discussion mailing list