[Numpy-discussion] sparse array data

Nathaniel Smith njs at pobox.com
Thu May 3 05:41:28 EDT 2012


On Thu, May 3, 2012 at 4:44 AM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Wed, May 2, 2012 at 3:20 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> This coordinate format is also what's used by the MATLAB Tensor
>> Toolbox. They have a paper justifying this choice and describing some
>> tricks for how to work with them:
>>  http://epubs.siam.org/sisc/resource/1/sjoce3/v30/i1/p205_s1
>> (Spoiler: you use a lot of sort operations. Conveniently, timsort
>> appears to be perfectly adapted for their algorithmic requirements.)
>>
>
> Timsort probably isn't the best choice here, it is optimized for python
> lists of python objects where there is at least one level of indirection and
> compares are expensive, even more expensive for compound objects. If the
> coordinates are stored in numpy structured arrays an ordinary sort is likely
> to be faster. Lexsort might even be faster as it could access aligned
> integer data and not have to move lists of indexes around.

To be clear, I don't mean Timsort-the-implementation, I mean
Timsort-the-algorithm (which is now also the default sorting algorithm
in Java). That said, it may well be optimized for expensive compares
and something like a radix sort would be even better.

In these sparse tensor algorithms, we often need to sort by one
coordinate axis, and then sort by another (a "corner turn"). The
reason Timsort seems appealing is that (1) it goes faster than O(n log
n) when there is existing structure in the data being sorted, (2)
because it's a stable sort, sorting on one axis then sorting on
another will leave lots of that structure behind to be exploited
later. So one can expect it to hit its happy case relatively often.

-- Nathaniel



More information about the NumPy-Discussion mailing list