<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, Sep 3, 2014 at 4:07 AM, Jaime Fernández del Río <span dir="ltr"><<a href="mailto:jaime.frio@gmail.com" target="_blank">jaime.frio@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">

<div>On Tue, Sep 2, 2014 at 5:40 PM, Charles R Harris <span dir="ltr"><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid">


<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div><div><br></div></div></div><div>What do you think about the suggestion of timsort? One would need to concatenate the arrays before sorting, but it should be fairly efficient.<br>


</div></div></div></div></blockquote><div><br></div></div><div>Timsort is very cool, and it would definitely be fun to implement in numpy. It is also a lot more work that merging two sorted arrays! I think +1 if someone else does it, but although I would love to be part of it, I am not sure I will be able to find time to look into it seriously in the next couple of months.</div>


<div><br></div><div>From a setops point of view, merging two sorted arrays makes it very straightforward to compute, together with (or instead of) the result of the merge, index arrays that let you calculate things like `in1d` faster. Although perhaps an `argtimsort` could provide the same functionality, not sure. I will probably wrap up what I have, put a lace on it, and submit it as a PR. Even if it is not destined to be merged, it may serve as a warning to others.</div>


<div><br></div><div>I have also been thinking lately that one of the problems with all these unique-related computations may be a case of having a hammer and seeing everything as nails. Perhaps the algorithm that needs to be ported from Python is not the sorting one, but the hash table...</div>


</div><div><br>Jaime<br clear="all"><div><br></div></div></div></div></blockquote><div><br></div><div> Not sure about the hashing. Indeed one can also build an index of a set by means of a hash table, but its questionable if this leads to improved performance over performing an argsort. Hashing may have better asymptotic time complexity in theory, but many datasets used in practice are very easy to sort (O(N)-ish), and the time-constant of hashing is higher. But more importantly, using a hash-table guarantees poor cache behavior for many operations using this index. By contrast, sorting may (but need not) make one random access pass to build the index, and may (but need not) perform one random access to reorder values for grouping. But insofar as the keys are better behaved than pure random, this coherence will be exploited.<br>

<br>Also, getting the unique values/keys in sorted order is a nice side-benefit for many applications.<br></div></div></div></div>