[Numpy-discussion] String sort

Thu Feb 14 15:03:48 EST 2008

A Thursday 14 February 2008, Charles R Harris escrigué:
> On Thu, Feb 14, 2008 at 11:44 AM, Francesc Altet <faltet at carabos.com> 
wrote:
> > A Thursday 14 February 2008, Charles R Harris escrigué:
> > > On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet
> > > <faltet at carabos.com>
> >
> > wrote:
> > > > Looking forward to see the new qsort for strings in NumPy (the
> > > > specific version for merge sort is very welcome too!).
> > >
> > > I could never figure out what the merge sort was good for. I did
> > > the indirect version in numarray because I needed a stable sort
> > > to implement lexsort, which was my original aim. I just added the
> > > direct version for completeness. If you have a use for it, I
> > > would love to hear it.
> >
> > Well, I must to confess that I've not used merge sorts yet, but I'd
> > like to test them in the context of my PSI (Partially Sorted
> > Indexes, see [1] for a white paper on a concrete implementation)
> > work.  My hope is that, as a merge sort keeps the order of indices
> > of elements that are equal (this is what 'stable' means), this
> > would allow better compression rates for indices (and hence, less
> > I/O effort to bring the indices from disk into memory and
> > ultimately allowing for faster lookup speed).  This will probably
> > be only important when one have data distributions with rather low
> > cardinality, but these scenarios can be more frequent/important
> > than one may think.
>
> Well, I take that back a bit. I think mergesort might work best for
> large memory mapped arrays because it does sequential accesses, which
> might be more disk efficient than random accesses. Then again, a
> divide and conquer approach like quicksort eventually becomes
> localized too. I've never experimented with really large sorts, they
> might perform differently than the sorts that fit in memory.

Yeah, but I don't really want to use merge sort for out-of-core sorting, 
but just because it is stable.  The main point of a PSI indexing schema 
is that you don't need to completely sort your dataset (hence the 
name: "Partially Sorted") in order to get an usable index, and this 
normally leads to much faster index creation times.

> Insertion sort is supposed to work well for almost sorted sequences,
> but that application has always seemed a bit specialized to me.
> Although I'll admit to being occasionally tempted to pull the
> insertion sorts out of quicksort and mergesort and make them their
> own (type specific) routines.

Maybe I'd also be interested in trying insertion sort out.  During the 
optimization process of an OPSI index, there is a need to sort out a 
slice of data that is already made of smaller chunks that are already 
sorted, so chances are that insertion sort could be significantly 
faster than the merge sort (or even the quick sort) in this scenario.

But this is becoming an OT.  However, I'd be glad to further dicuss this 
privately, if you like to.

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"