[Numpy-discussion] String sort

Charles R Harris charlesr.harris at gmail.com
Thu Feb 14 15:23:39 EST 2008


On Thu, Feb 14, 2008 at 1:03 PM, Francesc Altet <faltet at carabos.com> wrote:

> On Thursday 14 February 2008, Charles R Harris wrote:
> > On Thu, Feb 14, 2008 at 11:44 AM, Francesc Altet <faltet at carabos.com>
> > wrote:
> > > On Thursday 14 February 2008, Charles R Harris wrote:
> > > > On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet
> > > > <faltet at carabos.com> wrote:
> > > > > Looking forward to seeing the new qsort for strings in NumPy (the
> > > > > specific version for merge sort is very welcome too!).
> > > >
> > > > I could never figure out what the merge sort was good for. I did
> > > > the indirect version in numarray because I needed a stable sort
> > > > to implement lexsort, which was my original aim. I just added the
> > > > direct version for completeness. If you have a use for it, I
> > > > would love to hear it.
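
A minimal sketch of that use (not from the original thread; the data is
made up): chaining two stable indirect sorts, least-significant key
first, reproduces what np.lexsort computes, which is why a stable sort
was the prerequisite.

    import numpy as np

    surnames = np.array(["Hertz", "Galilei", "Hertz"])
    first    = np.array(["Gustav", "Galileo", "Heinrich"])

    # Sort by the secondary key, then stably by the primary key;
    # stability keeps ties on the primary key in secondary-key order.
    order = np.argsort(first, kind="mergesort")
    order = order[np.argsort(surnames[order], kind="mergesort")]

    assert (order == np.lexsort((first, surnames))).all()
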
> > >
> > > Well, I must confess that I've not used merge sorts yet, but I'd
> > > like to test them in the context of my PSI (Partially Sorted
> > > Indexes, see [1] for a white paper on a concrete implementation)
> > > work.  My hope is that, as a merge sort keeps the order of indices
> > > of elements that are equal (this is what 'stable' means), this
> > > would allow better compression rates for indices (and hence, less
> > > I/O effort to bring the indices from disk into memory and
> > > ultimately allowing for faster lookup speed).  This will probably
> > > only be important when one has data distributions with rather low
> > > cardinality, but these scenarios can be more frequent/important
> > > than one may think.
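
A small illustration of that point (my own sketch, with made-up data):
with a stable sort, the indices of equal keys come out in their original,
increasing order, so for low-cardinality keys the index permutation is a
concatenation of ascending runs, which delta- or run-encodes well.

    import numpy as np

    keys = np.array([2, 0, 1, 0, 2, 1, 0, 2])    # cardinality 3
    idx = np.argsort(keys, kind="mergesort")      # stable
    print(idx)                                    # [1 3 6 2 5 0 4 7]
    # One ascending run of indices per key value; within each run the
    # deltas are small positive integers, which compresses nicely.
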
> >
> > Well, I take that back a bit. I think mergesort might work best for
> > large memory mapped arrays because it does sequential accesses, which
> > might be more disk efficient than random accesses. Then again, a
> > divide and conquer approach like quicksort eventually becomes
> > localized too. I've never experimented with really large sorts; they
> > might perform differently from sorts that fit in memory.
>
> Yeah, but I don't really want merge sort for out-of-core sorting; I want
> it just because it is stable.  The main point of a PSI indexing scheme
> is that you don't need to completely sort your dataset (hence the
> name: "Partially Sorted") in order to get a usable index, and this
> normally leads to much faster index creation times.
>
> > Insertion sort is supposed to work well for almost sorted sequences,
> > but that application has always seemed a bit specialized to me.
> > Although I'll admit to being occasionally tempted to pull the
> > insertion sorts out of quicksort and mergesort and make them their
> > own (type specific) routines.
>
> Maybe I'd also be interested in trying insertion sort out.  During the
> optimization process of an OPSI index, there is a need to sort a slice
> of data that is made of smaller chunks which are already sorted, so
> chances are that insertion sort could be significantly faster than
> merge sort (or even quicksort) in this scenario.
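
For what it's worth, a rough sketch of why insertion sort is cheap on
nearly-sorted input (plain Python for illustration, not NumPy's internal
routine): its cost is roughly n plus the number of inversions, so when
elements sit close to their final positions the inner loop barely runs.

    def insertion_sort(a):
        for i in range(1, len(a)):
            x = a[i]
            j = i - 1
            # Shift larger elements right; on almost-sorted data this
            # loop usually exits after zero or one iteration.
            while j >= 0 and a[j] > x:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = x
        return a

    print(insertion_sort([1, 2, 4, 3, 5, 7, 6, 8]))  # few inversions
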
>
> But this is getting off-topic.  However, I'd be glad to discuss this
> further privately, if you'd like.
>

Well, I don't have much more to say. If you do decide that insertion sort
will be useful, you won't have to twist my arm much to get it, but I think it
is most useful when the data never has to move far. In the case of quicksort
and mergesort it is called to deal with small unsorted chunks, but the
chunks themselves are already in the right place. Some kind of multi-way
merge sort might be more appropriate for the OPSI index.
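
Something along these lines, say (a sketch only; the chunk boundaries
below are assumed, not OPSI's actual layout): if the slice is already a
concatenation of sorted chunks, a k-way merge visits each element once
instead of re-sorting the whole slice.

    import heapq
    import numpy as np

    data = np.array([1, 4, 9, 2, 3, 8, 0, 5, 7])   # three sorted chunks
    chunks = [data[0:3], data[3:6], data[6:9]]      # assumed boundaries

    merged = np.fromiter(heapq.merge(*chunks), dtype=data.dtype,
                         count=len(data))
    print(merged)                                   # [0 1 2 3 4 5 7 8 9]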

Chuck