A Wednesday 13 February 2008, Charles R Harris escrigué:
On Feb 13, 2008 10:56 AM, Francesc Altet <faltet@carabos.com> wrote:
Be warned, I'd like to stress out that these are my figures for my _own laptop_. It would be nice if you can verify all of this with other achitectures (your Core2 machine seems different enough). I can run the benchmarks on Windows (installed in the same laptop) too. Tell me if you are interested on me doing this.
Its easy enough to test if you compile from svn, just add your new copy function and change the name in this line:
#copy=copy_string, copy_ucs4#
to use your function instead of copy_string.
I've spoken too fast. I've never compiled NumPy on Windows before, and had problems when trying to compile it using MSVC 7.1 and a recent copy of the repository. Well, in any case, I've exercised the Opteron platform, using gcc 4.1.3 (i.e. the one that can optimize newqsort correctly), and this brings new light to our study. From the plot (attached), it can be drawn the next conclusions: 1) copy_string2 (the combination of manual copy and memcpy) is not better than memcpy for *any* value of the string length in our Opteron platform. Also, the improvements with Pentium4 was not that big (<20%). In consequence, I'd choose to always use memcpy and discard copy_string2. 2) Curiously enough, the indirect sort in Opterons is *always* faster than newqsort+memcpy. For large values of string lengths (> 256), the speed-up can be up to 3x, which is a lot. And I'd say that this keeps true also for most modern processors (read Core2, Barcelona). For older processors (Pentium4), the indirect method can be slower than direct plot for small lengths, but by a very few extent (<10%). Conclusion 2 makes me wonder if it wouldn't be useful the introduction of a new flag in sort, like: """ `indirect` - Use the indirect method for sorting. This requires more memory (4/8 bytes per array element), but for sorting arrays of strings it is almost always faster than the direct approach (default). Beware: this is not the case when using numerical values, where the use of this method for sorting is not recommendable. """ Agreed, that could introduce some confusion, but as this flag would be 'False' by default, not many people should bother about its existence, and can definitely help to people who cares about sorting string performance. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
------------------------------------------------------- --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Thu, Feb 14, 2008 at 9:11 AM, Francesc Altet <faltet@carabos.com> wrote:
A Wednesday 13 February 2008, Charles R Harris escrigué:
On Feb 13, 2008 10:56 AM, Francesc Altet <faltet@carabos.com> wrote:
Be warned, I'd like to stress out that these are my figures for my _own laptop_. It would be nice if you can verify all of this with other achitectures (your Core2 machine seems different enough). I can run the benchmarks on Windows (installed in the same laptop) too. Tell me if you are interested on me doing this.
Its easy enough to test if you compile from svn, just add your new copy function and change the name in this line:
#copy=copy_string, copy_ucs4#
to use your function instead of copy_string.
I've spoken too fast. I've never compiled NumPy on Windows before, and had problems when trying to compile it using MSVC 7.1 and a recent copy of the repository. Well, in any case, I've exercised the Opteron platform, using gcc 4.1.3 (i.e. the one that can optimize newqsort correctly), and this brings new light to our study.
From the plot (attached), it can be drawn the next conclusions:
1) copy_string2 (the combination of manual copy and memcpy) is not better than memcpy for *any* value of the string length in our Opteron platform. Also, the improvements with Pentium4 was not that big (<20%). In consequence, I'd choose to always use memcpy and discard copy_string2.
Your copy_string2 is now the version in numpy. I'm hesitant to make memcpy the default for all string lengths because I've read that memcpy was much improved in later gcc (>= 4.1 ?), but known slow in older versions. So perhaps in a year or two when the newer compilers are more widespread would be a better time to make the change. The switch at the 16 char length shouldn't make that much difference in practice. I'll put a comment in the source so that the thought won't get lost. BTW, using copy_string2 much improved the performance of the string mergesort where a lot of data needs to be copied to the work array. It's now half as fast as quicksort instead of 1/3 ;) Heap sort continues in it's traditional slot as the slowest of all. Slow but safe.
2) Curiously enough, the indirect sort in Opterons is *always* faster than newqsort+memcpy. For large values of string lengths (> 256), the speed-up can be up to 3x, which is a lot. And I'd say that this keeps true also for most modern processors (read Core2, Barcelona). For older processors (Pentium4), the indirect method can be slower than direct plot for small lengths, but by a very few extent (<10%).
The new indirect quicksort for strings is faster than the old qsort based default, so perhaps that is also making a difference.
Conclusion 2 makes me wonder if it wouldn't be useful the introduction of a new flag in sort, like:
""" `indirect` - Use the indirect method for sorting. This requires more memory (4/8 bytes per array element), but for sorting arrays of strings it is almost always faster than the direct approach (default). Beware: this is not the case when using numerical values, where the use of this method for sorting is not recommendable. """
I'm more inclined to leave this to the user. I have a todo to add a function to numpy that makes it easier to use the argsort output to sort multidimensional arrays, I'll name it argtake or some such and it will use the argsort output along with an axis argument. It won't be quite as memory efficient for multidimensional arrays, but it should work about the same in the 1D case. Chuck
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 9:11 AM, Francesc Altet <faltet@carabos.com> wrote:
From the plot (attached), it can be drawn the next conclusions:
1) copy_string2 (the combination of manual copy and memcpy) is not better than memcpy for *any* value of the string length in our Opteron platform. Also, the improvements with Pentium4 was not that big (<20%). In consequence, I'd choose to always use memcpy and discard copy_string2.
Your copy_string2 is now the version in numpy. I'm hesitant to make memcpy the default for all string lengths because I've read that memcpy was much improved in later gcc (>= 4.1 ?), but known slow in older versions. So perhaps in a year or two when the newer compilers are more widespread would be a better time to make the change. The switch at the 16 char length shouldn't make that much difference in practice. I'll put a comment in the source so that the thought won't get lost.
Well, copy_string2 is only marginally slower than memcpy on modern gcc compilers and processors, so I presume that this is fine.
BTW, using copy_string2 much improved the performance of the string mergesort where a lot of data needs to be copied to the work array. It's now half as fast as quicksort instead of 1/3 ;) Heap sort continues in it's traditional slot as the slowest of all. Slow but safe.
I've seen that you have added specific code for merge and heap sorting algorithms for strings. Looks good.
2) Curiously enough, the indirect sort in Opterons is *always* faster than newqsort+memcpy. For large values of string lengths (> 256), the speed-up can be up to 3x, which is a lot. And I'd say that this keeps true also for most modern processors (read Core2, Barcelona). For older processors (Pentium4), the indirect method can be slower than direct plot for small lengths, but by a very few extent (<10%).
The new indirect quicksort for strings is faster than the old qsort based default, so perhaps that is also making a difference.
Yes, indeed it does!
Conclusion 2 makes me wonder if it wouldn't be useful the introduction of a new flag in sort, like:
""" `indirect` - Use the indirect method for sorting. This requires more memory (4/8 bytes per array element), but for sorting arrays of strings it is almost always faster than the direct approach (default). Beware: this is not the case when using numerical values, where the use of this method for sorting is not recommendable. """
I'm more inclined to leave this to the user. I have a todo to add a function to numpy that makes it easier to use the argsort output to sort multidimensional arrays, I'll name it argtake or some such and it will use the argsort output along with an axis argument. It won't be quite as memory efficient for multidimensional arrays, but it should work about the same in the 1D case.
OK. I don't completely grasp what are you trying to do here, but seems like a conservative enough path (in the sense that it won't touch the current parameters of existing sorting methods). Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!). Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet <faltet@carabos.com> wrote:
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 9:11 AM, Francesc Altet <faltet@carabos.com> wrote:
From the plot (attached), it can be drawn the next conclusions:
1) copy_string2 (the combination of manual copy and memcpy) is not better than memcpy for *any* value of the string length in our Opteron platform. Also, the improvements with Pentium4 was not that big (<20%). In consequence, I'd choose to always use memcpy and discard copy_string2.
Your copy_string2 is now the version in numpy. I'm hesitant to make memcpy the default for all string lengths because I've read that memcpy was much improved in later gcc (>= 4.1 ?), but known slow in older versions. So perhaps in a year or two when the newer compilers are more widespread would be a better time to make the change. The switch at the 16 char length shouldn't make that much difference in practice. I'll put a comment in the source so that the thought won't get lost.
Well, copy_string2 is only marginally slower than memcpy on modern gcc compilers and processors, so I presume that this is fine.
BTW, using copy_string2 much improved the performance of the string mergesort where a lot of data needs to be copied to the work array. It's now half as fast as quicksort instead of 1/3 ;) Heap sort continues in it's traditional slot as the slowest of all. Slow but safe.
I've seen that you have added specific code for merge and heap sorting algorithms for strings. Looks good.
2) Curiously enough, the indirect sort in Opterons is *always* faster than newqsort+memcpy. For large values of string lengths (> 256), the speed-up can be up to 3x, which is a lot. And I'd say that this keeps true also for most modern processors (read Core2, Barcelona). For older processors (Pentium4), the indirect method can be slower than direct plot for small lengths, but by a very few extent (<10%).
The new indirect quicksort for strings is faster than the old qsort based default, so perhaps that is also making a difference.
Yes, indeed it does!
Conclusion 2 makes me wonder if it wouldn't be useful the introduction of a new flag in sort, like:
""" `indirect` - Use the indirect method for sorting. This requires more memory (4/8 bytes per array element), but for sorting arrays of strings it is almost always faster than the direct approach (default). Beware: this is not the case when using numerical values, where the use of this method for sorting is not recommendable. """
I'm more inclined to leave this to the user. I have a todo to add a function to numpy that makes it easier to use the argsort output to sort multidimensional arrays, I'll name it argtake or some such and it will use the argsort output along with an axis argument. It won't be quite as memory efficient for multidimensional arrays, but it should work about the same in the 1D case.
OK. I don't completely grasp what are you trying to do here, but seems like a conservative enough path (in the sense that it won't touch the current parameters of existing sorting methods).
Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!).
I could never figure out what the merge sort was good for. I did the indirect version in numarray because I needed a stable sort to implement lexsort, which was my original aim. I just added the direct version for completeness. If you have a use for it, I would love to hear it. Chuck
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet <faltet@carabos.com> wrote:
Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!).
I could never figure out what the merge sort was good for. I did the indirect version in numarray because I needed a stable sort to implement lexsort, which was my original aim. I just added the direct version for completeness. If you have a use for it, I would love to hear it.
Well, I must to confess that I've not used merge sorts yet, but I'd like to test them in the context of my PSI (Partially Sorted Indexes, see [1] for a white paper on a concrete implementation) work. My hope is that, as a merge sort keeps the order of indices of elements that are equal (this is what 'stable' means), this would allow better compression rates for indices (and hence, less I/O effort to bring the indices from disk into memory and ultimately allowing for faster lookup speed). This will probably be only important when one have data distributions with rather low cardinality, but these scenarios can be more frequent/important than one may think. [1] http://www.carabos.com/docs/OPSI-indexes.pdf --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Thu, Feb 14, 2008 at 11:44 AM, Francesc Altet <faltet@carabos.com> wrote:
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet <faltet@carabos.com> wrote:
Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!).
I could never figure out what the merge sort was good for. I did the indirect version in numarray because I needed a stable sort to implement lexsort, which was my original aim. I just added the direct version for completeness. If you have a use for it, I would love to hear it.
Well, I must to confess that I've not used merge sorts yet, but I'd like to test them in the context of my PSI (Partially Sorted Indexes, see [1] for a white paper on a concrete implementation) work. My hope is that, as a merge sort keeps the order of indices of elements that are equal (this is what 'stable' means), this would allow better compression rates for indices (and hence, less I/O effort to bring the indices from disk into memory and ultimately allowing for faster lookup speed). This will probably be only important when one have data distributions with rather low cardinality, but these scenarios can be more frequent/important than one may think.
Well, I take that back a bit. I think mergesort might work best for large memory mapped arrays because it does sequential accesses, which might be more disk efficient than random accesses. Then again, a divide and conquer approach like quicksort eventually becomes localized too. I've never experimented with really large sorts, they might perform differently than the sorts that fit in memory. Insertion sort is supposed to work well for almost sorted sequences, but that application has always seemed a bit specialized to me. Although I'll admit to being occasionally tempted to pull the insertion sorts out of quicksort and mergesort and make them their own (type specific) routines. Chuck
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 11:44 AM, Francesc Altet <faltet@carabos.com> wrote:
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet <faltet@carabos.com>
wrote:
Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!).
I could never figure out what the merge sort was good for. I did the indirect version in numarray because I needed a stable sort to implement lexsort, which was my original aim. I just added the direct version for completeness. If you have a use for it, I would love to hear it.
Well, I must to confess that I've not used merge sorts yet, but I'd like to test them in the context of my PSI (Partially Sorted Indexes, see [1] for a white paper on a concrete implementation) work. My hope is that, as a merge sort keeps the order of indices of elements that are equal (this is what 'stable' means), this would allow better compression rates for indices (and hence, less I/O effort to bring the indices from disk into memory and ultimately allowing for faster lookup speed). This will probably be only important when one have data distributions with rather low cardinality, but these scenarios can be more frequent/important than one may think.
Well, I take that back a bit. I think mergesort might work best for large memory mapped arrays because it does sequential accesses, which might be more disk efficient than random accesses. Then again, a divide and conquer approach like quicksort eventually becomes localized too. I've never experimented with really large sorts, they might perform differently than the sorts that fit in memory.
Yeah, but I don't really want to use merge sort for out-of-core sorting, but just because it is stable. The main point of a PSI indexing schema is that you don't need to completely sort your dataset (hence the name: "Partially Sorted") in order to get an usable index, and this normally leads to much faster index creation times.
Insertion sort is supposed to work well for almost sorted sequences, but that application has always seemed a bit specialized to me. Although I'll admit to being occasionally tempted to pull the insertion sorts out of quicksort and mergesort and make them their own (type specific) routines.
Maybe I'd also be interested in trying insertion sort out. During the optimization process of an OPSI index, there is a need to sort out a slice of data that is already made of smaller chunks that are already sorted, so chances are that insertion sort could be significantly faster than the merge sort (or even the quick sort) in this scenario. But this is becoming an OT. However, I'd be glad to further dicuss this privately, if you like to. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Thu, Feb 14, 2008 at 1:03 PM, Francesc Altet <faltet@carabos.com> wrote:
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 11:44 AM, Francesc Altet <faltet@carabos.com> wrote:
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 10:46 AM, Francesc Altet <faltet@carabos.com>
wrote:
Looking forward to see the new qsort for strings in NumPy (the specific version for merge sort is very welcome too!).
I could never figure out what the merge sort was good for. I did the indirect version in numarray because I needed a stable sort to implement lexsort, which was my original aim. I just added the direct version for completeness. If you have a use for it, I would love to hear it.
Well, I must to confess that I've not used merge sorts yet, but I'd like to test them in the context of my PSI (Partially Sorted Indexes, see [1] for a white paper on a concrete implementation) work. My hope is that, as a merge sort keeps the order of indices of elements that are equal (this is what 'stable' means), this would allow better compression rates for indices (and hence, less I/O effort to bring the indices from disk into memory and ultimately allowing for faster lookup speed). This will probably be only important when one have data distributions with rather low cardinality, but these scenarios can be more frequent/important than one may think.
Well, I take that back a bit. I think mergesort might work best for large memory mapped arrays because it does sequential accesses, which might be more disk efficient than random accesses. Then again, a divide and conquer approach like quicksort eventually becomes localized too. I've never experimented with really large sorts, they might perform differently than the sorts that fit in memory.
Yeah, but I don't really want to use merge sort for out-of-core sorting, but just because it is stable. The main point of a PSI indexing schema is that you don't need to completely sort your dataset (hence the name: "Partially Sorted") in order to get an usable index, and this normally leads to much faster index creation times.
Insertion sort is supposed to work well for almost sorted sequences, but that application has always seemed a bit specialized to me. Although I'll admit to being occasionally tempted to pull the insertion sorts out of quicksort and mergesort and make them their own (type specific) routines.
Maybe I'd also be interested in trying insertion sort out. During the optimization process of an OPSI index, there is a need to sort out a slice of data that is already made of smaller chunks that are already sorted, so chances are that insertion sort could be significantly faster than the merge sort (or even the quick sort) in this scenario.
But this is becoming an OT. However, I'd be glad to further dicuss this privately, if you like to.
Well, I don't have much more to say. If you do decide that insertion sort will be useful you won't have to twist my arm much to get it, but I think it is most useful when the data never has to move far. In the case of quicksort and mergesort it is called to deal with small unsorted chunks, but the chunks themselves are already in the right place. Some kind of multi merge sort might be more appropriate to the OPSI index. Chuck
A Thursday 14 February 2008, Charles R Harris escrigué:
On Thu, Feb 14, 2008 at 1:03 PM, Francesc Altet <faltet@carabos.com> wrote:
Maybe I'd also be interested in trying insertion sort out. During the optimization process of an OPSI index, there is a need to sort out a slice of data that is already made of smaller chunks that are already sorted, so chances are that insertion sort could be significantly faster than the merge sort (or even the quick sort) in this scenario.
But this is becoming an OT. However, I'd be glad to further dicuss this privately, if you like to.
Well, I don't have much more to say. If you do decide that insertion sort will be useful you won't have to twist my arm much to get it, but I think it is most useful when the data never has to move far. In the case of quicksort and mergesort it is called to deal with small unsorted chunks, but the chunks themselves are already in the right place. Some kind of multi merge sort might be more appropriate to the OPSI index.
OK, thanks, I'll have you in mind ;) And yes, multi-way merge seems very interesting for OPSI indeed. Eventually, it might even be a good candidate for taking advantage of the several cores in modern CPU's; so it would be really interesting to check out this path. Thanks for very insightful hints! --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Hi Chuck, I've given more testing to the new quicksort routines for strings in the forthcoming NumPy. I've run the indexing test units in PyTables Pro (they stress the sorting routines a lot) against the current version of NumPy in the repository, for the complete set of quicksort, mergesort and heapsort of the new implementation, and I'm happy to say to everything went very smoothly, i.e. more than 1000 tests with different input arrays has passed flawlessly. Good job! Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Fri, Feb 15, 2008 at 5:09 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi Chuck,
I've given more testing to the new quicksort routines for strings in the forthcoming NumPy. I've run the indexing test units in PyTables Pro (they stress the sorting routines a lot) against the current version of NumPy in the repository, for the complete set of quicksort, mergesort and heapsort of the new implementation, and I'm happy to say to everything went very smoothly, i.e. more than 1000 tests with different input arrays has passed flawlessly. Good job!
Hi Francesc, Thanks for the thorough testing. It makes me feel much more comfortable this close to the release of 1.0.5. Chuck
A Friday 15 February 2008, Charles R Harris escrigué:
On Fri, Feb 15, 2008 at 5:09 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi Chuck,
I've given more testing to the new quicksort routines for strings in the forthcoming NumPy. I've run the indexing test units in PyTables Pro (they stress the sorting routines a lot) against the current version of NumPy in the repository, for the complete set of quicksort, mergesort and heapsort of the new implementation, and I'm happy to say to everything went very smoothly, i.e. more than 1000 tests with different input arrays has passed flawlessly. Good job!
Hi Francesc,
Thanks for the thorough testing. It makes me feel much more comfortable this close to the release of 1.0.5.
You are welcome. I forgot to mention that I tested out both direct (.sort()) and indirect (.argsort()) methods. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
participants (2)
-
Charles R Harris -
Francesc Altet