Hi, I'm a bit confused that the sort method of a string character doesn't allow a mergesort:
s = numpy.empty(10, "S10") s.sort(kind="merge") TypeError: desired sort not supported for this type
However, by looking at the numpy sources, it seems that the only implemented method for sorting array strings is "merge" (I presume because it is stable). So, perhaps the message above should be fixed. Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi,
I'm a bit confused that the sort method of a string character doesn't allow a mergesort:
s = numpy.empty(10, "S10") s.sort(kind="merge") TypeError: desired sort not supported for this type
I think it's an error parsing the keyword. In fact, I thought I fixed that, but maybe I was waiting till I added the other methods.
However, by looking at the numpy sources, it seems that the only implemented method for sorting array strings is "merge" (I presume because it is stable). So, perhaps the message above should be fixed.
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend. Chuck
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
A Friday 08 February 2008, Francesc Altet escrigué:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
Ops. I've introduced a last-minute problem in my code. To fix this, just replace the flawed opt_strncmp() that I sent before by: static int inline opt_strncmp(char *a, char *b, int n) { int i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); /* Another way, but seems equivalent in speed, at least here */ /* if (a[i] != b[i]) */ /* return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); */ } return 0; } Apparently, this version works just fine. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 8, 2008 10:31 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Francesc Altet escrigué:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
Ops. I've introduced a last-minute problem in my code. To fix this, just replace the flawed opt_strncmp() that I sent before by:
static int inline opt_strncmp(char *a, char *b, int n) { int i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); /* Another way, but seems equivalent in speed, at least here */ /* if (a[i] != b[i]) */ /* return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); */ } return 0; }
Apparently, this version works just fine.
Did you find this significantly faster than strncmp? There is also a unicode compare, do you have thoughts about that? Chuck
On Feb 8, 2008 11:07 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Feb 8, 2008 10:31 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Francesc Altet escrigué:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
Ops. I've introduced a last-minute problem in my code. To fix this, just replace the flawed opt_strncmp() that I sent before by:
static int inline opt_strncmp(char *a, char *b, int n) { int i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); /* Another way, but seems equivalent in speed, at least here */ /* if (a[i] != b[i]) */ /* return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); */ } return 0; }
Apparently, this version works just fine.
Did you find this significantly faster than strncmp? There is also a unicode compare, do you have thoughts about that?
Does anyone know if inline is compiler safe/portable these days. There are definitely places where I would like to use it, but I've been hesitant. Chuck
A Friday 08 February 2008, Charles R Harris escrigué:
On Feb 8, 2008 10:31 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Francesc Altet escrigué:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
Ops. I've introduced a last-minute problem in my code. To fix this, just replace the flawed opt_strncmp() that I sent before by:
static int inline opt_strncmp(char *a, char *b, int n) { int i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); /* Another way, but seems equivalent in speed, at least here */ /* if (a[i] != b[i]) */ /* return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); */ } return 0; }
Apparently, this version works just fine.
Did you find this significantly faster than strncmp? There is also a unicode compare, do you have thoughts about that?
Well, for the unicode case it wouldn't be enough by replacing 'char' by 'Py_ArrayUCS4'? Maybe this afternoon I can do some benchmarking too in this regard. --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 9, 2008 6:47 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Charles R Harris escrigué:
On Feb 8, 2008 10:31 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Francesc Altet escrigué:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
Ops. I've introduced a last-minute problem in my code. To fix this, just replace the flawed opt_strncmp() that I sent before by:
static int inline opt_strncmp(char *a, char *b, int n) { int i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); /* Another way, but seems equivalent in speed, at least here */ /* if (a[i] != b[i]) */ /* return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); */ } return 0; }
Apparently, this version works just fine.
Did you find this significantly faster than strncmp? There is also a unicode compare, do you have thoughts about that?
Well, for the unicode case it wouldn't be enough by replacing 'char' by 'Py_ArrayUCS4'? Maybe this afternoon I can do some benchmarking too in this regard.
Looks like that for Numpy. The problem I was thinking about is that for wide characters Windows C defaults to UTF16 while the Unixes default to UTF32. The C99 standard didn't specify the exact length, but Numpy seems to use (or assume) UTF32. Anyway, after doing some work to fool the optimizer and subtracting loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs 16e-9 seconds to compare strings of length 10. I've attached the program. Note that on my machine malloc appears to return zeroed memory, so the string compares always go to the end. Chuck
A Saturday 09 February 2008, Charles R Harris escrigué:
Well, for the unicode case it wouldn't be enough by replacing 'char' by 'Py_ArrayUCS4'? Maybe this afternoon I can do some benchmarking too in this regard.
Looks like that for Numpy. The problem I was thinking about is that for wide characters Windows C defaults to UTF16 while the Unixes default to UTF32.
If it were so simple ;-) The fact is that the Python crew is delivering the tarballs ready to compile with the UCS2 as default, and this applies to both UNIX and Windows. However, some Linux distributions (most in particular, Debian and derivatives), has chosen to make UCS4 the default in their Python packages. This is not a (big) problem in itself, but when it comes to writing arrays on disk and hope for portability (not only with different platforms, but also with different UCS python interpreter in the same machine!), we realized that this was a real problem (see discussion in [1]). So, NumPy had to make a decision in that regard, and Travis finally opted to only give support for the UCS4 charset in NumPy [2]. Also, he opened the door to possible UCS2 implementations in NumPy in the future, but that would be a real pain, IMHO. [1]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.ht... [2]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.ht... So, at least for the time being, you only have to worry about UCS4.
The C99 standard didn't specify the exact length, but Numpy seems to use (or assume) UTF32.
Well, I should say that UTF32 and UCS4 are names referring to the same thing, but most literature (and specially package configuration procedures) talks about UCS4.
Anyway, after doing some work to fool the optimizer and subtracting loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs 16e-9 seconds to compare strings of length 10. I've attached the program. Note that on my machine malloc appears to return zeroed memory, so the string compares always go to the end.
I've seen the benchmark, and the problem is that C strncmp stops checking when it finds a \0 in the first string, while strncmp1 have to check the complete set of chars in strings. However, you won't really want to do C string comparisons with NumPy strings: In [35]: ns1 = numpy.array("as\0as") In [36]: ns2 = numpy.array("as\0bs") In [37]: ns1 == ns2 Out[37]: array(False, dtype=bool) In [38]: ns1 < ns2 Out[38]: array(True, dtype=bool) or, with Python strings, in general: In [39]: ns1 = "as\0as" In [40]: ns2 = "as\0bs" In [41]: ns1 == ns2 Out[41]: False In [42]: ns1 < ns2 Out[42]: True As you see, Python/NumPy strings are different beasts than C strings in that regard. The strings in the latter always end with a \0 (NULL) character, while in Python/NumPy the end is defined by a length property (btw, the same than in Pascal, if you know it). So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 9, 2008 11:50 AM, Francesc Altet <faltet@carabos.com> wrote:
A Saturday 09 February 2008, Charles R Harris escrigué:
Well, for the unicode case it wouldn't be enough by replacing 'char' by 'Py_ArrayUCS4'? Maybe this afternoon I can do some benchmarking too in this regard.
Looks like that for Numpy. The problem I was thinking about is that for wide characters Windows C defaults to UTF16 while the Unixes default to UTF32.
If it were so simple ;-) The fact is that the Python crew is delivering the tarballs ready to compile with the UCS2 as default, and this applies to both UNIX and Windows. However, some Linux distributions (most in particular, Debian and derivatives), has chosen to make UCS4 the default in their Python packages.
This is not a (big) problem in itself, but when it comes to writing arrays on disk and hope for portability (not only with different platforms, but also with different UCS python interpreter in the same machine!), we realized that this was a real problem (see discussion in [1]). So, NumPy had to make a decision in that regard, and Travis finally opted to only give support for the UCS4 charset in NumPy [2]. Also, he opened the door to possible UCS2 implementations in NumPy in the future, but that would be a real pain, IMHO.
[1]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.ht...
[2]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.ht...
So, at least for the time being, you only have to worry about UCS4.
The C99 standard didn't specify the exact length, but Numpy seems to use (or assume) UTF32.
Well, I should say that UTF32 and UCS4 are names referring to the same thing, but most literature (and specially package configuration procedures) talks about UCS4.
Anyway, after doing some work to fool the optimizer and subtracting loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs 16e-9 seconds to compare strings of length 10. I've attached the program. Note that on my machine malloc appears to return zeroed memory, so the string compares always go to the end.
I've seen the benchmark, and the problem is that C strncmp stops checking when it finds a \0 in the first string, while strncmp1 have to check the complete set of chars in strings. However, you won't really want to do C string comparisons with NumPy strings:
In [35]: ns1 = numpy.array("as\0as")
In [36]: ns2 = numpy.array("as\0bs")
In [37]: ns1 == ns2 Out[37]: array(False, dtype=bool)
In [38]: ns1 < ns2 Out[38]: array(True, dtype=bool)
or, with Python strings, in general:
In [39]: ns1 = "as\0as"
In [40]: ns2 = "as\0bs"
In [41]: ns1 == ns2 Out[41]: False
In [42]: ns1 < ns2 Out[42]: True
As you see, Python/NumPy strings are different beasts than C strings in that regard. The strings in the latter always end with a \0 (NULL) character, while in Python/NumPy the end is defined by a length property (btw, the same than in Pascal, if you know it).
So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings.
Ah, in that case the current indirect sort for NumPy strings, which uses strncmp, is incorrect and needs to be fixed. It seems that strings with zeros are not part of the current test series ;) Chuck
A Saturday 09 February 2008, Charles R Harris escrigué:
So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings.
Ah, in that case the current indirect sort for NumPy strings, which uses strncmp, is incorrect and needs to be fixed. It seems that strings with zeros are not part of the current test series ;)
Yeah, that's right. And yes, it would be advisable to have at least a couple of tests having zeros interspersed throughout the string. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 9, 2008 2:07 PM, Francesc Altet <faltet@carabos.com> wrote:
A Saturday 09 February 2008, Charles R Harris escrigué:
So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings.
Ah, in that case the current indirect sort for NumPy strings, which uses strncmp, is incorrect and needs to be fixed. It seems that strings with zeros are not part of the current test series ;)
Yeah, that's right. And yes, it would be advisable to have at least a couple of tests having zeros interspersed throughout the string.
Like this should do: In [5]: argsort(fromstring("\0\2\0\1", dtype="|S2")) Out[5]: array([0, 1]) Chuck
A Saturday 09 February 2008, Charles R Harris escrigué:
On Feb 9, 2008 2:07 PM, Francesc Altet <faltet@carabos.com> wrote:
A Saturday 09 February 2008, Charles R Harris escrigué:
So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings.
Ah, in that case the current indirect sort for NumPy strings, which uses strncmp, is incorrect and needs to be fixed. It seems that strings with zeros are not part of the current test series ;)
Yeah, that's right. And yes, it would be advisable to have at least a couple of tests having zeros interspersed throughout the string.
Like this should do:
In [5]: argsort(fromstring("\0\2\0\1", dtype="|S2")) Out[5]: array([0, 1])
Exactly, but I understand that the correct result should be: array([1, 0]) ;-) Something a bit more complex, like: In [5]: argsort(fromstring("a\0b\0\0\2a\0b\0\0\1", dtype="|S6")) Out[5]: array([1, 0]) wouldn't hurt neither. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Chuck, One more thing on this. I've been doing some benchmarking with my opt_memcpy() macro in the quicksort_string function, and I should say that while it is definitely more efficient than my system memcpy for small values of n (the number of bytes to copy), this doesn't keep true for all values of n. For example, for n<16, opt_memcpy() can be more than 4x faster than system memcpy (and this is why I naively thought that it would be faster in general). However, for n>80, memcpy beats opt_memcpy between a 25% and 100% (depending on whether n is divisible by 2, 4 or 8). This is on my Linux system (Ubuntu 7.10), but perhaps with Windows the behaviour can be different. I think I would be able to come up with a routine that can offer a balance between opt_memcpy and system memcpy, but that should take some time. So, until I (or anybody else) do more research on this, I think it would be safer if you use system memcpy for string sorting in NumPy. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 9, 2008 2:29 PM, Francesc Altet <faltet@carabos.com> wrote:
Chuck,
One more thing on this. I've been doing some benchmarking with my opt_memcpy() macro in the quicksort_string function, and I should say that while it is definitely more efficient than my system memcpy for small values of n (the number of bytes to copy), this doesn't keep true for all values of n. For example, for n<16, opt_memcpy() can be more than 4x faster than system memcpy (and this is why I naively thought that it would be faster in general). However, for n>80, memcpy beats opt_memcpy between a 25% and 100% (depending on whether n is divisible by 2, 4 or 8). This is on my Linux system (Ubuntu 7.10), but perhaps with Windows the behaviour can be different.
I think I would be able to come up with a routine that can offer a balance between opt_memcpy and system memcpy, but that should take some time. So, until I (or anybody else) do more research on this, I think it would be safer if you use system memcpy for string sorting in NumPy.
The memcpy in newer compilers is actually pretty good. For integers and such it sometime compiles inline using integer assignments, but I was loath to make it the default implementation until >= 4.1.x gcc became more common. However, strings might be a good place to use it. Chuck
On Feb 9, 2008 2:42 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Feb 9, 2008 2:29 PM, Francesc Altet <faltet@carabos.com> wrote:
Chuck,
One more thing on this. I've been doing some benchmarking with my opt_memcpy() macro in the quicksort_string function, and I should say that while it is definitely more efficient than my system memcpy for small values of n (the number of bytes to copy), this doesn't keep true for all values of n. For example, for n<16, opt_memcpy() can be more than 4x faster than system memcpy (and this is why I naively thought that it would be faster in general). However, for n>80, memcpy beats opt_memcpy between a 25% and 100% (depending on whether n is divisible by 2, 4 or 8). This is on my Linux system (Ubuntu 7.10), but perhaps with Windows the behaviour can be different.
I think I would be able to come up with a routine that can offer a balance between opt_memcpy and system memcpy, but that should take some time. So, until I (or anybody else) do more research on this, I think it would be safer if you use system memcpy for string sorting in NumPy.
The memcpy in newer compilers is actually pretty good. For integers and such it sometime compiles inline using integer assignments, but I was loath to make it the default implementation until >= 4.1.x gcc became more common. However, strings might be a good place to use it.
I'm also thinking that at some point it becomes more efficient to do a indirect sort followed by take than to move all those big strings around. But I guess we won't know where that point is until we have both versions available. Chuck
A Saturday 09 February 2008, Charles R Harris escrigué:
I'm also thinking that at some point it becomes more efficient to do a indirect sort followed by take than to move all those big strings around. But I guess we won't know where that point is until we have both versions available.
I've done some experiments in that matter too. They are saying that, with the current mergesort in NumPy, an indirect sort followed by take performs similarly to direct sort for small string lengths (<=16), but indirect sort starts to win afterwards. The version with quicksort and optimized sSWAP should be between 2x and 3x times faster than current mergesort implementation, so the advantage for direct sort could grow up to somewhere between 50 and 100. A nice idea could be doing some more toughful experiments in order to find the point where an indirect sort followed by a take would be more efficient and automatically select this method beyond that point. However, this has the drawback that you have to use additional memory for keeping the indices in the indirect method. Of course, when strings are large, those indices should take a rather negligible space compared with strings itself. In any case, in some situations where space is critical, this can still be important. I don't know, but my opinion is that we shouldn't take too agressive optimizations for that matter. My vote is to document this possibility in the docstrings, so that the user wanting for extreme performance can use this approach if he wants to. Still, for string sizes greater than, say, 1000, well, an automatic selection of the indirect method is very tempting indeed. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 9, 2008 3:19 PM, Francesc Altet <faltet@carabos.com> wrote:
A Saturday 09 February 2008, Charles R Harris escrigué:
I'm also thinking that at some point it becomes more efficient to do a indirect sort followed by take than to move all those big strings around. But I guess we won't know where that point is until we have both versions available.
I've done some experiments in that matter too. They are saying that, with the current mergesort in NumPy, an indirect sort followed by take performs similarly to direct sort for small string lengths (<=16), but indirect sort starts to win afterwards.
The version with quicksort and optimized sSWAP should be between 2x and 3x times faster than current mergesort implementation, so the advantage for direct sort could grow up to somewhere between 50 and 100. A nice idea could be doing some more toughful experiments in order to find the point where an indirect sort followed by a take would be more efficient and automatically select this method beyond that point.
However, this has the drawback that you have to use additional memory for keeping the indices in the indirect method. Of course, when strings are large, those indices should take a rather negligible space compared with strings itself. In any case, in some situations where space is critical, this can still be important. I don't know, but my opinion is that we shouldn't take too agressive optimizations for that matter. My vote is to document this possibility in the docstrings, so that the user wanting for extreme performance can use this approach if he wants to. Still, for string sizes greater than, say, 1000, well, an automatic selection of the indirect method is very tempting indeed.
The strings with zeros problem runs deeper than it looked at first glance. Normal sorts don't work either, which means the type has a bad comparison function. And argsort still doesn't work even with the correct comparison function. Python, however, works as it should sorting lists of strings with zeros. So I'm going to have to track down and fix this oddity, but it is going to delay putting in the type specific quicksort for strings. Chuck
On Feb 8, 2008 8:58 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
I ran a few timing tests. On my machine strncmp is about 100x faster than opt_strncmp, but sSWAP (with some fixes), is about 10x faster then using the memcpy in a recent compiler. Does this match with your experience. Chuck
A Friday 08 February 2008, Charles R Harris escrigué:
On Feb 8, 2008 8:58 AM, Francesc Altet <faltet@carabos.com> wrote:
A Friday 08 February 2008, Charles R Harris escrigué:
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I have some code for this too and was going to merge it. Send yours along and I'll get to it this weekend.
Ok, great. I'm attaching it. Tell me if you need some clarification on the code.
I ran a few timing tests. On my machine strncmp is about 100x faster than opt_strncmp, but sSWAP (with some fixes), is about 10x faster then using the memcpy in a recent compiler. Does this match with your experience.
Well, I've run some more exhaustive tests on my laptop (Pentium 4 @ 2 GHz, Ubuntu 7.10, gcc 4.1.3, using -O3 optlevel) with the next sa1 array: numpy.random.seed(1) nelem = 10000 a = numpy.random.rand(nelem) sa1 = a.astype('S16') And I've chosen the next benchmark: /* start: the start of data for sa1 array ss: the length of the string type (16) num: the number of elements in sa1 (10000) */ int sort_S(char *start, int ss, npy_intp num) { char *pl = start; int a = 0; npy_intp i, j; for (i=0; i<num; i++) for (j=0; j<num; j++) a += opt_strncmp(pl+i*ss, pl+j*ss, ss); return a; } So, when I choose the next implementation of opt_strncmp: static int inline opt_strncmp1(char *a, char *b, intp n) { intp i; for (i=0; i<n; i++) { if (a[i] > b[i]) return i+1; if (a[i] < b[i]) return -(i+1); } return 0; } I get a time of 1.70 s. When using the next implementation: static int inline opt_strncmp2(char *a, char *b, intp n) { intp i; for (i=0; i<n; i++) { if (a[i] != b[i]) return (((unsigned char *)a)[i] - ((unsigned char *)b)[i]); } return 0; } I get a time of 1.24 s. The original strncmp gives me 2.05 seconds with this setup. So, in short and using my laptop: original strncmp: 2.05 s opt_stncmp1: 1.70 s ; speed-up: 20% opt_stncmp2: 1.24 s ; speed-up: 65% Just in case, I've also run the benchmark in an Opteron server (2 GHz, SuSe 10.1, gcc 4.2.1, using -O3 optlevel). Here are the results: original strncmp: 1.25 s opt_stncmp1: 1.61 s ; speed-up: -30% opt_stncmp2: 1.03 s ; speed-up: 20% So, I was wrong: opt_stncmp2 is quite more efficient than opt_stncmp1, and definitely better than the original strncmp. So, I would say that using opt_stncmp2 is always the best bet. I'm not certain why are you seeing that original strncmp is 100x faster than opt_strncmpX :-/ Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi,
I'm a bit confused that the sort method of a string character doesn't allow a mergesort:
s = numpy.empty(10, "S10") s.sort(kind="merge") TypeError: desired sort not supported for this type
However, by looking at the numpy sources, it seems that the only implemented method for sorting array strings is "merge" (I presume because it is stable). So, perhaps the message above should be fixed.
The only available method is the default quicksort. The mergesort is for argsort and was put in for lexsort to use.
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I've now got a string/ucs4 specific argsort(kind='q'), the string version of which is about 40% faster than the old default and about 10% faster than the mergesort version, but the string/ucs4 specific versions of sort aren't yet fairing as well. I'm timing things with In [1]: import timeit In [2]: t = timeit.Timer("np.fromstring(np.empty(10000).tostring(),dtype='|S8').sort(kind='q')","import numpy as np") In [3]: t.repeat(3,100) Out[3]: [0.22127485275268555, 0.21282196044921875, 0.21273088455200195] That's with the current sort(kind='q') in svn, which uses the new string compare function but is otherwise the old default quicksort. The new string specific version of quicksort I'm testing is actually a bit slower than that. Both versions correctly sort the array. So I'm going to continue to experiment a bit until I see what is going on. Chuck
A Monday 11 February 2008, Charles R Harris escrigué:
On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi,
I'm a bit confused that the sort method of a string character doesn't
allow a mergesort:
s = numpy.empty(10, "S10") s.sort(kind="merge")
TypeError: desired sort not supported for this type
However, by looking at the numpy sources, it seems that the only implemented method for sorting array strings is "merge" (I presume because it is stable). So, perhaps the message above should be fixed.
The only available method is the default quicksort. The mergesort is for argsort and was put in for lexsort to use.
That's good to know. However, I'm curious about where it is the specific quicksort implementation for strings/unicode. I've had a look at _sortmodule.c.src, but I can only find a quicksort implementation for: /**begin repeat #TYPE=BOOL,BYTE,UBYTE,SHORT,USHORT,INT,UINT,LONG,ULONG,LONGLONG,ULONGLONG,FLOAT,DOUBLE,LONGDOUBLE,CFLOAT,CDOUBLE,CLONGDOUBLE# **/ Where are the STRING/UNICODE versions?
Also, in the context of my work in indexing, and because of the slowness of the current implementation in NumPy, I've ended with an implementation of the quicksort method for 1-D array strings. For moderately large arrays, it is about 2.5x-3x faster than the (supposedly) mergesort version in NumPy, not only due to the quicksort, but also because I've implemented a couple of macros for efficient string swapping and copy. If this is of interest for NumPy developers, tell me and I will provide the code.
I've now got a string/ucs4 specific argsort(kind='q'), the string version of which is about 40% faster than the old default and about 10% faster than the mergesort version, but the string/ucs4 specific versions of sort aren't yet fairing as well. I'm timing things with
In [1]: import timeit
In [2]: t = timeit.Timer("np.fromstring(np.empty(10000).tostring(),dtype='|S8').s ort(kind='q')","import numpy as np")
In [3]: t.repeat(3,100) Out[3]: [0.22127485275268555, 0.21282196044921875, 0.21273088455200195]
That's with the current sort(kind='q') in svn, which uses the new string compare function but is otherwise the old default quicksort. The new string specific version of quicksort I'm testing is actually a bit slower than that. Both versions correctly sort the array. So I'm going to continue to experiment a bit until I see what is going on.
The version you are testing is your own or the one that I provided? Here are the timings for my laptop: In [32]: a = np.random.rand(10000).astype('S8') In [33]: %timeit a.copy().sort() # original sort in NumPy 10 loops, best of 3: 16.8 ms per loop In [34]: %timeit newqsort(a.copy()) # My own qsort implementation 100 loops, best of 3: 4.29 ms per loop (I'm using a random string array here mainly because I use the sort in my system NumPy, with the old string compare. However, as all the contents in strings are not NULL chars, the performance should be comparable, bar a few percent of improvement). So, my newqsort still seems to run almost 4x faster than the one in NumPy (you know, using the old string compare). However, when using a server with an Opteron processor, the view is much different: In [55]: a = np.random.rand(10000).astype('S8') In [56]: %timeit a.copy().sort() 100 loops, best of 3: 3.82 ms per loop In [57]: %timeit newqsort(a.copy()) 100 loops, best of 3: 3.29 ms per loop Here, the difference in performance has been reduced to a mere 15% (still favouring newqsort). So, with this, it seems like the performance of the original sorting in NumPy only suffers a lot when running in old processors (eg. Pentium 4), while the performance is reasonable with newer ones (Opteron). On its hand, newqsort seems to perform reasonably well in both. I don't know what exactly is the reason for this (I don't know where it is the code for the original quicksort for strings, so I can't do a visual comparison), but it would be great if we can discover it! Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
A Monday 11 February 2008, Francesc Altet escrigué:
A Monday 11 February 2008, Charles R Harris escrigué:
That's with the current sort(kind='q') in svn, which uses the new string compare function but is otherwise the old default quicksort. The new string specific version of quicksort I'm testing is actually a bit slower than that. Both versions correctly sort the array. So I'm going to continue to experiment a bit until I see what is going on.
The version you are testing is your own or the one that I provided? Here are the timings for my laptop:
In [32]: a = np.random.rand(10000).astype('S8')
In [33]: %timeit a.copy().sort() # original sort in NumPy 10 loops, best of 3: 16.8 ms per loop
In [34]: %timeit newqsort(a.copy()) # My own qsort implementation 100 loops, best of 3: 4.29 ms per loop
(I'm using a random string array here mainly because I use the sort in my system NumPy, with the old string compare. However, as all the contents in strings are not NULL chars, the performance should be comparable, bar a few percent of improvement).
So, my newqsort still seems to run almost 4x faster than the one in NumPy (you know, using the old string compare).
However, when using a server with an Opteron processor, the view is much different:
In [55]: a = np.random.rand(10000).astype('S8')
In [56]: %timeit a.copy().sort() 100 loops, best of 3: 3.82 ms per loop
In [57]: %timeit newqsort(a.copy()) 100 loops, best of 3: 3.29 ms per loop
Here, the difference in performance has been reduced to a mere 15% (still favouring newqsort). So, with this, it seems like the performance of the original sorting in NumPy only suffers a lot when running in old processors (eg. Pentium 4), while the performance is reasonable with newer ones (Opteron). On its hand, newqsort seems to perform reasonably well in both. I don't know what exactly is the reason for this (I don't know where it is the code for the original quicksort for strings, so I can't do a visual comparison), but it would be great if we can discover it!
Mmm, comparing my new strncmp and the one that you have implemented in SVN, I've found a difference that can account for part of the difference in performances. With your version of strncmp in SVN (compare_string), these are my timings with the Opteron server: In [17]: np.random.seed(1) In [18]: a = np.random.rand(10000).astype('S8') In [19]: %timeit a.copy().sort() 100 loops, best of 3: 3.86 ms per loop In [20]: %timeit newqsort(a.copy()) 100 loops, best of 3: 3.44 ms per loop which gives times a 5% worse. Try to use my version and tell me if it does better: static int inline opt_strncmp(char *a, char *b, size_t n) { size_t i; unsigned char c, d; for (i = 0; i < n; i++) { c = a[i]; d = b[i]; if (c != d) return c - d; } return 0; } Although a 5% is maybe too little improvement :-/ --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 11, 2008 4:06 AM, Francesc Altet <faltet@carabos.com> wrote:
A Monday 11 February 2008, Francesc Altet escrigué:
A Monday 11 February 2008, Charles R Harris escrigué:
Mmm, comparing my new strncmp and the one that you have implemented in SVN, I've found a difference that can account for part of the difference in performances. With your version of strncmp in SVN (compare_string), these are my timings with the Opteron server:
<snip>
In [17]: np.random.seed(1)
In [18]: a = np.random.rand(10000).astype('S8')
In [19]: %timeit a.copy().sort() 100 loops, best of 3: 3.86 ms per loop
In [20]: %timeit newqsort(a.copy()) 100 loops, best of 3: 3.44 ms per loop
which gives times a 5% worse. Try to use my version and tell me if it does better:
static int inline opt_strncmp(char *a, char *b, size_t n) { size_t i; unsigned char c, d; for (i = 0; i < n; i++) { c = a[i]; d = b[i]; if (c != d) return c - d; } return 0; }
I didn't notice any speed difference. And while returning the difference of two unsigned numbers should work with modular arithmetic when it is cast to integer, I thought the explicit return based on a compare was clearer and safer. Comparisons always work. I've attached my working _sortmodule.c.src file so you can fool with these different changes on your machines also. This is on top of current svn. Chuck
I will be writing some C code that I will compile into a shared library (.so) on my MacOSX computer to use with ctypes. That code will be calling code from a (big) scientific numerical library (Gnu Scientific Library - GSL) to crunch the numbers. But I don't see how I incorporate that code into the .so file so my shared code can get to it when I call it from Python with ctypes. I do _not_ want to make the GSL callable from Python, only from my own C module. I suspect this isn't a ctypes question in particular. I'm hoping to avoid having to tur the whole GSL into a shared library and loading it just to use a few functions. Or avoid having to track down which functions my code will call (all the way down the trees) and rip that out to add to my own shared lib. There's got to be a better way to make use of big, useful libraries when speeding up python with shared lib extension. I hope. Maybe there are ways to do this using a gcc or g++ option. Right now my make file is simply gcc - bundle -flat_namespace -undefined suppress -o mycode.so mycode.o gcc -c mycode.c -o mycode.o Any hints appreciated. I will continue googling. Nothing so far. Thanks. -- Lou Pecora, my views are my own. --------------------------------- Never miss a thing. Make Yahoo your homepage.
Dear Lou, You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html). PyGSL interfaces with numpy, and it may have what you need. The trouble with calling GSL directly from ctypes is GSL's input and output is very structure-oriented, which complicates the python/C logic. Like you mentioned, GSL is pretty big, and you'd like to avoid loading it in its entirety. Even if you write your own shared library module that links against GSL, GSL might still get loaded in its entirety when your own shared library gets loaded. There is a mode argument with C's dlopen and the ctypes.CDLL function. dlopen supports a RTLD_LAZY flag (see man dlopen), which you can try passing to ctypes.CDLL, but I'm not sure what will happen since the ctypes documentation makes no mention of it. Here is a quick example of how to call a C function from python using ctypes. /** Sum num_elements numbers and return the result. **/ extern int myfunc(int *numbers, int num_elements) { int i, sum = 0; for (i = 0; i < num_elements; i++) { sum += numbers[i]; } return sum; } # First compile the source $ gcc -DNDEBUG -O2 -g -pipe -Wall -fstack-protector -D_GNU_SOURCE -fPIC -I/usr/lib/python2.5/site-packages/numpy/core/include -I/usr/include/python2.5 -c foo.c -o foo.o # Now link it. $ gcc -pthread -shared foo.o -L/usr/lib -lpython2.5 -o foo.so # Now run it in python. $ ipython In [1]: import numpy, ctypes # Load the shared library we just created In [2]: foo=ctypes.CDLL('./foo.so') # Create a numpy array of ints In [3]: A=numpy.array([1,2,3],dtype='int') # Set the return type we expect. In [4]: foo.myfunc.restype = ctypes.c_int # Call the C function. In [5]: print foo.myfunc(A.ctypes.data, ctypes.c_int(3)) 6 I hope this helps. Damian Lou Pecora wrote:
I will be writing some C code that I will compile into a shared library (.so) on my MacOSX computer to use with ctypes. That code will be calling code from a (big) scientific numerical library (Gnu Scientific Library - GSL) to crunch the numbers. But I don't see how I incorporate that code into the .so file so my shared code can get to it when I call it from Python with ctypes. I do _not_ want to make the GSL callable from Python, only from my own C module. I suspect this isn't a ctypes question in particular. I'm hoping to avoid having to tur the whole GSL into a shared library and loading it just to use a few functions. Or avoid having to track down which functions my code will call (all the way down the trees) and rip that out to add to my own shared lib. There's got to be a better way to make use of big, useful libraries when speeding up python with shared lib extension. I hope.
Maybe there are ways to do this using a gcc or g++ option. Right now my make file is simply
gcc - bundle -flat_namespace -undefined suppress -o mycode.so mycode.o
gcc -c mycode.c -o mycode.o
Any hints appreciated. I will continue googling. Nothing so far. Thanks.
-- Lou Pecora, my views are my own.
------------------------------------------------------------------------ Never miss a thing. Make Yahoo your homepage. <http://us.rd.yahoo.com/evt=51438/*http://www.yahoo.com/r/hs>
------------------------------------------------------------------------
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same. cheers, David
David Cournapeau wrote:
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David
Really? distutils generates .so files for me, which I assume are shared libraries. FYI: I'm running Fedora 8 on an x86. Does distutils not generate a shared library on a mac? Damian
On Feb 12, 2008 12:14 AM, Damian Eads <eads@soe.ucsc.edu> wrote:
David Cournapeau wrote:
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David
Really? distutils generates .so files for me, which I assume are shared libraries. FYI: I'm running Fedora 8 on an x86. Does distutils not generate a shared library on a mac?
Python extension modules are shared libraries, yes. But they must follow a particular format, namely exposing a correct "init<modulename>" function. distutils/setuptools only maked Python extension modules, not arbitrary shared libraries. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Robert Kern wrote:
On Feb 12, 2008 12:14 AM, Damian Eads <eads@soe.ucsc.edu> wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html). Unfortunately, this does not work. Distutils only knows how to build
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote: python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David Really? distutils generates .so files for me, which I assume are shared
David Cournapeau wrote: libraries. FYI: I'm running Fedora 8 on an x86. Does distutils not generate a shared library on a mac?
Python extension modules are shared libraries, yes. But they must follow a particular format, namely exposing a correct "init<modulename>" function. distutils/setuptools only maked Python extension modules, not arbitrary shared libraries.
Perhaps I was a bit too liberal in my use of the term "extension module". Several small libraries for a project at work do not define the standard init<modulename> function, and yet they build with distutils. I can load them into ctypes without any hitches. Perhaps distutils does not check for the presence of the init<modulename> function and required data structures? I'll admit I may be abusing distutils by using it for something for which it wasn't designed. Damian
On Feb 12, 2008 12:41 AM, Damian Eads <eads@soe.ucsc.edu> wrote:
Robert Kern wrote:
On Feb 12, 2008 12:14 AM, Damian Eads <eads@soe.ucsc.edu> wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html). Unfortunately, this does not work. Distutils only knows how to build
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote: python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David Really? distutils generates .so files for me, which I assume are shared
David Cournapeau wrote: libraries. FYI: I'm running Fedora 8 on an x86. Does distutils not generate a shared library on a mac?
Python extension modules are shared libraries, yes. But they must follow a particular format, namely exposing a correct "init<modulename>" function. distutils/setuptools only maked Python extension modules, not arbitrary shared libraries.
Perhaps I was a bit too liberal in my use of the term "extension module". Several small libraries for a project at work do not define the standard init<modulename> function, and yet they build with distutils. I can load them into ctypes without any hitches. Perhaps distutils does not check for the presence of the init<modulename> function and required data structures? I'll admit I may be abusing distutils by using it for something for which it wasn't designed.
Yup. It usually works on Linux because the init<modulename> bit isn't checked. On Windows, it is, and the build will fail. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Feb 12, 2008 12:14 AM, Damian Eads <eads@soe.ucsc.edu> wrote:
David Cournapeau wrote:
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David
Really? distutils generates .so files for me, which I assume are shared libraries. FYI: I'm running Fedora 8 on an x86. Does distutils not generate a shared library on a mac?
As to David's point, yes, distutils makes a .so shared library on Macs. This is not the same thing as a dynamic library (on Macs) which is what ctypes needs (on Macs), IIRC. There is a subtle, but important difference between the two. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Mon, 2008-02-11 at 23:14 -0700, Damian Eads wrote:
David Cournapeau wrote:
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David
Really? distutils generates .so files for me, which I assume are shared libraries.
That's correct. But python extensions are not shared libraries; more exactly: some systems make the different between libraries which are loaded when the executable is launched (dynamic linking), and libraries which are loaded dynamically (dynamic loading, through dlopen), possible in the middle of the execution. All python extensions fall in the later category. It happens that on Linux (and most unices I know, mac os x being an exception), those are the same. But on mac os x (and windows), those are different: that's why you have .so and .dylib. dlopen does not really exist on mac os X, in the sense that it is a wrapper around the native linker/loader tools (and dlopen is not 100 % complete: you cannot unload/reload extensions I think on mac os X). Some of the differences are namespace (mac os X has the notion of a namespace for libraries, which I know nothing about; unices traditionally have a flat namespace for libraries), etc... T David
On Tue, 2008-02-12 at 16:05 +0900, David Cournapeau wrote:
On Mon, 2008-02-11 at 23:14 -0700, Damian Eads wrote:
David Cournapeau wrote:
On Mon, 2008-02-11 at 22:50 -0700, Damian Eads wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
Unfortunately, this does not work. Distutils only knows how to build python extensions, not shared libraries. Depending on the platform, this is not the same thing, and mac os X is such a platform where both are not the same.
cheers,
David
Really? distutils generates .so files for me, which I assume are shared libraries.
That's correct. But python extensions are not shared libraries; more exactly: some systems make the different between libraries which are loaded when the executable is launched (dynamic linking), and libraries which are loaded dynamically (dynamic loading, through dlopen), possible in the middle of the execution. All python extensions fall in the later category.
It happens that on Linux (and most unices I know, mac os x being an exception), those are the same. But on mac os x (and windows), those are different: that's why you have .so and .dylib. dlopen does not really exist on mac os X, in the sense that it is a wrapper around the native linker/loader tools (and dlopen is not 100 % complete: you cannot unload/reload extensions I think on mac os X). Some of the differences are namespace (mac os X has the notion of a namespace for libraries, which I know nothing about; unices traditionally have a flat namespace for libraries), etc...
If you are interested in knowing the difference between bundle and dynamic libraries (mac os X), here is some info: http://www.finkproject.org/doc/porting/porting.en.html#shared.lib-and-mod David
Damian, Lots of good info there. Thanks very much. -- Lou --- Damian Eads <eads@soe.ucsc.edu> wrote:
Dear Lou,
You may want to try using distutils or setuputils, which makes compiling extensions much easier. It does the hard work of finding out which flags are needed to compile extensions on the host platform. There are many examples on the web on how to use distutils to build C extensions (http://docs.python.org/ext/building.html).
[cut] -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
First, thanks to all who answered my questions about trying to use a large library with CTypes and my own shared library. The bottom line seems to be this: There is no way to incorporate code external to your own shared library. You have to either pull out the code you want from the static library's source (painful) or you must just include the whole library (huge!) and make it all one big shared library. Did I get that right? If so, it's a sad statement that makes shared libraries harder to write and works against the reuse of older established code bases. I am not criticizing CTypes. This appears to be the way static and shared libraries work, especially on Mac OS X, maybe elsewhere. I'd really like to be wrong about this and I will follow up on some of the suggested reading you all gave me. Thanks, again. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping
Hello, On Feb 12, 2008 4:25 PM, Lou Pecora <lou_boog2000@yahoo.com> wrote:
First, thanks to all who answered my questions about trying to use a large library with CTypes and my own shared library. The bottom line seems to be this: There is no way to incorporate code external to your own shared library. You have to either pull out the code you want from the static library's source (painful) or you must just include the whole library (huge!) and make it all one big shared library.
Did I get that right? If so, it's a sad statement that makes shared libraries harder to write and works against the reuse of older established code bases. I am not criticizing CTypes. This appears to be the way static and shared libraries work, especially on Mac OS X, maybe elsewhere.
I'd really like to be wrong about this and I will follow up on some of the suggested reading you all gave me.
I only quickly read through the previous thread, but I get that idea that what you want to do is to link your shared library against the the GSL shared library and then access your own library using ctypes. If done like this, you don't need to worry about wrapping GSL or pulling GSL code into your own library. As far as I know, this works exactly like it does when you link an executable against a shared library. If distutils doesn't allow you to do this easily, you could try using SCons's SharedLibrary builder instead. Regards, Albert
Albert Strasheim <fullung@gmail.com> wrote: Hello, I only quickly read through the previous thread, but I get that idea that what you want to do is to link your shared library against the the GSL shared library and then access your own library using ctypes. If done like this, you don't need to worry about wrapping GSL or pulling GSL code into your own library. As far as I know, this works exactly like it does when you link an executable against a shared library. If distutils doesn't allow you to do this easily, you could try using SCons's SharedLibrary builder instead. Regards, Albert _______________________________________________ Albert, Yes, I think you got the idea right. I want to call my own C code using CTypes interface, then from within my C code call GSL C code, i.e. a C function calling another C function directly. I do *not* want to go back out through the Python interface. So you are right, I do not want to wrap GSL. It sounds like I can just add something like -lnameofGSLdylib (where I put in the real name of the GSL library after the -l) in my gcc command to make my shared lib. Is that right? Thanks for your help. -- Lou Pecora, my views are my own. --------------------------------- Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now.
Hello, On Feb 12, 2008 6:19 PM, Lou Pecora <lou_boog2000@yahoo.com> wrote:
Albert,
Yes, I think you got the idea right. I want to call my own C code using CTypes interface, then from within my C code call GSL C code, i.e. a C function calling another C function directly. I do *not* want to go back out through the Python interface. So you are right, I do not want to wrap GSL.
It sounds like I can just add something like -lnameofGSLdylib (where I put in the real name of the GSL library after the -l) in my gcc command to make my shared lib. Is that right?
Sounds about right. I don't know the Mac that well as far as the various types of dynamic libraries go, so just check that you're working with the right type of libraries, but you've got the right idea. Regards, Albert
--- Albert Strasheim <fullung@gmail.com> wrote:
Hello,
Sounds about right. I don't know the Mac that well as far as the various types of dynamic libraries go, so just check that you're working with the right type of libraries, but you've got the right idea.
Regards,
Albert
Thanks, Albert. I'll report back to this thread when I give it a try. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping
Lou Pecora wrote:
... This appears to be the way static and shared libraries work, especially on Mac OS X, maybe elsewhere.
Have you tried linking against a GSL static library? I don't have a mac, but most linkers only pull in the routines you need. For example, using windows and mingw: #include <stdio.h> #include <gsl/gsl_sf_bessel.h> int main (void) { double x = 5.0; double y = gsl_sf_bessel_J0 (x); printf ("J0(%g) = %.18e\n", x, y); return 0; } ...compiles to a.exe which outputs: J0(5) = -1.775967713143382900e-001 The stripped executable is about 92 kB in comparison to the 2 mega byte libgsl.a. Unstripped there are about 150 symbols containing gsl, compared to 5351 symbols in the library libgsl.a. I just needed to put "-lgsl" on the command line and rename "$LIB/libgsl.dll.def" to something else so the shared version wasn't found. In this case the linker has not pulled in all of the library. Presumably just the parts it needed, including various things like error reporting, sin, cos, exp etc. Older platforms, including vax and various unix'es also seemed to behave in the same way in the past. Are you saying the mac is somehow different? Perhaps they're trying to hold people to "open source ransom", where they have to give away to source so it can be recompiled when the next OSX escapes ;-) Cheers, Jon
--- Jon Wright <wright@esrf.fr> wrote:
Lou Pecora wrote:
... This appears to be the way static and shared libraries work, especially on Mac OS X, maybe elsewhere.
Have you tried linking against a GSL static library? I don't have a mac, but most linkers only pull in the routines you need. For example, using windows and mingw:
#include <stdio.h> #include <gsl/gsl_sf_bessel.h> int main (void) { double x = 5.0; double y = gsl_sf_bessel_J0 (x); printf ("J0(%g) = %.18e\n", x, y); return 0; }
...compiles to a.exe which outputs:
J0(5) = -1.775967713143382900e-001
Yes, I know about this approach if I am making an executable. But I want to make my code into a shared library (my code will not have a main, just the functions I write) and, if possible, let my code call the GSL code it needs from the C function I write (i.e. no python interface). If what you did can be done for a shared library, then that would be great. However, I am ignorant of how to do this. I will try to make my shared library using gcc and then add the GSL library using the -l option as someone else suggested. Maybe that will work. I'll report back. I have been searching for info on the right approach to this on the Mac, since, as I understand, Mac OS X does make a distinction between shared libraries and dynamic libraries (which I don't understand fully). Thanks. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ
--- Jon Wright <wright@esrf.fr> wrote:
Lou Pecora wrote:
... This appears to be the way static and shared libraries work, especially on Mac OS X, maybe elsewhere. Have you tried linking against a GSL static library? I don't have a mac, but most linkers only pull in the routines you need. For example, using windows and mingw:
#include <stdio.h> #include <gsl/gsl_sf_bessel.h> int main (void) { double x = 5.0; double y = gsl_sf_bessel_J0 (x); printf ("J0(%g) = %.18e\n", x, y); return 0; }
...compiles to a.exe which outputs:
J0(5) = -1.775967713143382900e-001
Yes, I know about this approach if I am making an executable. But I want to make my code into a shared library (my code will not have a main, just the functions I write) and, if possible, let my code call the GSL code it needs from the C function I write (i.e. no python interface). If what you did can be done for a shared library, then that would be great. However, I am ignorant of how to do this. I will try to make my shared library using gcc and then add the GSL library using the -l option as someone else suggested. Maybe that will work. Oh, I may have misunderstood what you are trying to do then. You just want to call a shared library from another shared library ? This is
I'll report back. I have been searching for info on the right approach to this on the Mac, since, as I understand, Mac OS X does make a distinction between shared libraries and dynamic libraries (which I don't understand fully). To be over-simplistic: shared libraries are linked into the executable, and all its symbols (function, variables) are solved when you launch the executable. A dynamic library is not linked into the executable, and can be loaded at anytime during the execution of the executable. Shared
Lou Pecora wrote: possible on any platform supporting shared library (including but not limited to mac os x, windows, linux, most not ancient unices). As Albert said, just do (with gcc): gcc -shared -o mysharedlib mysharedlib.c -lgsl This works on mac os X as well as linux (and even windows with mingw). If you want to link the gsl statically (so that your own lib does not depend on the gsl anymore), you have use a trick to tell gcc to link the gsl: gcc -shared -o mysharedlib mysharedlib.c -Wl,-Bstatic -lgsl -Wl,-Bdynamic -Wl is used by gcc to pass option to the linker directly. -Bstatic says that all link options after will be static. You have to use -Bdynamic after, to avoid linking everything static (like gcc runtime, the C lib, which are automatically linked by default by gcc: it is almost always a bad idea to statically link those). In the first case, mysharedlib.so will need libgsl.so: ldd mysharedlib.so -> linux-gate.so.1 => (0xffffe000) libgsl.so.0 => /usr/lib/libgsl.so.0 (0xb7ddb000) libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7c91000) libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7c6b000) /lib/ld-linux.so.2 (0x80000000) In the second case: linux-gate.so.1 => (0xffffe000) libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7e60000) /lib/ld-linux.so.2 (0x80000000) I don't know if the second method works on mac os X: since it bypass gcc and goes directly to the linker, which is notably different on mac os X, you may have to do it differently. library are "just" a way to avoid duplicate code, but are totally transparent to the code user: int foo() { return bar(); } If bar is in a shared lib (libbar.so) or in another object code (bar.o), it does not make a difference for you. With a dynamic lib, on most unices, you do hdl = dlopen("libbar.so") ((int)(*bar)()) = dlsym(hdl, "bar"); That is you explicitly load the functions you want to use. Without this scheme, you would have to link your extension to the python executable when python is built, which is totally impractical of course. IOW, dynamic libraries are used for "plugins", things which can be added to an executable *after* the executable is built. On linux (and other unices using the elf binary format), both types of libraries are built exactly the same. On mac os X (and windows as well), they are not. Again, this is oversimplification, but you don't need to know much more in almost all the cases. cheers, David
--- David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Oh, I may have misunderstood what you are trying to do then. You just want to call a shared library from another shared library ? This is possible on any platform supporting shared library (including but not limited to mac os x, windows, linux, most not ancient unices). [cut]
David, First, thanks very much for all the information. I am still digesting it, but you gave a clear explanation about the difference between shared and dynamic libraries on the Mac. I tried some of your compile/like commands, but the Mac gcc did not understand some things like -Bstatic and -shared. It seems to want to make bundles. I guess your code was a Linux version which the Mac doesn't like. But encouraged by your help, I got the following make file to work: # ---- Library make --------------------------- mysharedlib.so: mysharedlib.o mysharedlib.mak gcc -bundle -flat_namespace -undefined suppress -o mysharedlib.so mysharedlib.o \ fcnlib.a # ---- gcc C compile ------------------ mysharedlib.o: mysharedlib.c mysharedlib.h mysharedlib.mak gcc -c mysharedlib.c -o mysharedlib.o In the above fcnlib.a is a simple static library I made before using the above make. This created the shared library mysharedlib.so which I imported and handled with CTypes. Calling a function in fcnlib.a from python worked. A possible downside is that the shared library contains *all* of the fcnlib. I examined it using "nm mysharedlib.so". That showed that all the functions of fcnlib.a were present in mysharedlib.so even though the function in mysharedlib.c only called one function of fcnlib.a. I don't know how much of a burden this will impose at run time if do this with GSL. It would be nice to only pick up the stuff I need. But at least I have workable approach. Thanks for your help. Comments welcome. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs
On Feb 14, 2008 12:14 AM, Lou Pecora <lou_boog2000@yahoo.com> wrote:
--- David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Oh, I may have misunderstood what you are trying to do then. You just want to call a shared library from another shared library ? This is possible on any platform supporting shared library (including but not limited to mac os x, windows, linux, most not ancient unices). [cut]
David,
First, thanks very much for all the information. I am still digesting it, but you gave a clear explanation about the difference between shared and dynamic libraries on the Mac.
I tried some of your compile/like commands, but the Mac gcc did not understand some things like -Bstatic and -shared. It seems to want to make bundles. I guess your code was a Linux version which the Mac doesn't like.
Yes, I forgot that -shared does not work on mac os X. -Bstatic, being a linker option as I said, had little chance to work on mac os X. But encouraged by your help, I got the
# ---- Library make --------------------------- mysharedlib.so: mysharedlib.o mysharedlib.mak gcc -bundle -flat_namespace -undefined suppress -o mysharedlib.so mysharedlib.o \ fcnlib.a
# ---- gcc C compile ------------------ mysharedlib.o: mysharedlib.c mysharedlib.h mysharedlib.mak gcc -c mysharedlib.c -o mysharedlib.o
In the above fcnlib.a is a simple static library I made before using the above make. This created the shared library mysharedlib.so which I imported and handled with CTypes. Calling a function in fcnlib.a from python worked.
A possible downside is that the shared library contains *all* of the fcnlib. I examined it using "nm mysharedlib.so". That showed that all the functions of fcnlib.a were present in mysharedlib.so even though the function in mysharedlib.c only called one function of fcnlib.a. I don't know how much of a burden this will impose at run time if do this with GSL. It would be nice to only pick up the stuff I need. But at least I have workable approach.
This is logical, or more exactly, the tools did what it thinks you are asking: putting the archive (the .a library) in your executable. If instead of libfnclib.a, you just put -lfcnlib in the command line, it will pick up only the necessary code for the functions you are calling (at least it does on linux if I remember correctly). But the real question is : if you are concerned with code bload, why using static lib at all ? Why not using shared library, which is exactly designed to solve what you are trying to do ? cheers, David
--- David Cournapeau <cournape@gmail.com> wrote:
But the real question is : if you are concerned with code bload, why using static lib at all ? Why not using shared library, which is exactly designed to solve what you are trying to do ? cheers, David
Yes, a good question. Two reasons I started off with the static library. One is that Gnu instructions claimed the dynamic library did not always build properly on the Mac OS X. So I just built the static GSL and figured if I got that to link up to my code, I could then spend some time trying the dynamic build. The other reason is that I am just learning this and I am probably backing into the "right" way to do this rather than starting right off with the right way. Maybe my worries about bloat and (even more) time to load are not important for the GSL and the code will load fast enough and not take up too much in resources to matter. Later today I will try to build the dynamic version of GSL and see what that yields. If I get it I will link to that as you suggest. Thanks, again. Your suggestions have moved me along nicely. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hs
On Wed, 2008-02-13 at 08:20 -0800, Lou Pecora wrote:
--- David Cournapeau <cournape@gmail.com> wrote:
But the real question is : if you are concerned with code bload, why using static lib at all ? Why not using shared library, which is exactly designed to solve what you are trying to do ? cheers, David
Yes, a good question. Two reasons I started off with the static library. One is that Gnu instructions claimed the dynamic library did not always build properly on the Mac OS X.
If true, that's a good argument. I don't know the state of libtool of mac os X (the part of autotools which deals with building libraries in a cross platform way). Given the history of apple with open source, I would not be surprised if the general support was subpar compared to other unices.
So I just built the static GSL and figured if I got that to link up to my code, I could then spend some time trying the dynamic build. The other reason is that I am just learning this and I am probably backing into the "right" way to do this rather than starting right off with the right way. Maybe my worries about bloat and (even more) time to load are not important for the GSL and the code will load fast enough and not take up too much in resources to matter.
I don't know what kind of applications you are developing, but taking care of the time to load the application because of the huge number of symbols seems like really premature optimization to me. That's the kind of problems you don't see if your applications are not huge (or developed with C++, which put a huge pressure on the linker/loader tools by its very nature). Also, note that all modern OS (this includes even windows since NT) do not load the whole shared library in memory, and that two applications needing the GSL will share the same version in memory. The same "physical page" of a shared library can be "mapped" into different address spaces (for different processes). I use "", because that's a huge over-simplification, and that's where it reaches my own understanding of the thing. This sharing cannot happen for static libraries. cheers, David
--- David Cournapeau <cournapeau@cslab.kecl.ntt.co.jp> wrote:
On Wed, 2008-02-13 at 08:20 -0800, Lou Pecora wrote:
Yes, a good question. Two reasons I started off with the static library. One is that Gnu instructions claimed the dynamic library did not always build properly on the Mac OS X.
If true, that's a good argument. I don't know the state of libtool of mac os X (the part of autotools which deals with building libraries in a cross platform way). Given the history of apple with open source, I would not be surprised if the general support was subpar compared to other unices.
I ran the GSL install again and allowed the dynamic lib to be built. It worked fine so maybe Apple has done better lately. I should have tried it right from the beginning. Anyway, I have a make script that seems to work find and I can link my shared lib to the GSL library or, I would guess, any other library I have. Thanks, again for your helpful suggestions. I will put up my code (it's small) for others to see on this list in a separate message. That might help others not to make the same mistakes I did.
I don't know what kind of applications you are developing, but taking care of the time to load the application because of the huge number of symbols seems like really premature optimization to me. That's the kind of problems you don't see if your applications are not huge (or developed with C++, which put a huge pressure on the linker/loader tools by its very nature).
Also, note that all modern OS (this includes even windows since NT) do not load the whole shared library in memory, and that two applications needing the GSL will share the same version in memory. The same "physical page" of a shared library can be "mapped" into different address spaces (for different processes). I use "", because that's a huge over-simplification, and that's where it reaches my own understanding of the thing. This sharing cannot happen for static libraries.
Good advice. You are right. My code is not large and it loads fast. -- Lou Pecora, my views are my own. ____________________________________________________________________________________ Looking for last minute shopping deals? Find them fast with Yahoo! Search. http://tools.search.yahoo.com/newsearch/category.php?category=shopping
A Monday 11 February 2008, Charles R Harris escrigué:
I've attached my working _sortmodule.c.src file so you can fool with these different changes on your machines also. This is on top of current svn.
Ok. In order to compare pears with pears, I've decided to create a standalone program in C (attached), based on your version (yes, it is almost the same that the one that I came up with). This also allows to run it quickly in as many platforms as possible. The compiler throws some warnings, but they are not important (I think). Here are the results of running it in several platforms: 1) My laptop: Ubuntu 7.1 (gcc 4.1.3, Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000 2) My laptop: Windows XP (MSVC 7.1, Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.971000 C qsort with Python style compare: 0.962000 NumPy newqsort: 0.921000 3) An Opteron server: SuSe 10.1 (gcc 4.2.1, Opteron @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000 Some of the conclusions that can be drawn: * C qsort performs pretty badly on my Pentium4 laptop with Ubuntu * C qsort on Win on my laptop performs very similar to newqsort * newqsort performs much better on my Ubuntu Linux than in Windows * On Opteron, C qsort and newqsort do perform very similarly * and most importantly, newqsort runs faster in *all* platforms So, provided the last conclusion, I think it is safe to check newqsort in NumPy (unless something catastrofic might occur on other platforms). Finally, a couple of small things: * MSVC doesn't swallow the "inline" qualifier. So we should remove it and hope that most of NumPy installations will be compiled -O3 at least. * I'd definitely keep memcpy by default. From my timings, it looks like the best option for all platforms. I hope the benchmark will behave well in your platform too (i.e. newqsort will perform the best ;) Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 11, 2008 1:15 PM, Francesc Altet <faltet@carabos.com> wrote:
A Monday 11 February 2008, Charles R Harris escrigué:
I've attached my working _sortmodule.c.src file so you can fool with these different changes on your machines also. This is on top of current svn.
Ok. In order to compare pears with pears, I've decided to create a standalone program in C (attached), based on your version (yes, it is almost the same that the one that I came up with). This also allows to run it quickly in as many platforms as possible. The compiler throws some warnings, but they are not important (I think).
Here are the results of running it in several platforms:
1) My laptop: Ubuntu 7.1 (gcc 4.1.3, Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000
Wow, what a difference.
2) My laptop: Windows XP (MSVC 7.1, Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.971000 C qsort with Python style compare: 0.962000 NumPy newqsort: 0.921000
3) An Opteron server: SuSe 10.1 (gcc 4.2.1, Opteron @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000
Some of the conclusions that can be drawn:
* C qsort performs pretty badly on my Pentium4 laptop with Ubuntu * C qsort on Win on my laptop performs very similar to newqsort * newqsort performs much better on my Ubuntu Linux than in Windows * On Opteron, C qsort and newqsort do perform very similarly * and most importantly, newqsort runs faster in *all* platforms
So, provided the last conclusion, I think it is safe to check newqsort in NumPy (unless something catastrofic might occur on other platforms).
Finally, a couple of small things:
* MSVC doesn't swallow the "inline" qualifier. So we should remove it and hope that most of NumPy installations will be compiled -O3 at least.
I was afraid of that. The inline keyword is a fairly new standard; gcc has had it for a while but the older versions of MSVC didn't. I don't know if the newer MSVC versions do. IIRC, there was another way to get MSVC to inline. Of course, we could go to C++ :0)
* I'd definitely keep memcpy by default. From my timings, it looks like the best option for all platforms.
OK. Was that just for the copies, or was it for the swaps also? I ran a version of swap using memcpy on my machine and the sort was about half as fast for 8 character strings.
I hope the benchmark will behave well in your platform too (i.e. newqsort will perform the best ;)
I'll check it out when I get home. As I say, it was running about 10% slower on my machine, but if it does better on most platforms it is probably the way to go. We can always change it in the future when everyone is running on quantum computers. Chuck
A Monday 11 February 2008, Charles R Harris escrigué:
On Feb 11, 2008 1:15 PM, Francesc Altet <faltet@carabos.com> wrote:
Here are the results of running it in several platforms:
1) My laptop: Ubuntu 7.1 (gcc 4.1.3, Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000
Wow, what a difference.
Yeah. This is why I got so excited initially. It's unfortunate that most of the speed-up in newqsort in this case is probably due to a possible flaw in qsort implementation for the combination Ubuntu/Pentium4. On the positive side, it is nice to see that other distros/processors have a decent performance on system qsort ;-)
* I'd definitely keep memcpy by default. From my timings, it looks like the best option for all platforms.
OK. Was that just for the copies, or was it for the swaps also? I ran a version of swap using memcpy on my machine and the sort was about half as fast for 8 character strings.
No, only for the copies. For the swaps, this memcpy-based version: #define swap_string(s1, s2, len) { \ memcpy((vp), (s2), (len)); \ memcpy((s2), (s1), (len)); \ memcpy((s1), (vp), (len)); \ } performs extremely bad on my systems: Pentium4/Ubuntu 7.10: * newqsort with the loop version of swap_string: 0.65 s * newqsort with the memcpy version of swap_string: 9.14 s Opteron/SuSe LE 10.3: * newqsort with the loop version of swap_string: 0.59 s * newqsort with the memcpy version of swap_string: 8.71 s So, it seems that the nice newqsort performance is extremely dependent on the loop version of swap_string.
I hope the benchmark will behave well in your platform too (i.e. newqsort will perform the best ;)
I'll check it out when I get home. As I say, it was running about 10% slower on my machine, but if it does better on most platforms it is probably the way to go. We can always change it in the future when everyone is running on quantum computers.
Quantum computers? Oh, I can't wait for mine ;-) --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
A Monday 11 February 2008, Charles R Harris escrigué:
I'll check it out when I get home. As I say, it was running about 10% slower on my machine, but if it does better on most platforms it is probably the way to go. We can always change it in the future when everyone is running on quantum computers.
We've done some testing on newqsort in several computers in our company. Here are the results for ordering a list with 1 million of strings of length 15 filled with random information (using C rand()): 1) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000 2) Windows XP (SP2) (MSVC 7.1, /Ox, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 0.971000 C qsort with Python style compare: 0.962000 NumPy newqsort: 0.921000 3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000 4) Debian 4.2.2 (lenny) (gcc 4.2.3, -O3, Intel Pentium 4 @ 3.2 GHz) C qsort with C style compare: 1.770000 C qsort with Python style compare: 1.750000 NumPy newqsort: 0.440000 5) Mandriva 2008.0 (gcc 4.2.2, -O3, Intel Core2 Duo @ 1.5 GHz) C qsort with C style compare: 1.590000 C qsort with Python style compare: 1.550000 NumPy newqsort: 0.510000 6) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2.5 GHz) C qsort with C style compare: 1.890000 C qsort with Python style compare: 1.900000 NumPy newqsort: 0.500000 7) Ubuntu 7.1 (gcc 4.1.2, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 3.030000 C qsort with Python style compare: 2.970000 NumPy newqsort: 1.040000 8) MacOSX 10.4 (Tiger) (gcc 4.0.1, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 1.560000 C qsort with Python style compare: 1.510000 NumPy newqsort: 1.220000 All benchmarks have been run using the attached benchmark (if anybody else wants to join the fiesta, please report back your feedback). Summarizing, one can say a couple of things: * Recent Debian distros and derivatives (Ubuntu) as well as Mandriva are suffering from a innefficient system qsort (at least the implementation for strings). SuSe Linux Enterprise 10.3 seems to have solved this. And Windows XP (SP2) and MacOSX (Tiger) looks like they have a relatively efficient implementation of qsort. * The newqsort performs the best on all the platforms we have checked (ranging from a 5% of improvement on Opteron/SuSe, up to 3.8x with some Pentium4/Ubuntu systems). All in all, I'd also say that newqsort would be a good candidate to be put into NumPy. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 12, 2008 9:07 AM, Francesc Altet <faltet@carabos.com> wrote:
A Monday 11 February 2008, Charles R Harris escrigué:
I'll check it out when I get home. As I say, it was running about 10% slower on my machine, but if it does better on most platforms it is probably the way to go. We can always change it in the future when everyone is running on quantum computers.
We've done some testing on newqsort in several computers in our company. Here are the results for ordering a list with 1 million of strings of length 15 filled with random information (using C rand()):
1) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000
2) Windows XP (SP2) (MSVC 7.1, /Ox, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 0.971000 C qsort with Python style compare: 0.962000 NumPy newqsort: 0.921000
3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000
4) Debian 4.2.2 (lenny) (gcc 4.2.3, -O3, Intel Pentium 4 @ 3.2 GHz) C qsort with C style compare: 1.770000 C qsort with Python style compare: 1.750000 NumPy newqsort: 0.440000
5) Mandriva 2008.0 (gcc 4.2.2, -O3, Intel Core2 Duo @ 1.5 GHz) C qsort with C style compare: 1.590000 C qsort with Python style compare: 1.550000 NumPy newqsort: 0.510000
6) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2.5 GHz) C qsort with C style compare: 1.890000 C qsort with Python style compare: 1.900000 NumPy newqsort: 0.500000
7) Ubuntu 7.1 (gcc 4.1.2, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 3.030000 C qsort with Python style compare: 2.970000 NumPy newqsort: 1.040000
8) MacOSX 10.4 (Tiger) (gcc 4.0.1, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 1.560000 C qsort with Python style compare: 1.510000 NumPy newqsort: 1.220000
All benchmarks have been run using the attached benchmark (if anybody else wants to join the fiesta, please report back your feedback).
Summarizing, one can say a couple of things:
* Recent Debian distros and derivatives (Ubuntu) as well as Mandriva are suffering from a innefficient system qsort (at least the implementation for strings). SuSe Linux Enterprise 10.3 seems to have solved this. And Windows XP (SP2) and MacOSX (Tiger) looks like they have a relatively efficient implementation of qsort.
* The newqsort performs the best on all the platforms we have checked (ranging from a 5% of improvement on Opteron/SuSe, up to 3.8x with some Pentium4/Ubuntu systems).
The 3.8 is amazing, isn't it? I've found that the performance also depends on whether I initialize the strings with random or empty. With the random initialization the new sort is ~2x faster. That's fedora 8, core duo, 32 bit OS, gcc 4.1.2.
All in all, I'd also say that newqsort would be a good candidate to be put into NumPy.
I've merged some sorting tests preparatory to merging the new sorts. There is a release coming up this weekend, I don't know if it is tagged yet, but in any case I plan to merge the new sorts soon. Please help with the testing when I do. Now it's off to paying work. Chuck
A Tuesday 12 February 2008, Charles R Harris escrigué:
On Feb 12, 2008 9:07 AM, Francesc Altet <faltet@carabos.com> wrote:
* The newqsort performs the best on all the platforms we have checked (ranging from a 5% of improvement on Opteron/SuSe, up to 3.8x with some Pentium4/Ubuntu systems).
The 3.8 is amazing, isn't it? I've found that the performance also depends on whether I initialize the strings with random or empty. With the random initialization the new sort is ~2x faster. That's fedora 8, core duo, 32 bit OS, gcc 4.1.2.
Well, for me, a 3.8x (or even a 2x for that matter) of improvement is less amazing once you know that there is a flaw in C qsort for most of Linux distros around. Neither Windows, MacOSX or certain versions of Linux (namely, SuSe 10.3) does reflect such a large difference in performance. I'd say that newqsort that is to be included in next release of NumPy would be just a 10% better than using a sane implementation of C qsort. And, while a 10% is not as amazing than a 380%, the great news is that newqsort will provide first-class performance even to people having the flawed C qsort on their machines (which apparently are much more that I initially realized). Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Hi, I have a Opteron 248 (2.66GHz) that with gcc 4.1.0 (SUSE10.1?) that gives C qsort with C style compare: 0.650000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000 I did notice that -O3 was essential to get the performance gain as -O2 gave: C qsort with C style compare: 0.690000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.610000 Bruce On Feb 12, 2008 10:07 AM, Francesc Altet <faltet@carabos.com> wrote:
A Monday 11 February 2008, Charles R Harris escrigué:
I'll check it out when I get home. As I say, it was running about 10% slower on my machine, but if it does better on most platforms it is probably the way to go. We can always change it in the future when everyone is running on quantum computers.
We've done some testing on newqsort in several computers in our company. Here are the results for ordering a list with 1 million of strings of length 15 filled with random information (using C rand()):
1) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000
2) Windows XP (SP2) (MSVC 7.1, /Ox, Intel Pentium 4 @ 2 GHz) C qsort with C style compare: 0.971000 C qsort with Python style compare: 0.962000 NumPy newqsort: 0.921000
3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000
4) Debian 4.2.2 (lenny) (gcc 4.2.3, -O3, Intel Pentium 4 @ 3.2 GHz) C qsort with C style compare: 1.770000 C qsort with Python style compare: 1.750000 NumPy newqsort: 0.440000
5) Mandriva 2008.0 (gcc 4.2.2, -O3, Intel Core2 Duo @ 1.5 GHz) C qsort with C style compare: 1.590000 C qsort with Python style compare: 1.550000 NumPy newqsort: 0.510000
6) Ubuntu 7.1 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2.5 GHz) C qsort with C style compare: 1.890000 C qsort with Python style compare: 1.900000 NumPy newqsort: 0.500000
7) Ubuntu 7.1 (gcc 4.1.2, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 3.030000 C qsort with Python style compare: 2.970000 NumPy newqsort: 1.040000
8) MacOSX 10.4 (Tiger) (gcc 4.0.1, -O3, PowerPC 3 @ 1.3 GHz) C qsort with C style compare: 1.560000 C qsort with Python style compare: 1.510000 NumPy newqsort: 1.220000
All benchmarks have been run using the attached benchmark (if anybody else wants to join the fiesta, please report back your feedback).
Summarizing, one can say a couple of things:
* Recent Debian distros and derivatives (Ubuntu) as well as Mandriva are suffering from a innefficient system qsort (at least the implementation for strings). SuSe Linux Enterprise 10.3 seems to have solved this. And Windows XP (SP2) and MacOSX (Tiger) looks like they have a relatively efficient implementation of qsort.
* The newqsort performs the best on all the platforms we have checked (ranging from a 5% of improvement on Opteron/SuSe, up to 3.8x with some Pentium4/Ubuntu systems).
All in all, I'd also say that newqsort would be a good candidate to be put into NumPy.
Cheers,
--
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
On Feb 12, 2008 11:53 AM, Bruce Southey <bsouthey@gmail.com> wrote:
Hi,
I have a Opteron 248 (2.66GHz) that with gcc 4.1.0 (SUSE10.1?) that gives C qsort with C style compare: 0.650000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
I did notice that -O3 was essential to get the performance gain as -O2 gave: C qsort with C style compare: 0.690000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.610000
Try -O2 -finline-functions, it should come in somewhere between -O2 and -O3 Chuck
A Wednesday 13 February 2008, Charles R Harris escrigué:
OK,
The new quicksorts are in svn. Francesc, can you check them out?
Looks good here. However, you seem to keep using your own copy_string() instead of plain memcpy(). In previous benchmarks, I've seen that copy_string() is faster than memcpy only for small values of the length of the block to be copied. In order to do a better assessment of this affirmation, I've created a small benchmark (attached) in order to compare your copy_string against system memcpy when sorting array strings. I've also come up with a new function (copy_string2, attached), that tries to combine the best of copy_string and memcpy. Look at the attached plot in order to see how they behave. In the plot, you can see how memcpy is generally faster than copy_string, specially for relatively large string lengths. However, copy_string can be faster for small lengths (the maximum difference, a 20%, happens around 8/10 bytes). It may happen that you were doing time mesaurements whith strings of size 8, and you were driven to the wrong conclusion that copy_string was faster than memcpy. In case you think that performance for small string lengths is important, you may want to include copy_string2, that uses one method or another depending on the size block to be copied. There is of course a small impact in performance (one more condition test was introduced), but it is rather negligible. Finally, you also will have noticed the indirect sort line in the plot. This is because I was curious about when this method would win a direct sort. And, by looking at the plot, it seems that the crosspoint is around strings of 128 bytes (much more in fact that I initially thought), and starts to be very significant (around 40% faster) at 256 bytes. So perhaps it would make sense to add the possibility to choose the indirect method when sorting those large strings. This, of course, would require more memory for the indices, but using 4 or 8 additional bytes (depending if we on 32-bit or 64-bit), when each string takes 200 bytes, doesn't seem too crazy. In any case, it would be nice to document this in docstrings. Be warned, I'd like to stress out that these are my figures for my _own laptop_. It would be nice if you can verify all of this with other achitectures (your Core2 machine seems different enough). I can run the benchmarks on Windows (installed in the same laptop) too. Tell me if you are interested on me doing this. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 13, 2008 10:56 AM, Francesc Altet <faltet@carabos.com> wrote:
A Wednesday 13 February 2008, Charles R Harris escrigué:
OK,
The new quicksorts are in svn. Francesc, can you check them out?
Looks good here. However, you seem to keep using your own copy_string() instead of plain memcpy(). In previous benchmarks, I've seen that copy_string() is faster than memcpy only for small values of the length of the block to be copied.
Yes, I noticed that your benchmark program crossed over to using memcpy at 16 chars, and I will probably add that feature. I was being conservative to start with. <snip>
Finally, you also will have noticed the indirect sort line in the plot. This is because I was curious about when this method would win a direct sort. And, by looking at the plot, it seems that the crosspoint is around strings of 128 bytes (much more in fact that I initially thought), and starts to be very significant (around 40% faster) at 256 bytes. So perhaps it would make sense to add the possibility to choose the indirect method when sorting those large strings. This, of course, would require more memory for the indices, but using 4 or 8 additional bytes (depending if we on 32-bit or 64-bit), when each string takes 200 bytes, doesn't seem too crazy. In any case, it would be nice to document this in docstrings.
It would be easy to add this feature, but for the moment I think the best thing is to document it. Another fairly easy change that could be made is to support strided arrays. That might speed sorting of non-contiguous arrays and sorts on axis other than -1. The only reason it isn't there now is that I originally wrote the sorting routines for numarray and numarray's upper level interface passed contiguous arrays to the sort functions.
Be warned, I'd like to stress out that these are my figures for my _own laptop_. It would be nice if you can verify all of this with other achitectures (your Core2 machine seems different enough). I can run the benchmarks on Windows (installed in the same laptop) too. Tell me if you are interested on me doing this.
Its easy enough to test if you compile from svn, just add your new copy function and change the name in this line: #copy=copy_string, copy_ucs4# to use your function instead of copy_string. Chuck
A Tuesday 12 February 2008, Bruce Southey escrigué:
Hi,
I have a Opteron 248 (2.66GHz) that with gcc 4.1.0 (SUSE10.1?) that gives C qsort with C style compare: 0.650000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
That's very intersting. In a similar configuration, but using SuSe 10.3 (Enterprise) instead of 10.1, I don't see this factor of almost 2 of difference in performance (in fact, both performances, C qsort and NumPy newqsort, are very similar). This seems to confirm that the GNU glibc crew has fixed the qsort performance very recently (i.e. I hope it is not only a fix in SuSe 10.3 Enterprise), and this is why most of current distros are seeing the poor performance in qsort. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Hi, I added gcc 4.2 from the openSUSE 10.1 repository so I now have both the 4.1.2 and 4.2.1 compilers installed. But still have glibc-2.4-31.1 installed. I see your result with 4.2.1 but not with 4.1.2 so I think that there could be a difference in the compiler flags. I don't know enough about those to help but I can test any suggestions. $ gcc --version gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux) $ gcc -O3 sort-string-bench.c -o sort412 $ ./sort412 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.630000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000 $ gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (SUSE Linux) $ gcc-4.2 -O3 sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.550000 This is the same as: $ gcc-4.2 -O2 -finline-functions sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.710000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.550000 (NumPy newqsort with -O2 alone is 0.60000) For completeness, 4.1.2 using '-O2' versus '-O2 -finline-functions' is NumPy newqsort: 0.620000 vs NumPy newqsort: 0.500000 Regards Bruce On Feb 13, 2008 4:35 AM, Francesc Altet <faltet@carabos.com> wrote:
A Tuesday 12 February 2008, Bruce Southey escrigué:
Hi,
I have a Opteron 248 (2.66GHz) that with gcc 4.1.0 (SUSE10.1?) that gives C qsort with C style compare: 0.650000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
That's very intersting. In a similar configuration, but using SuSe 10.3 (Enterprise) instead of 10.1, I don't see this factor of almost 2 of difference in performance (in fact, both performances, C qsort and NumPy newqsort, are very similar).
This seems to confirm that the GNU glibc crew has fixed the qsort performance very recently (i.e. I hope it is not only a fix in SuSe 10.3 Enterprise), and this is why most of current distros are seeing the poor performance in qsort.
Cheers,
--
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
On Feb 13, 2008 9:19 AM, Bruce Southey <bsouthey@gmail.com> wrote:
Hi, I added gcc 4.2 from the openSUSE 10.1 repository so I now have both the 4.1.2 and 4.2.1 compilers installed. But still have glibc-2.4-31.1 installed. I see your result with 4.2.1 but not with 4.1.2 so I think that there could be a difference in the compiler flags. I don't know enough about those to help but I can test any suggestions.
$ gcc --version gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux) $ gcc -O3 sort-string-bench.c -o sort412 $ ./sort412 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.630000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
$ gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (SUSE Linux) $ gcc-4.2 -O3 sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.550000
This is the same as: $ gcc-4.2 -O2 -finline-functions sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.710000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.550000
(NumPy newqsort with -O2 alone is 0.60000)
For completeness, 4.1.2 using '-O2' versus '-O2 -finline-functions' is NumPy newqsort: 0.620000 vs NumPy newqsort: 0.500000
It's certainly interesting how much difference the compiler/flags make. I suppose one more thing to add to the mix is the -march and -mtune flags. They didn't make much difference here, but they might for someone else. It would also be interesting to see how a 64 bit system handles things. /lib/libgcc_s-4.1.2 gcc 4.1.2, -O3 -march=prescott -mtune=generic Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.940000 C qsort with Python style compare: 0.940000 NumPy newqsort: 0.310000 I suppose much also depends on the flags with which libgcc is compiled, which in turn probably depends on the distro. Chuck
A Wednesday 13 February 2008, Bruce Southey escrigué:
Hi, I added gcc 4.2 from the openSUSE 10.1 repository so I now have both the 4.1.2 and 4.2.1 compilers installed. But still have glibc-2.4-31.1 installed. I see your result with 4.2.1 but not with 4.1.2 so I think that there could be a difference in the compiler flags. I don't know enough about those to help but I can test any suggestions.
$ gcc --version gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux) $ gcc -O3 sort-string-bench.c -o sort412 $ ./sort412 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.630000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
$ gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (SUSE Linux) $ gcc-4.2 -O3 sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.550000
This is the same as: $ gcc-4.2 -O2 -finline-functions sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.710000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.550000
(NumPy newqsort with -O2 alone is 0.60000)
For completeness, 4.1.2 using '-O2' versus '-O2 -finline-functions' is NumPy newqsort: 0.620000 vs NumPy newqsort: 0.500000
That's really interesting. Let me remember my figures for our Opteron: 3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000 Yours is running at a clock rate of 2.66 GHz and mine at 2 GHz, so that makes yours a 33% faster. With this, I'd expect, for newqsort in your machine and using gcc 4.2.1 something like 0.59/1.33 = 0.44s. Of course, this is not the case, and you are getting 0.55s, which is only a 7% faster. This mostly reflects the fact that newqsort is bounded by other things than CPU speed (most probably, memory latency, but I might be wrong). But the most important thing is that it turns out that gcc 4.2.1 is doing a much worse job in optimizing newqsort than gcc 4.1.2 (at least on Opterons). That is unfortunate, because the similar performance of C qsort and newqsort on SuSe 10.3 made me think that it was due to the fact that SuSe speed-up the C qsort. I see now that this is not the case, and the problem seems that gcc 4.2.1 is not able to optimize newqsort as much as 4.1.2. So, it is becoming more and more clear that newqsort is potentially much faster that C qsort: you have seen a 2x of speedup, Chuck a 3x and me up to a 3.8x. The only issue seems to find a good enough compiler (or find the correct flags) to take advantage of all of its potential, which doesn't seem an easy thing indeed :-/ Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
A Wednesday 13 February 2008, Francesc Altet escrigué:
A Wednesday 13 February 2008, Bruce Southey escrigué:
Hi, I added gcc 4.2 from the openSUSE 10.1 repository so I now have both the 4.1.2 and 4.2.1 compilers installed. But still have glibc-2.4-31.1 installed. I see your result with 4.2.1 but not with 4.1.2 so I think that there could be a difference in the compiler flags. I don't know enough about those to help but I can test any suggestions.
$ gcc --version gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux) $ gcc -O3 sort-string-bench.c -o sort412 $ ./sort412 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.630000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
$ gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (SUSE Linux) $ gcc-4.2 -O3 sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.550000
This is the same as: $ gcc-4.2 -O2 -finline-functions sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.710000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.550000
(NumPy newqsort with -O2 alone is 0.60000)
For completeness, 4.1.2 using '-O2' versus '-O2 -finline-functions' is NumPy newqsort: 0.620000 vs NumPy newqsort: 0.500000
That's really interesting. Let me remember my figures for our Opteron:
3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000
Just an addedum. I've compiled the benchmark using gcc 4.1.2 using our Opteron machine. Here are the results: SuSe LE 10.3 (gcc 4.1.2, -O3, AMD Opteron @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.380000 So, I'm getting a 55% more of performance than by using gcc 4.2.1 (!). Also, I've installed gcc 4.2.1 on my laptop and here are the results: Ubuntu 7.10 (gcc 4.2.1, -O3, Intel Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.420000 NumPy newqsort: 0.630000 While using gcc 4.1.2, I get: Ubuntu 7.10 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000 So, in this case (32-bit platform) gcc 4.2.1 seems to perform similarly to 4.1.2. So, I'd say that the guilty is the gcc 4.2.1, 64-bit (or at very least, AMD Opteron architecture) and that newqsort performs really well in general (provided that the compiler can find the best path for optimizing its code). Anyone using a 64-bit platform and having both gcc 4.1.2 and 4.2.1 installed can confirm this? Also, MSVC 7.1 32-bit (with opt level /Ox) doesn't seem to find such a best path (the benchmark for newqsort takes 0.92s using MSVC 7.1, while gcc 4.1.2 takes 0.65s using the same machine, a 40% faster). I don't know whether newer versions of MSVC will do better or not, though. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 13, 2008 12:37 PM, Francesc Altet <faltet@carabos.com> wrote:
A Wednesday 13 February 2008, Francesc Altet escrigué:
A Wednesday 13 February 2008, Bruce Southey escrigué:
Hi, I added gcc 4.2 from the openSUSE 10.1 repository so I now have both the 4.1.2 and 4.2.1 compilers installed. But still have glibc-2.4-31.1 installed. I see your result with 4.2.1 but not with 4.1.2 so I think that there could be a difference in the compiler flags. I don't know enough about those to help but I can test any suggestions.
$ gcc --version gcc (GCC) 4.1.2 20070115 (prerelease) (SUSE Linux) $ gcc -O3 sort-string-bench.c -o sort412 $ ./sort412 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.630000 C qsort with Python style compare: 0.640000 NumPy newqsort: 0.360000
$ gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (SUSE Linux) $ gcc-4.2 -O3 sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.550000
This is the same as: $ gcc-4.2 -O2 -finline-functions sort-string-bench.c -o sort421 $ ./sort421 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.710000 C qsort with Python style compare: 0.700000 NumPy newqsort: 0.550000
(NumPy newqsort with -O2 alone is 0.60000)
For completeness, 4.1.2 using '-O2' versus '-O2 -finline-functions' is NumPy newqsort: 0.620000 vs NumPy newqsort: 0.500000
That's really interesting. Let me remember my figures for our Opteron:
3) SuSe LE 10.3 (gcc 4.2.1, -O3, AMD Opteron @ 2 GHz) C qsort with C style compare: 0.640000 C qsort with Python style compare: 0.600000 NumPy newqsort: 0.590000
Just an addedum. I've compiled the benchmark using gcc 4.1.2 using our Opteron machine. Here are the results:
SuSe LE 10.3 (gcc 4.1.2, -O3, AMD Opteron @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.620000 C qsort with Python style compare: 0.610000 NumPy newqsort: 0.380000
So, I'm getting a 55% more of performance than by using gcc 4.2.1 (!). Also, I've installed gcc 4.2.1 on my laptop and here are the results:
Ubuntu 7.10 (gcc 4.2.1, -O3, Intel Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.420000 NumPy newqsort: 0.630000
While using gcc 4.1.2, I get:
Ubuntu 7.10 (gcc 4.1.3, -O3, Intel Pentium 4 @ 2 GHz) Benchmark with 1000000 strings of size 15 C qsort with C style compare: 2.450000 C qsort with Python style compare: 2.440000 NumPy newqsort: 0.650000
So, in this case (32-bit platform) gcc 4.2.1 seems to perform similarly to 4.1.2.
So, I'd say that the guilty is the gcc 4.2.1, 64-bit (or at very least, AMD Opteron architecture) and that newqsort performs really well in general (provided that the compiler can find the best path for optimizing its code). Anyone using a 64-bit platform and having both gcc 4.1.2 and 4.2.1 installed can confirm this?
Also, MSVC 7.1 32-bit (with opt level /Ox) doesn't seem to find such a best path (the benchmark for newqsort takes 0.92s using MSVC 7.1, while gcc 4.1.2 takes 0.65s using the same machine, a 40% faster). I don't know whether newer versions of MSVC will do better or not, though.
Now we need someone to try ICC ;) Chuck
On Wednesday 13 February 2008 02:37:37 pm Francesc Altet wrote:
So, I'd say that the guilty is the gcc 4.2.1, 64-bit (or at very least, AMD Opteron architecture) and that newqsort performs really well in general (provided that the compiler can find the best path for optimizing its code). Anyone using a 64-bit platform and having both gcc 4.1.2 and 4.2.1 installed can confirm this?
Here are results from a 64-bit Debian system using a Core2 Duo 2.66 GHz processor. I used gcc 3.4.6, 4.1.3, 4.2.3, and 4.3.0 (20080202 experimental) with -O2 and -O3. Summary: There is a big difference between -02 and -O3. gcc-4.2 seems slightly better than the other gccs. And the newqsort is a lot faster (always) than the libc version. Scott eiger:/data1$ ./sort346_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.450000 eiger:/data1$ ./sort346_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.520000 NumPy newqsort: 0.350000 eiger:/data1$ ./sort413_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.560000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.420000 eiger:/data1$ ./sort413_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.540000 C qsort with Python style compare: 0.500000 NumPy newqsort: 0.280000 eiger:/data1$ ./sort423_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.560000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.390000 eiger:/data1$ ./sort423_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.530000 C qsort with Python style compare: 0.500000 NumPy newqsort: 0.270000 eiger:/data1$ ./sort43_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.340000 eiger:/data1$ ./sort43_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.530000 C qsort with Python style compare: 0.510000 NumPy newqsort: 0.330000 -- Scott M. Ransom Address: NRAO Phone: (434) 296-0320 520 Edgemont Rd. email: sransom@nrao.edu Charlottesville, VA 22903 USA GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989
A Wednesday 13 February 2008, Scott Ransom escrigué:
On Wednesday 13 February 2008 02:37:37 pm Francesc Altet wrote:
So, I'd say that the guilty is the gcc 4.2.1, 64-bit (or at very least, AMD Opteron architecture) and that newqsort performs really well in general (provided that the compiler can find the best path for optimizing its code). Anyone using a 64-bit platform and having both gcc 4.1.2 and 4.2.1 installed can confirm this?
Here are results from a 64-bit Debian system using a Core2 Duo 2.66 GHz processor.
I used gcc 3.4.6, 4.1.3, 4.2.3, and 4.3.0 (20080202 experimental) with -O2 and -O3.
Summary: There is a big difference between -02 and -O3. gcc-4.2 seems slightly better than the other gccs. And the newqsort is a lot faster (always) than the libc version.
Scott
eiger:/data1$ ./sort346_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.450000
eiger:/data1$ ./sort346_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.520000 NumPy newqsort: 0.350000
eiger:/data1$ ./sort413_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.560000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.420000
eiger:/data1$ ./sort413_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.540000 C qsort with Python style compare: 0.500000 NumPy newqsort: 0.280000
eiger:/data1$ ./sort423_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.560000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.390000
eiger:/data1$ ./sort423_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.530000 C qsort with Python style compare: 0.500000 NumPy newqsort: 0.270000
eiger:/data1$ ./sort43_O2 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.550000 C qsort with Python style compare: 0.530000 NumPy newqsort: 0.340000
eiger:/data1$ ./sort43_O3 Benchmark with 1000000 strings of size 15 C qsort with C style compare: 0.530000 C qsort with Python style compare: 0.510000 NumPy newqsort: 0.330000
Thanks Scott. Your input is very valuable, as it seems to confirm that the problem must be on gcc 4.2.1 on 64-bit (or Opteron architecture at very least) because apparently your gcc 4.2.3 is doing very well. It's a pity that I don't have a 4.2.3 available in our SuSe/Opteron machine so as to check if the optimization flaw disappears. But it seems to me that the problem could be specific of 4.2.1, and apparently the GCC crew has fixed the problem in 4.2.3, which is a relief. In any case, if anybody have access to an Opteron machine and gcc 4.2.3, it would be great if he can run the benchmark and contribute his feedback. Cheers, --
0,0< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data "-"
On Feb 11, 2008 2:58 AM, Francesc Altet <faltet@carabos.com> wrote:
A Monday 11 February 2008, Charles R Harris escrigué:
On Feb 8, 2008 5:29 AM, Francesc Altet <faltet@carabos.com> wrote:
Hi,
I'm a bit confused that the sort method of a string character doesn't
allow a mergesort:
s = numpy.empty(10, "S10") s.sort(kind="merge")
TypeError: desired sort not supported for this type
However, by looking at the numpy sources, it seems that the only implemented method for sorting array strings is "merge" (I presume because it is stable). So, perhaps the message above should be fixed.
The only available method is the default quicksort. The mergesort is for argsort and was put in for lexsort to use.
That's good to know. However, I'm curious about where it is the specific quicksort implementation for strings/unicode. I've had a look at _sortmodule.c.src, but I can only find a quicksort implementation for:
The default is the C qsort, it is called from PyArray_Sort in multiarraymodule.c. The type specific sorts, when they exist, are also called from there through _new_sort. To see what type specific sorts are registered, look at the end of _sortmodule.c.src. You can write a new sort, and if it isn't registered it won't be used. So commenting out the registration is a good way to get back to the default.
The version you are testing is your own or the one that I provided? Here are the timings for my laptop:
They turned out to be essentially identical, except I used len instead of ss ;) I also used inlined functions for the copy and swap as they are safer with the arguments. Comparison with the macro versions showed no difference there.
In [55]: a = np.random.rand(10000).astype('S8')
In [56]: %timeit a.copy().sort() 100 loops, best of 3: 3.82 ms per loop
In [57]: %timeit newqsort(a.copy()) 100 loops, best of 3: 3.29 ms per loop
Here, the difference in performance has been reduced to a mere 15% (still favouring newqsort). So, with this, it seems like the performance of the original sorting in NumPy only suffers a lot when running in old processors (eg. Pentium 4), while the performance is reasonable with newer ones (Opteron).
It could also depend on the C library that comes with the compiler. I'm running on a E6600, but numpy compiles everything for the i386, which might make a difference also. Chuck
participants (12)
-
Albert Strasheim -
Bruce Southey -
Charles R Harris -
Damian Eads -
David Cournapeau -
David Cournapeau -
David Cournapeau -
Francesc Altet -
Jon Wright -
Lou Pecora -
Robert Kern -
Scott Ransom