On Feb 9, 2008 11:50 AM, Francesc Altet <faltet@carabos.com> wrote:
On Saturday 09 February 2008, Charles R Harris wrote:
Well, for the unicode case, wouldn't it be enough to replace 'char' with 'Py_ArrayUCS4'? Maybe this afternoon I can do some benchmarking in this regard too.
Looks like that for Numpy. The problem I was thinking about is that for wide characters Windows C defaults to UTF16 while the Unixes default to UTF32.
If only it were so simple ;-) The fact is that the Python crew delivers tarballs that compile with UCS2 as the default, and this applies to both UNIX and Windows. However, some Linux distributions (most notably Debian and its derivatives) have chosen to make UCS4 the default in their Python packages.
This is not a (big) problem in itself, but when it came to writing arrays to disk and hoping for portability (not only across platforms, but also across Python interpreters built with different UCS widths on the same machine!), we realized that this was a real problem (see the discussion in [1]). So NumPy had to make a decision in that regard, and Travis finally opted to support only the UCS4 charset in NumPy [2]. He also left the door open to possible UCS2 implementations in NumPy in the future, but that would be a real pain, IMHO.
[1]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.ht...
[2]http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.ht...
So, at least for the time being, you only have to worry about UCS4.
The C99 standard didn't specify the exact length, but Numpy seems to use (or assume) UTF32.
Well, I should say that UTF32 and UCS4 are names referring to the same thing, but most literature (and especially package configuration procedures) talks about UCS4.
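To make the width issue concrete, here is a minimal sketch in plain Python (not NumPy code) that counts code units per character under a 2-byte and a 4-byte encoding. The helper name `code_units` is made up for illustration, and the standard `utf-16-le` codec is used as a stand-in for a narrow (UCS2-style) build, which is an assumption rather than an exact model of one.

```python
def code_units(text, encoding):
    """Return how many fixed-size code units `text` occupies in `encoding`."""
    # Width in bytes of one code unit for the two encodings compared here.
    width = {"utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // width


bmp_char = "a"              # inside the Basic Multilingual Plane
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

for ch in (bmp_char, astral_char):
    print(repr(ch),
          "UTF-16 units:", code_units(ch, "utf-16-le"),
          "UTF-32 units:", code_units(ch, "utf-32-le"))
```

With a 4-byte representation every character takes exactly one unit, so an array of fixed-size items has the same layout on every platform. With 2-byte units a character outside the BMP needs a surrogate pair, which is one reason a single on-disk width is easier to keep portable.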
Anyway, after doing some work to fool the optimizer and subtracting loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs 16e-9 seconds to compare strings of length 10. I've attached the program. Note that on my machine malloc appears to return zeroed memory, so the string compares always go to the end.
I've seen the benchmark, and the problem is that C strncmp stops checking when it finds a \0 in the first string, while strncmp1 has to check the complete set of chars in the strings. However, you don't really want to do C string comparisons with NumPy strings:
In [35]: ns1 = numpy.array("as\0as")
In [36]: ns2 = numpy.array("as\0bs")
In [37]: ns1 == ns2
Out[37]: array(False, dtype=bool)
In [38]: ns1 < ns2
Out[38]: array(True, dtype=bool)
or, with Python strings, in general:
In [39]: ns1 = "as\0as"
In [40]: ns2 = "as\0bs"
In [41]: ns1 == ns2
Out[41]: False
In [42]: ns1 < ns2
Out[42]: True
As you see, Python/NumPy strings are different beasts from C strings in that regard. C strings always end with a \0 (NUL) character, while in Python/NumPy the end is defined by a length property (btw, the same as in Pascal, if you know it).
So, strncmp1 is not only faster than its C counterpart, but also the one doing the correct job with NumPy (unicode) strings.
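The difference between the two comparison rules can be sketched in plain Python. Neither function below is the code from the attached benchmark: `c_strncmp` is a hypothetical emulation of the C library semantics, and `fixed_length_cmp` is an assumed stand-in for a comparison that looks at all n items regardless of embedded zeros.

```python
def c_strncmp(a: bytes, b: bytes, n: int) -> int:
    """Emulate C strncmp: compare at most n bytes, stop at the first NUL."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
        if ca == 0:
            # Both strings ended here as far as C is concerned.
            return 0
    return 0


def fixed_length_cmp(a: bytes, b: bytes, n: int) -> int:
    """Compare all n items; a NUL is an ordinary value, not a terminator."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
    return 0


s1, s2 = b"as\0as", b"as\0bs"
print(c_strncmp(s1, s2, 5))         # 0: the comparison stops at the embedded NUL
print(fixed_length_cmp(s1, s2, 5))  # -1: the fourth item differs, 'a' < 'b'
```

The second result matches what the Python and NumPy sessions above report (ns1 != ns2 and ns1 < ns2), while the NUL-terminated rule would call the two strings equal.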
Ah, in that case the current indirect sort for NumPy strings, which uses strncmp, is incorrect and needs to be fixed. It seems that strings with zeros are not part of the current test series ;)
Chuck
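As a rough illustration of the kind of test case that is missing, the sketch below runs an indirect sort (an argsort) over strings with embedded zeros in plain Python. It is not NumPy's sort code: `argsort` and `c_string_view` are hypothetical helpers, and truncating at the first NUL is only an assumed model of what a strncmp-based comparison effectively sees.

```python
def argsort(items, key=None):
    """Indirect sort: return the indices that would sort `items`."""
    if key is None:
        key = lambda item: item
    # sorted() is stable, so equal keys keep their original relative order.
    return sorted(range(len(items)), key=lambda i: key(items[i]))


def c_string_view(item: bytes) -> bytes:
    """What a NUL-terminated comparison sees: everything before the first NUL."""
    return item.split(b"\0", 1)[0]


data = [b"as\0bs", b"as\0as", b"ar\0zz"]

length_aware = argsort(data)                       # compares every byte
nul_terminated = argsort(data, key=c_string_view)  # ignores bytes after the NUL

print(length_aware)    # [2, 1, 0]
print(nul_terminated)  # [2, 0, 1]
```

The two orderings disagree on the pair that differs only after the embedded zero, so a test series containing such strings would expose a comparison that stops at the first NUL.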