[Numpy-discussion] String sort
Francesc Altet
faltet at carabos.com
Sat Feb 9 13:50:38 EST 2008
On Saturday 09 February 2008, Charles R Harris wrote:
> > Well, for the unicode case, wouldn't it be enough to replace
> > 'char' by 'Py_ArrayUCS4'? Maybe this afternoon I can do some
> > benchmarking too in this regard.
>
> Looks like that for Numpy. The problem I was thinking about is that
> for wide characters Windows C defaults to UTF16 while the Unixes
> default to UTF32.
If only it were so simple ;-) The fact is that the Python crew delivers
the tarballs ready to compile with UCS2 as the default, and this
applies to both UNIX and Windows. However, some Linux distributions
(most notably Debian and its derivatives) have chosen to make UCS4
the default in their Python packages.
This is not a (big) problem in itself, but when it came to writing
arrays to disk and hoping for portability (not only across different
platforms, but also across Python interpreters built with different UCS
widths on the same machine!), we realized that it was a real problem
(see the discussion in [1]). So, NumPy had to make a decision in that
regard, and Travis finally opted to support only the UCS4 charset in
NumPy [2]. He also left the door open to possible UCS2 implementations
in NumPy in the future, but that would be a real pain, IMHO.
[1] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.html
[2] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.html
So, at least for the time being, you only have to worry about UCS4.
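To make the portability concern concrete, here is a minimal sketch using only Python's standard codecs. It is an illustration, not NumPy's on-disk format: UTF-16 stands in for what a UCS2 build stores, UTF-32 for a UCS4 build, and the helper name code_units is made up for this example.

```python
# Illustration only: count fixed-size code units per character in a
# 16-bit encoding versus a 32-bit one.
bmp_char = "\u00e9"          # e with acute accent, inside the Basic Multilingual Plane
astral_char = "\U0001D11E"   # musical G clef, outside the BMP


def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Return how many unit_size-byte code units `text` needs in `encoding`."""
    return len(text.encode(encoding)) // unit_size


for name, ch in (("BMP char", bmp_char), ("non-BMP char", astral_char)):
    units16 = code_units(ch, "utf-16-le", 2)   # 16-bit units
    units32 = code_units(ch, "utf-32-le", 4)   # 32-bit units
    print(name, "16-bit units:", units16, "32-bit units:", units32)
```

With 32-bit units every character occupies exactly one slot, so a fixed-width array element has the same layout everywhere. With 16-bit units a character outside the BMP needs two slots, which is part of what makes mixing the two widths painful.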
> The C99 standard didn't specify the exact length,
> but Numpy seems to use (or assume) UTF32.
Well, I should say that UTF32 and UCS4 are names referring to the same
thing, but most literature (and especially package configuration
procedures) talks about UCS4.
> Anyway, after doing some work to fool the optimizer and subtracting
> loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs
> 16e-9 seconds to compare strings of length 10. I've attached the
> program. Note that on my machine malloc appears to return zeroed
> memory, so the string compares always go to the end.
I've looked at the benchmark, and the problem is that C strncmp stops
checking when it finds a \0 in the first string, while strncmp1 has to
check the complete set of chars in the strings. However, you don't
really want to do C string comparisons with NumPy strings:
In [35]: ns1 = numpy.array("as\0as")
In [36]: ns2 = numpy.array("as\0bs")
In [37]: ns1 == ns2
Out[37]: array(False, dtype=bool)
In [38]: ns1 < ns2
Out[38]: array(True, dtype=bool)
or, with Python strings, in general:
In [39]: ns1 = "as\0as"
In [40]: ns2 = "as\0bs"
In [41]: ns1 == ns2
Out[41]: False
In [42]: ns1 < ns2
Out[42]: True
As you can see, Python/NumPy strings are different beasts from C strings
in that regard. C strings always end with a \0 (NUL) character, while in
Python/NumPy the end is defined by a length property (btw, the same as
in Pascal, if you know it).
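The difference can be sketched in pure Python. Both functions below are hypothetical stand-ins written for this illustration: c_style_strncmp mimics the stop-at-NUL behaviour of C's strncmp, and length_aware_cmp mimics a comparison that checks all n positions, which is what a routine like strncmp1 has to do. Neither is the actual code from the benchmark.

```python
def c_style_strncmp(a: bytes, b: bytes, n: int) -> int:
    """Compare at most n bytes, stopping at the first NUL, like C strncmp."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0   # treat the end of the buffer as NUL
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
        if ca == 0:                      # terminator reached: strings are "equal"
            return 0
    return 0


def length_aware_cmp(a: bytes, b: bytes, n: int) -> int:
    """Compare all n positions; NUL is treated as an ordinary character."""
    for i in range(n):
        ca = a[i] if i < len(a) else 0   # short buffers are NUL-padded
        cb = b[i] if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
    return 0


s1 = b"as\0as"
s2 = b"as\0bs"
print(c_style_strncmp(s1, s2, 5))    # 0: the embedded NUL hides the difference
print(length_aware_cmp(s1, s2, 5))   # -1: s1 sorts before s2
```

The second result agrees with the sessions above, where "as\0as" compares as different from, and smaller than, "as\0bs".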
So, strncmp1 is not only faster than its C counterpart, but also the one
doing the correct job with NumPy (unicode) strings.
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"