[Numpy-discussion] String sort

Sat Feb 9 13:50:38 EST 2008

On Saturday 09 February 2008, Charles R Harris wrote:
> > Well, for the unicode case it wouldn't be enough by replacing
> > 'char' by 'Py_ArrayUCS4'?  Maybe this afternoon I can do some
> > benchmarking too in this regard.
>
> Looks like that for Numpy. The problem I was thinking about is that
> for wide characters Windows C defaults to UTF16 while the Unixes
> default to UTF32.

If only it were so simple ;-)  The fact is that the Python crew delivers 
the tarballs ready to compile with UCS2 as the default, and this 
applies to both UNIX and Windows.  However, some Linux distributions 
(most notably Debian and its derivatives) have chosen to make UCS4 
the default in their Python packages.

This is not a (big) problem in itself, but when it comes to writing 
arrays to disk and hoping for portability (not only across different 
platforms, but also across Python interpreters built with different UCS 
widths on the same machine!), we realized that this was a real problem 
(see the discussion in [1]).  So NumPy had to make a decision in that 
regard, and Travis finally opted to support only the UCS4 charset in 
NumPy [2].  He also left the door open to possible UCS2 implementations 
in NumPy in the future, but that would be a real pain, IMHO.

[1] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006081.html
[2] http://projects.scipy.org/pipermail/numpy-discussion/2006-February/006130.html

So, at least for the time being, you only have to worry about UCS4.

> The C99 standard didn't specify the exact length, 
> but Numpy seems to use (or assume) UTF32.

Well, I should say that UTF32 and UCS4 are names referring to the same 
thing, but most literature (and especially package configuration 
procedures) talks about UCS4.
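
To make the "same thing" point concrete, here is a minimal sketch in 
plain Python (the variable names are just for illustration).  It 
assumes only the standard "utf-32-le" codec, and shows that every 
character takes exactly 4 bytes and that each 4-byte unit is simply 
the code point number:

```python
import struct

text = "a\u00f1\u20ac"           # 'a', n-tilde, euro sign
raw = text.encode("utf-32-le")   # little-endian UTF-32, no BOM

# Fixed width: 4 bytes per character, whatever the character is.
bytes_per_char = len(raw) // len(text)

# Each 4-byte unit, read as an unsigned 32-bit integer, is the code point.
units = struct.unpack("<%dI" % len(text), raw)
code_points = [ord(c) for c in text]

print(bytes_per_char, list(units) == code_points)
```

This fixed 4-byte layout is what makes it reasonable to think about 
swapping 'char' for a 4-byte type in the comparison loop.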

> Anyway, after doing some work to fool the optimizer and subtracting
> loop overhead, strncmp still comes out a bit faster for me, 11e-9 vs
> 16e-9 seconds to compare strings of length 10. I've attached the
> program. Note that on my machine malloc appears to return zeroed
> memory, so the string compares always go to the end.

I've seen the benchmark, and the problem is that C strncmp stops 
checking when it finds a \0 in the first string, while strncmp1 has to 
check the complete set of chars in the strings.  However, you don't 
really want to do C string comparisons with NumPy strings:

In [35]: ns1 = numpy.array("as\0as")

In [36]: ns2 = numpy.array("as\0bs")

In [37]: ns1 == ns2
Out[37]: array(False, dtype=bool)

In [38]: ns1 < ns2
Out[38]: array(True, dtype=bool)

or, with Python strings, in general:

In [39]: ns1 = "as\0as"

In [40]: ns2 = "as\0bs"

In [41]: ns1 == ns2
Out[41]: False

In [42]: ns1 < ns2
Out[42]: True

As you see, Python/NumPy strings are different beasts from C strings in 
that regard.  C strings always end with a \0 (NUL) character, while in 
Python/NumPy the end is defined by a length property (btw, the same as 
in Pascal, if you know it).

So, strncmp1 is not only faster than its C counterpart, but also the one 
doing the correct job with NumPy (unicode) strings.
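
The difference between the two semantics can be sketched in plain 
Python.  Note that c_style_compare and length_based_compare are 
hypothetical helpers written only for this illustration; they are not 
the strncmp1 from the attached benchmark nor NumPy's actual code:

```python
def c_style_compare(a, b, n):
    """Compare like C strncmp: at most n chars, stopping at the first NUL."""
    for i in range(n):
        # Reading past the end of a Python string is treated as hitting NUL.
        ca = ord(a[i]) if i < len(a) else 0
        cb = ord(b[i]) if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
        if ca == 0:
            return 0  # both strings "end" here; the rest is ignored
    return 0


def length_based_compare(a, b, n):
    """Compare all n chars; an embedded NUL is just another character."""
    for i in range(n):
        ca = ord(a[i]) if i < len(a) else 0
        cb = ord(b[i]) if i < len(b) else 0
        if ca != cb:
            return -1 if ca < cb else 1
    return 0


s1, s2 = "as\0as", "as\0bs"
c_result = c_style_compare(s1, s2, 5)          # stops at the embedded \0
len_result = length_based_compare(s1, s2, 5)   # sees 'a' < 'b' after it
print(c_result, len_result)
```

With the two strings from the session above, the C-style version 
reports them as equal, while the length-based one orders them the way 
Python and NumPy do.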

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"