[Numpy-discussion] Extent of unicode types in numpy

Francesc Altet faltet at carabos.com
Mon Feb 6 10:25:07 EST 2006


Hi,

I'm a bit surprised by the fact that unicode types are the only ones that
break the rule that a type is specified with the number of bytes it
actually takes. For example:

In [120]:numpy.dtype([('x','c16')])
Out[120]:dtype([('x', '<c16')])

In [121]:numpy.dtype([('x','S16')])
Out[121]:dtype([('x', '|S16')])

but:

In [119]:numpy.dtype([('x','U4')])
Out[119]:dtype([('x', '<U16')])

Even worse:

In [126]:numpy.dtype(numpy.dtype('u4').str)
Out[126]:dtype('<u4')

but:

In [125]:numpy.dtype(numpy.dtype('U4').str)
Out[125]:dtype('<U64')   # !!!!

which can quickly lead to problems in users' code.
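
To make the problem concrete, here is a small sketch of how a user's code
could break (it assumes the behaviour shown above, where the 'U' count is
re-interpreted on every pass):

import numpy

# a user asks for a 4-character unicode field
dt = numpy.dtype('U4')

# ... and saves its string form somewhere, e.g. in a file header
saved = dt.str                 # '<U16' with the behaviour shown above

# later the dtype is rebuilt from that saved string
dt2 = numpy.dtype(saved)

print(dt.itemsize)             # 16
print(dt2.itemsize)            # 64 -- the round trip has inflated the field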

I think that, for the sake of consistency, and exactly as the user must
know that a 'c16' is a complex number taking 16 bytes, he should know that
a unicode character takes 4 bytes. With this, we would have:

In [119]:numpy.dtype([('x','U4')])
Out[119]:dtype([('x', '<U4')])

and forbid unicode lengths that are not a multiple of 4. I know that,
initially, it would be a bit strange for the user to specify 'S4' for a
string of 4 chars but 'U16' for a unicode string of 4 chars, but hopefully
he would soon get used to this.

The only problem I see with what I'm proposing is that I don't know
whether unicode always takes 4 bytes on all platforms (--> 64-bit issues?).
OTOH, I thought that Python represented unicode strings internally with
16-bit chars. Oh well, I'm a bit lost on this. Can anybody shed some light?
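
For what it's worth, something along these lines (assuming a standard
CPython build) should show what a given interpreter and numpy build
actually do:

import sys
import numpy

# 0xFFFF (65535) means a UCS-2 build, 0x10FFFF (1114111) a UCS-4 build
print(sys.maxunicode)

# how many bytes numpy reserves per unicode character on this platform
print(numpy.dtype('U1').itemsize)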

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"