[Numpy-discussion] Extent of unicode types in numpy

Tue Feb 7 12:37:03 EST 2006

Francesc Altet wrote:

>On Tue, 07 Feb 2006 at 12:26 -0700, Travis Oliphant wrote:
>
>>Python itself hands us this difference.  Is it really so different than 
>>the fact that Python integers are either 32-bit or 64-bit depending on 
>>the platform?
>>
>>Perhaps what this is telling us is that we do indeed need another 
>>data-type for 4-byte unicode.  It's how we solve the problem of 32-bit 
>>or 64-bit integers (we have a 64-bit integer on all platforms).
>
>Agreed.
>
>>Then in NumPy we can support going back and forth between UCS-2 (which 
>>we can then say is UTF-16) and UCS-4.
>
>If this could be implemented, then excellent!
>
Sure it could be implemented.  It's just a matter of effort.  Python 
itself always defines a Py_UCS4 type even on UCS2 builds.  We would just 
have to make sure Py_UCS2 is always defined as well. 

The biggest hassle is implementing the corresponding scalar type.  The 
one that matches the Python build comes for free.  The other would 
have to be implemented directly.
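
To make "going back and forth" between UTF-16 and UCS-4 concrete, here 
is a minimal sketch in Python 3.  The function names ucs4_to_utf16 and 
utf16_to_ucs4 are made up for this illustration (they are not NumPy or 
Python API), and the code assumes well-formed input with no lone 
surrogates.

```python
def ucs4_to_utf16(code_points):
    """Convert a sequence of UCS-4 code points to UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)  # fits in a single 16-bit unit
        else:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
    return units


def utf16_to_ucs4(units):
    """Convert UTF-16 code units back to UCS-4 code points.

    Assumes well-formed input: every high surrogate is followed by a low one.
    """
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(u)
            i += 1
    return code_points
```

A character outside the Basic Multilingual Plane, such as U+1D11E, 
becomes two 16-bit units in one direction and a single 32-bit value in 
the other, so the element count of the two representations can differ.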

>The problem with unicode encodings is that most (I'm thinking of UTF-8
>and UTF-16) choose (correct me if I'm wrong here) a technique of
>surrogate pairs when trying to encode values that don't fit in a
>single word (7 bits for UTF-8 and 15 bits for UTF-16), which leads to a
>*variable* length of the coded output. And this is precisely the point:
>PyTables (like NumPy itself, or any other piece of software with
>efficiency in mind) would require a *fixed* space for keeping data, not
>a space that can be bigger or smaller depending on the number of
>surrogate pairs that should be used to encode a certain unicode string.
>
You are correct that encoding introduces a variable byte-length per 
character (up to 6 bytes for UTF-8 as originally designed, or 4 since 
RFC 3629, and up to two 16-bit units, i.e. 4 bytes, for UTF-16, I think).
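
As a quick check of the variable-length point, the following Python 3 
sketch counts the bytes each encoding needs for a single character.  
The helper name encoded_sizes and the sample characters are chosen just 
for this illustration.  The "-le" codec variants are used so that no 
byte-order mark is added to the count.

```python
# Encoded size in bytes of single characters under each encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def encoded_sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for one character."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-8 and UTF-16 vary from character to character, while the 32-bit 
form is always 4 bytes.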

I've seen databases handle this by warning the user to make sure the 
size of their data area is large enough to handle their longest use 
case.  You can still use fixed sizes; you just have to make sure they 
are large enough (or risk truncation).
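
A minimal sketch of that fixed-size approach in Python 3 might look 
like this.  The function store_fixed is hypothetical and not taken from 
any database or from NumPy.  It NUL-pads short values and truncates long 
ones, dropping a trailing character that the cut would have split.

```python
def store_fixed(text, nbytes, encoding="utf-8"):
    """Encode text into exactly nbytes, NUL-padded; truncates if too long."""
    raw = text.encode(encoding)[:nbytes]
    # If truncation split a multi-byte character, drop the broken tail.
    # The input was validly encoded, so only the tail can be incomplete.
    raw = raw.decode(encoding, errors="ignore").encode(encoding)
    return raw.ljust(nbytes, b"\x00")
```

With an 8-byte field, a 6-byte value is padded out; with a 2-byte field, 
a value starting with a 3-byte character loses that character entirely, 
which is the truncation risk mentioned above.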

>But, if what you are saying is that NumPy would adopt a 32-bit unicode
>type internally and then do the appropriate conversion to/from the
>Python interpreter, then this is perfect, because it is the NumPy
>buffer that will be written to and read from disk, not the Python
>object, and the buffer of such a NumPy object meets the requirements
>of an efficient buffer: fixed length *and* large enough to keep
>*every* Unicode character without a need to use encodings.
>
I see the value in such a buffer, I really do.  I'm just concerned about 
forcing everyone to use Python UCS4 builds.  That is way too 
stringent.   I'm afraid the only real solution is to implement a UCS2 
and a UCS4 data-type. 
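
For illustration only, a fixed-width 4-byte-per-character buffer of the 
kind being discussed can be sketched in Python 3 as below.  The names 
to_ucs4_field and from_ucs4_field are invented for this example, and 
little-endian byte order is an assumption; this is not how NumPy 
implements anything.

```python
def to_ucs4_field(text, nchars):
    """Store text in a fixed field of nchars code points, 4 bytes each."""
    raw = text[:nchars].encode("utf-32-le")  # truncate by character count
    return raw.ljust(4 * nchars, b"\x00")    # NUL-pad to the full width


def from_ucs4_field(field):
    """Recover the text, stripping the trailing NUL padding."""
    return field.decode("utf-32-le").rstrip("\x00")
```

Every field of nchars characters occupies exactly 4 * nchars bytes, 
whatever the characters are, so no surrogate handling is needed when 
the buffer is written to or read from disk.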

>Well, I don't quite follow here. I thought that you were proposing a
>32-bit unicode type for NumPy and then converting it appropriately to
>UCS2 (conversion to UCS4 wouldn't be necessary as it would be the same
>as the native NumPy unicode type) just in case the user requires a
>scalar out of the NumPy object. But you are talking here about defining
>separate UCS4 and UCS2 data-types. I admit that I'm lost here...
>
I suppose that is another approach:  we could internally have all 
UNICODE data-types use 4 bytes and do the necessary conversions.  But 
it would still require us to do most of the work of supporting two 
data-types.  Currently, the unicode scalar object simply inherits 
from Python's UNICODE data-type.  That would have to change, 
and the work to do that is most of the work to support two different 
data-types.  So, if we are going to go through that effort, I would 
rather see the result be two different Unicode data-types supported. 

-Travis