Martin v. Löwis wrote:
Travis E. Oliphant wrote:
In this case, the 'kind' does not specify how large the data-type is. You can have 'u1', 'u2', 'u4', etc.
The same is true with Unicode. You can have 10-character unicode elements, 20-character, etc. But, we have to be clear about what a "character" is in the data-format.
That is certainly confusing. In u1, u2, u4, the digit seems to indicate the size of a single value (1 byte, 2 bytes, 4 bytes). Right? Yet, in U20, it does *not* indicate the size of a single value but of an array? And then, it's not the size, but the number of elements?
Good point. In NumPy, unicode support was added "in parallel" with string arrays, where this ambiguity does not arise. So, yes, it's true that the unicode case is a special case.
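To make the difference concrete, here is a small sketch of how the two kinds of type code behave in NumPy. It assumes a reasonably recent NumPy; the number of bytes per unicode character is read from the 'U1' type rather than hard-coded, since it is an implementation detail.

```python
import numpy as np

# For the unsigned-integer kind, the digit is the size in bytes of one value.
unsigned = np.dtype('u2')
print(unsigned.kind, unsigned.itemsize)

# For byte strings, the number is a length, but one character is one byte,
# so "length" and "size in bytes" coincide and nothing looks odd.
byte_string = np.dtype('S20')
print(byte_string.kind, byte_string.itemsize)

# For unicode, the number is a count of characters, not a size in bytes.
text = np.dtype('U20')
bytes_per_char = np.dtype('U1').itemsize  # implementation detail, not assumed here
chars = text.itemsize // bytes_per_char
print(text.kind, text.itemsize, chars)
```

So 'u2' and 'U20' read the trailing number in two different ways, which is exactly the inconsistency pointed out above.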
The other way to handle it would be to describe the code-point size (i.e. 'U1', 'U2', 'U4' for UCS-1, UCS-2, UCS-4) and then encode the length as an "array" of those types.
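Roughly, that alternative could look like the sketch below. There is no 'U4' code-point type in NumPy, so '<u4' (a 4-byte unsigned integer) stands in for it here purely as an illustration, and the length is expressed with an ordinary sub-array data-type.

```python
import numpy as np

# Stand-in for a hypothetical 'U4' code-point type: one 4-byte value per code point.
code_point = np.dtype('<u4')

# The length becomes the shape of a sub-array of code points.
fixed_text = np.dtype((code_point, (20,)))
print(fixed_text.base, fixed_text.shape, fixed_text.itemsize)

# Viewing a UTF-32 encoded string as an array of code points.
codes = np.frombuffer("hi".encode("utf-32-le"), dtype=code_point)
print(codes.tolist())
```

In this form the digit in the type code always means "bytes per value", just as it does for 'u1', 'u2' and 'u4', and the element count lives in the shape.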
This was not the direction we took with NumPy (which is what I'm using as a reference) because I wanted Unicode and string arrays to look the same and thought of strings differently.
How to handle unicode data-formats could definitely be improved. Suggestions are welcome.