[Python-Dev] Help with Unicode arrays in NumPy
"Martin v. Löwis"
martin at v.loewis.de
Tue Feb 7 21:53:16 CET 2006
Travis E. Oliphant wrote:
> Numpy supports arrays of arbitrary fixed-length "records". It is
> much more than numeric-only data now. One of the fields that a
> record can contain is a string. If strings are supported, it makes
> sense to support unicode strings as well.
Hmm. How do you support strings in fixed-length records? Strings are
variable-sized, after all.
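(For context: NumPy answers this by giving each string field a fixed byte width inside the record dtype; shorter values are null-padded and longer ones truncated. A minimal sketch with the modern NumPy API - the field names are invented for illustration:

```python
import numpy as np

# A record dtype with a fixed-width 8-byte string field: shorter
# values are null-padded, longer ones are truncated to 8 bytes.
rec = np.dtype([("name", "S8"), ("value", np.int32)])
a = np.zeros(2, dtype=rec)
a[0] = (b"spam", 42)
print(a.itemsize)    # 12 = 8 string bytes + 4 int bytes
print(a[0]["name"])  # b'spam'
```

This is how a fixed-size record can contain a "string": the string is not variable-sized at all, but a fixed-width byte field.)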
One common application is that you have a C struct in some API which
has a fixed-size array for string data (either with a length field,
or null-terminated); in this case, it is moderately useful to model
such a struct in Python. However, transferring this to Unicode is
pointless - there aren't any similar Unicode structs that need
such support.
> This allows NumPy to memory-map arbitrary data-files on disk.
Ok, so this is the "C struct" case. Then why do you need Unicode
support there? Which common file format has embedded fixed-size
Unicode strings?
> Perhaps you should explain why you think NumPy "shouldn't support
> Unicode".
I think I said "Unicode arrays", not Unicode. Unicode arrays are
a pointless data type, IMO. Unicode always comes in strings
(i.e. variable sized, either null-terminated or with an introducing
length). On disk/on the wire, Unicode comes as UTF-8 more often
than not. Using UCS-2/UCS-4 as an on-disk representation is also
questionable practice (although admittedly Microsoft uses that a lot).
> That is currently what is done. The current unicode data-type is
> exactly what Python uses.
Then I wonder how this goes along with the use case "allow to
map arbitrary files".
> The chararray subclass gives to unicode and string arrays all the
> methods of unicode and strings (operating on an element-by-element
> basis).
For strings, I can see use cases (although I wonder how you deal
with data formats that also support variable-sized strings, as
most data formats supporting strings do).
> Please explain why having zero of them is *sufficient*.
Because I (still) cannot imagine any specific application that
might need such a feature (IOW: YAGNI - you aren't gonna need it).
>> If the purpose is to support arbitrary Unicode characters, it
>> should use 4 bytes (as two bytes are insufficient to represent
>> arbitrary Unicode characters).
> And Python does not support arbitrary Unicode characters on narrow
> builds? Then how is \U0010FFFF represented?
It's represented using UTF-16. Try this for yourself:
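On a narrow build, the single character U+10FFFF is stored as two 16-bit code units - a UTF-16 surrogate pair. The interpreter session from the original post is not reproduced here; a sketch in today's Python computing the same pair explicitly:

```python
# U+10FFFF does not fit in 16 bits; UTF-16 splits it into a
# high/low surrogate pair, which is what a narrow Python build
# stored internally (making len(u"\U0010FFFF") == 2 there).
ch = "\U0010FFFF"
units = ch.encode("utf-16-be")      # 4 bytes = two 16-bit code units
hi = int.from_bytes(units[:2], "big")
lo = int.from_bytes(units[2:], "big")
print(hex(hi), hex(lo))             # 0xdbff 0xdfff
```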
This has all kinds of non-obvious implications.
> The purpose is to represent bytes as they might exist in a file or
> data-stream according to the users specification.
See, and this is precisely the statement that I challenge. Sure,
they "might" exist - but I'd rather expect that they don't.
If they exist, "Unicode" might come as variable-sized UTF-8, UTF-16,
or UTF-32. In any of these cases, NumPy should already support that
by mapping a string object onto the encoded bytes, to which you can
then apply .decode() should you need to process the actual Unicode
data.
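The work-around described here can be sketched as follows (the array contents and field width are invented for illustration):

```python
import numpy as np

# Keep the on-disk/on-wire bytes in a fixed-width byte-string array,
# and decode an element only when actual Unicode processing is needed.
raw = np.array([b"caf\xc3\xa9", b"na\xc3\xafve"], dtype="S8")
texts = [s.decode("utf-8") for s in raw]
print(texts)  # ['café', 'naïve']
```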
> The purpose is
> whatever the user wants them for. It's the same purpose as having an
> unsigned 64-bit data-type --- because users may need it to represent
> data as it exists in a file.
No. I would expect you have 64-bit longs because users *do* need them,
and because there wouldn't be an easy work-around if users didn't have
them. For Unicode, it's different: users don't directly need them
(at least not many users), and if they do, there is an easy work-around
for their absence.
Say I want to process NTFS run lists. In NTFS run lists, there are
24-bit integers, 40-bit integers, and 4-bit integers (i.e. nibbles).
Can I represent them all in NumPy? Can I have NumPy transparently
map a sequence of run list records (which are variable-sized)
as an array of run list records?
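(For comparison: NumPy has no native 24-bit dtype, so such an integer has to be assembled from raw bytes by hand. A sketch using the standard struct module - the byte values are invented:

```python
import struct

# No 24-bit dtype exists, so widen three raw little-endian bytes
# to 32 bits by appending a zero byte, then unpack normally.
raw = b"\x01\x02\x03"
(value,) = struct.unpack("<I", raw + b"\x00")
print(hex(value))  # 0x30201
```

This is the kind of manual decoding that odd-sized on-disk formats force anyway, with or without a matching array dtype.)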