[Numpy-discussion] proposal: smaller representation of string arrays
Chris Barker
chris.barker at noaa.gov
Tue Apr 25 12:52:06 EDT 2017
OK -- onto proposals:
> 1) The default behaviour for numpy arrays of strings is compatible with
> Python3's string model: i.e. fully unicode supporting, and with a character
> oriented interface. i.e. if you do::
>
> arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or less
> characters.
>
> and arr[1] will return a native Python3 string object.
>
> This is the use-case for "casual" numpy users -- not the folks writing
> H5py and the like, or the ones writing Cython bindings to C++ libs.
>
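For concreteness, here is a minimal sketch of what that proposal looks like with the existing 'U' dtype. The variable names are just for illustration. Note that indexing gives back a `np.str_`, which is a subclass of `str` rather than a plain `str`, and that assigning a longer string silently truncates it to the array's width.

```python
import numpy as np

arr = np.array(("this", "that"))

kind = arr.dtype.kind          # 'U' means fixed-width unicode
itemsize = arr.dtype.itemsize  # 4 characters * 4 bytes each (UCS-4)
elem = arr[1]                  # a np.str_, which subclasses str

# Any unicode string of up to 4 characters fits, not just ASCII.
arr2 = arr.copy()
arr2[0] = "\u65e5\u672c\u8a9e!"   # three CJK characters plus '!'

# Longer strings are silently cut to the fixed width.
arr2[1] = "toolong"

print(kind, itemsize, type(elem).__name__)
print(arr2)
```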
I see two options here:
a) The current 'U' dtype -- it fully meets the spec, and is already there.
b) Having a pointer-to-a-Python-string dtype:
 - I take it that's what Pandas does, and people seem happy with it.
 - That would get us variable-length strings, and potentially other nifty
   string processing.
 - It would lose the ability to interact at the binary level with other
   systems -- but do any other systems use UCS-4 anyway?
 - How would it work with pickle and numpy's zipped (.npz) storage?
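As a rough stand-in for option (b), the existing `object` dtype already stores pointers to Python strings, so it can be used to sketch the trade-offs. This is not the proposed dtype itself, only an approximation of its behaviour, and the round trips below rely on pickling the objects, which `np.load` only permits when `allow_pickle=True` is passed.

```python
import io
import pickle

import numpy as np

fixed = np.array(["this", "that"])                # option (a): 'U4'
boxed = np.array(["this", "that"], dtype=object)  # stand-in for option (b)

# Variable length: the object array keeps the whole string,
# the fixed-width array truncates it.
long_text = "a much longer string"
boxed[0] = long_text
fixed[0] = long_text

# Elements of the object array are plain Python str objects.
elem_type = type(boxed[1])

# pickle round trip works for the object array.
unpickled = pickle.loads(pickle.dumps(boxed))

# .npz round trip: the strings are stored via pickle inside the archive.
buf = io.BytesIO()
np.savez(buf, boxed=boxed)
buf.seek(0)
with np.load(buf, allow_pickle=True) as npz:
    reloaded = npz["boxed"]

print(fixed[0], "|", boxed[0])
print(fixed.dtype.itemsize, boxed.dtype.itemsize)  # bytes per element
```

The object array gives up a fixed binary layout: each element is just a pointer, so the buffer cannot be handed to another system as raw string data.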
Personally, I'm fine with (a), but (b) seems like it could be a nice
addition. As the 'U' type already exists, the choice to add a python-string
type is really orthogonal to the rest of this discussion.
Note that I think using utf-8 internally to fit this need is a mistake -- it
does not match well with the Python string model.
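A small sketch of that mismatch: in utf-8 the number of bytes per character varies, so a fixed number of bytes does not correspond to a fixed number of characters. The 'S4' array below is only a hypothetical stand-in for a fixed-byte-width utf-8 dtype, not an existing or proposed numpy feature.

```python
import numpy as np

# "na\u00efve" uses a precomposed i-diaeresis; the second word is three CJK characters.
words = ["na\u00efve", "\u65e5\u672c\u8a9e", "abc"]

n_chars = [len(w) for w in words]                   # what Python's str model counts
n_utf8 = [len(w.encode("utf-8")) for w in words]    # varies per character
n_ucs4 = [len(w.encode("utf-32-le")) for w in words]  # always 4 bytes per character

# Truncating utf-8 at a fixed byte width can split a character in half.
raw = np.array([words[1].encode("utf-8")], dtype="S4")
try:
    raw[0].decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(n_chars, n_utf8, n_ucs4, decode_failed)
```

With the 'U' dtype the width is counted in characters, so truncation always lands on a character boundary.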
That's it for use-case (1)
-CHB
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov