[Numpy-discussion] String type again.

Nathaniel Smith njs at pobox.com
Sat Jul 12 20:02:37 EDT 2014


On 12 Jul 2014 23:06, "Charles R Harris" <charlesr.harris at gmail.com> wrote:
>
> As previous posts have pointed out, Numpy's `S` type is currently treated
> as a byte string, which leads to more complicated code in Python 3. OTOH,
> the unicode type is stored as UCS4, which consumes a lot of space,
> especially for ASCII strings. This note proposes to adapt the currently
> existing 'a' type letter, currently aliased to 'S', as a new fixed encoding
> dtype. Python 3.3 introduced two one-byte internal representations for
> unicode strings, ASCII and latin1. ASCII has the advantage that it is a
> subset of UTF-8, whereas latin1 has a few more symbols. Another possibility
> is to just make it a UTF-8 encoding, but I think this would involve more
> overhead as Python would need to determine the maximum character size.
> These are just preliminary thoughts; comments are welcome.

I feel like for most purposes, what we *really* want is a variable-length
string dtype (i.e., where each element can be a different length). Pandas
pays quite some price in overhead to fake this right now. Adding such a
thing will cause some problems regarding compatibility (what to do with
array(["foo"])) and education, but I think it's worth it in the long run. A
variable-length string with out-of-band storage would also allow for a lot
of py3.3-style storage tricks if we want them.
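
To make the storage argument concrete, here is a rough sketch in plain Python
(no NumPy needed). It compares the payload size of a fixed-width UCS4 layout,
where every element is padded to the longest string, against a per-element
layout that uses each string's own length and the narrowest of 1, 2, or 4
bytes per code point, in the spirit of the py3.3 representation. The function
names are made up for illustration, and the second figure ignores the
per-element pointer and length bookkeeping that out-of-band storage would add.

```python
def ucs4_fixed_bytes(strings):
    """Payload bytes if every element is padded to the longest string
    and stored at 4 bytes per code point (fixed-width UCS4)."""
    width = max(len(s) for s in strings)
    return 4 * width * len(strings)


def flexible_bytes(strings):
    """Payload bytes if each element keeps its own length and uses the
    narrowest of 1, 2, or 4 bytes per code point that fits its contents.
    Per-element pointer/length overhead is deliberately not counted."""
    total = 0
    for s in strings:
        top = max(map(ord, s), default=0)  # highest code point in s
        unit = 1 if top < 256 else 2 if top < 65536 else 4
        total += unit * len(s)
    return total


words = ["foo", "a", "numpy-discussion"]
print(ucs4_fixed_bytes(words))  # 4 bytes * 16 code points * 3 elements = 192
print(flexible_bytes(words))    # 3 + 1 + 16 = 20
```

The gap grows with the spread in string lengths, since the fixed-width layout
always pays for the longest element.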

Given that, though, I'm a little dubious about adding a third fixed length
string type, since it seems like it might be a temporary patch, yet raises
the prospect of having to indefinitely support *5* distinct string types (3
of which will map to py3 str)...

OTOH, fixed-length NUL-padded latin1 would be useful for various flat-file
reading tasks.
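
As an illustration of that flat-file case, here is a small plain-Python
sketch that splits one fixed-width record into NUL-padded latin1 fields. The
helper name, the field widths, and the sample record are all invented for the
example; it is not a proposal for how a dtype would implement this.

```python
def decode_fixed_latin1(record, widths):
    """Split a bytes record into fixed-width fields, strip the trailing
    NUL padding from each field, and decode it as latin1."""
    fields = []
    offset = 0
    for width in widths:
        raw = record[offset:offset + width]
        fields.append(raw.rstrip(b"\x00").decode("latin-1"))
        offset += width
    return fields


# One record with an 8-byte field followed by a 4-byte field.
record = b"caf\xe9\x00\x00\x00\x00" + b"ok\x00\x00"
print(decode_fixed_latin1(record, [8, 4]))  # ['caf\xe9', 'ok'], i.e. 'café' and 'ok'
```

Because latin1 maps every byte value to exactly one code point, the decode
step can never fail, and each field takes one byte per character on disk.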

-n