[Numpy-discussion] String type again.

Olivier Grisel olivier.grisel at ensta.org
Mon Jul 14 13:00:45 EDT 2014


2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray at mac.com>:
>
> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> I feel like for most purposes, what we *really* want is a variable length
>> string dtype (i.e., where each element can be a different length).
>
>
>
> I've been toying with the idea of creating an array type for interned
> strings.  In many applications dealing with large arrays of variable size
> strings, the strings come from a relatively short set of names.  Arrays of
> interned strings can be manipulated very efficiently because in many respects
> they are just like arrays of integers.

+1. I think this is why pandas uses dtype=object to load string
data: in many cases short string values represent categorical
variables with few distinct values, in a dataset with a comparatively
large number of records.
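
As a rough illustration of that categorical case (a sketch only: the
`labels` data is made up, and this does not describe what pandas does
internally), `np.unique` with `return_inverse=True` maps each string to
a small integer code:

```python
import numpy as np

# Hypothetical column of short string labels with few distinct values.
labels = np.array(["red", "green", "red", "blue", "green", "red"],
                  dtype=object)

# categories: sorted distinct values; codes: index of each element
# of `labels` into `categories`.
categories, codes = np.unique(labels, return_inverse=True)

# Comparisons can now run on integers instead of Python strings.
red_code = categories.tolist().index("red")
is_red = codes == red_code

# The original labels can be recovered from the codes.
restored = categories[codes]
```

The `codes` array carries the integer semantics, while `categories`
keeps a single copy of each distinct string.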

In that case dtype=object is not that bad, as it just stores
pointers to string objects managed by Python. It's possible to intern
the strings manually at load time (I don't know whether pandas or
Python already does this automatically in that case). Integer
semantics are a good fit for that case. Having an explicit dtype
might be even better.
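
A minimal sketch of manual interning at load time, assuming CPython's
`sys.intern`; the helper name `load_interned` and the sample data are
hypothetical:

```python
import sys

import numpy as np


def load_interned(values):
    """Return an object array in which equal strings share one object."""
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        # sys.intern returns the canonical copy of an equal string.
        out[i] = sys.intern(value)
    return out


# Build the strings at run time, as a file reader would.
raw = ["".join(["ab", "c"]) for _ in range(3)] + ["".join(["x", "yz"])]
arr = load_interned(raw)

# Number of distinct string objects referenced by the array.
n_objects = len({id(s) for s in arr})
```

After this, equal entries in `arr` refer to the same Python object, so
the array holds one pointer per element but only one string per
distinct value.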

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel


