[Numpy-discussion] String type again.
Jeff Reback
jeffreback at gmail.com
Tue Jul 15 06:56:11 EDT 2014
In 0.15.0, pandas will have full-fledged support for categoricals, which in effect allow you to map a small set of distinct strings to integers.
This is now in pandas master:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html
Feedback welcome!
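
As a rough illustration of the string-to-integer mapping, here is a minimal sketch. It assumes a pandas version that provides pd.Categorical with the .categories and .codes attributes, and the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: many records, few distinct string values.
names = ["red", "green", "red", "blue", "green", "red"]

cat = pd.Categorical(names)

# The distinct strings are stored once...
print(list(cat.categories))
# ...and each record holds only a small integer code pointing at one of them.
print(list(cat.codes))

# The original strings can be recovered from the codes.
restored = list(cat.categories.take(cat.codes))
print(restored == names)
```

The codes index into the list of categories, so comparisons and grouping can operate on integers rather than on the strings themselves.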
> On Jul 14, 2014, at 1:00 PM, Olivier Grisel <olivier.grisel at ensta.org> wrote:
>
> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray at mac.com>:
>>
>>> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>
>>> I feel like for most purposes, what we *really* want is a variable length
>>> string dtype (i.e., where each element can be a different length).
>>
>>
>>
>> I've been toying with the idea of creating an array type for interned
>> strings. In many applications dealing with large arrays of variable size
>> strings, the strings come from a relatively short set of names. Arrays of
>> interned strings can be manipulated very efficiently because in many respects
>> they are just like arrays of integers.
>
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
>
> In that case the dtype=object is not that bad as it just stores
> pointers to string objects managed by Python. It's possible to intern
> the strings manually at load time (I don't know if pandas or python
> already do it automatically in that case). The integer semantics is
> good for that case. Having an explicit dtype might be even better.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
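
For the manual interning at load time mentioned in the quoted message, a minimal sketch might look like the following. It assumes Python 3 (where the function is sys.intern) and a NumPy object array; the sample strings are invented, and this is not a claim about what pandas does internally.

```python
import sys
import numpy as np

# Hypothetical loaded values: equal strings built at runtime,
# so they are not guaranteed to share one object in memory.
raw = ["".join(["al", "pha"]), "".join(["alp", "ha"]), "beta"]

# Intern each string while filling the object array, so equal
# values all refer to the same Python string object.
arr = np.array([sys.intern(s) for s in raw], dtype=object)

# Equal strings now share identity, so the array holds repeated
# pointers to a small set of objects.
same_object = arr[0] is arr[1]
distinct_objects = len({id(s) for s in arr})
print(same_object, distinct_objects)
```

After interning, equality between elements can in principle be decided by comparing pointers, which is the integer-like behaviour described in the quoted text.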