[Numpy-discussion] String type again.

Jeff Reback jeffreback at gmail.com
Tue Jul 15 06:56:11 EDT 2014


In 0.15.0, pandas will have full-fledged support for categoricals, which in effect allow you to map a small number of distinct strings to integers.

This is now in pandas master:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html
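
As a rough sketch of the idea, assuming a pandas version with categorical support (the sample data and variable names here are made up for illustration), a column with many records but few distinct strings can be stored as small integer codes plus a lookup table of the distinct values:

```python
import pandas as pd

# Hypothetical sample data: many records, few distinct string values.
colors = ["red", "green", "red", "blue", "green", "red"]

# dtype=object keeps one pointer to a Python str per element.
as_object = pd.Series(colors, dtype=object)

# A categorical keeps each distinct string once, plus integer codes.
as_category = pd.Series(colors, dtype="category")

# The distinct strings (sorted, since these values are orderable).
categories = list(as_category.cat.categories)

# One small integer per element, indexing into `categories`.
codes = list(as_category.cat.codes)

# Decoding the codes recovers the original strings.
decoded = [categories[c] for c in codes]

print(categories)
print(codes)
print(decoded == colors)
```

The integer codes are what make comparisons and groupings cheap; the strings themselves are only touched when values are displayed or decoded.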

Feedback welcome!

> On Jul 14, 2014, at 1:00 PM, Olivier Grisel <olivier.grisel at ensta.org> wrote:
> 
> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndarray at mac.com>:
>> 
>>> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>> 
>>> I feel like for most purposes, what we *really* want is a variable length
>>> string dtype (i.e., where each element can be a different length).
>> 
>> 
>> 
>> I've been toying with the idea of creating an array type for interned
>> strings.  In many applications dealing with large arrays of variable size
>> strings, the strings come from a relatively short set of names.  Arrays of
>> interned strings can be manipulated very efficiently because in many respects
>> they are just like arrays of integers.
> 
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
> 
> In that case the dtype=object is not that bad as it just stores
> pointers to string objects managed by Python. It's possible to intern
> the strings manually at load time (I don't know if pandas or python
> already do it automatically in that case). The integer semantics is
> good for that case. Having an explicit dtype might be even better.
> 
> -- 
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


