[Numpy-discussion] proposal: smaller representation of string arrays

Mon Apr 24 14:36:15 EDT 2017

On Mon, Apr 24, 2017 at 11:21 AM, Chris Barker <chris.barker at noaa.gov>
wrote:
>
> On Mon, Apr 24, 2017 at 10:51 AM, Aldcroft, Thomas <
aldcroft at head.cfa.harvard.edu> wrote:
>>>
>>> BTW -- maybe we should keep the pathological use-case in mind: really
short strings. I think we are all thinking in terms of longer strings,
maybe a name field, where you might assign 32 bytes or so -- then someone
has an accented character in their name, and then gets 30 or 31 characters --
no big deal.
>>
>>
>> I wouldn't call it a pathological use case, it doesn't seem so uncommon
to have large datasets of short strings.
>
> It's pathological for using a variable-length encoding.
>
>> I personally deal with a database of hundreds of billions of 2 to 5
character ASCII strings.  This has been a significant blocker to Python 3
adoption in my world.
>
> I agree -- it is a VERY common case for scientific data sets. But a
one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
Unicode. The wasted space is not that big a deal with short strings...

Unless you have hundreds of billions of them.
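
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a fixed field width of 5 characters and a count of 100 billion strings (my stand-in for "hundreds of billions"), and it uses Python's "latin-1" and "utf-32-le" codecs as proxies for a one-byte-per-char encoding and UCS-4. The names N_STRINGS, WIDTH and bytes_per_field are made up for illustration; this is not a measurement of any real dataset.

```python
# Storage cost of fixed-width 5-character ASCII fields under two encodings.
# N_STRINGS and WIDTH are illustrative assumptions, not real measurements.
N_STRINGS = 100_000_000_000  # assumed count of strings
WIDTH = 5                    # assumed characters per fixed-width field


def bytes_per_field(width, encoding):
    """Return the bytes needed to store `width` ASCII characters."""
    return len(("a" * width).encode(encoding))


latin1_bytes = bytes_per_field(WIDTH, "latin-1")  # one byte per character
ucs4_bytes = bytes_per_field(WIDTH, "utf-32-le")  # four bytes per character, no BOM

# Total storage in gigabytes (10**9 bytes) for the whole collection.
latin1_total_gb = N_STRINGS * latin1_bytes / 1e9
ucs4_total_gb = N_STRINGS * ucs4_bytes / 1e9

print(f"one byte per char: {latin1_bytes} bytes/field, {latin1_total_gb:.0f} GB total")
print(f"UCS-4:             {ucs4_bytes} bytes/field, {ucs4_total_gb:.0f} GB total")
```

Under those assumptions the UCS-4 layout costs four times as much, and at this scale the difference is measured in terabytes rather than in a few wasted bytes per record.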

>> BTW, for those new to the list or with a short memory, this topic has
been discussed fairly extensively at least 3 times before.  Hopefully the
*fourth* time will be the charm!
>
> yes, let's hope so!
>
> The big difference now is that Julian seems to be committed to actually
making it happen!
>
> Thanks Julian!
>
> Which brings up a good point -- if you need us to stop the damn
bike-shedding so you can get it done -- say so.
>
> I have strong opinions, but would still rather see any of the ideas on
the table implemented than nothing.

FWIW, I prefer nothing to just adding a special case for latin-1. Solve the
HDF5 problem (i.e. fixed-length UTF-8 strings) or leave it be until someone
else is willing to solve that problem. I don't think we're at the
bikeshedding stage yet; we're still disagreeing about fundamental
requirements.
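
To make "fixed-length UTF-8 strings" concrete, here is a minimal sketch of what a fixed byte budget means for character counts. The helper fit_utf8 is hypothetical, written for this message only; it is not a NumPy or HDF5 API. It assumes the field is sized in bytes and that truncation must never split a multi-byte character.

```python
def fit_utf8(text, nbytes):
    """Encode `text` as UTF-8 in at most `nbytes` bytes.

    Hypothetical helper for illustration: truncates on a character
    boundary so the stored bytes always decode cleanly.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= nbytes:
        return encoded
    # Slicing may cut a multi-byte sequence in half; decoding with
    # errors="ignore" drops that trailing partial sequence.
    return encoded[:nbytes].decode("utf-8", errors="ignore").encode("utf-8")


# A 32-character name with one accented character needs 33 bytes in UTF-8,
# so a 32-byte field holds only 31 of its characters.
name = "\u00e9" + "a" * 31
stored = fit_utf8(name, 32)
print(len(name), len(name.encode("utf-8")), len(stored), len(stored.decode("utf-8")))
```

This is the "30 or 31 characters" effect Chris describes: the capacity in characters depends on the content, which is the price of a variable-length encoding in a fixed-size field.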

--
Robert Kern