[Numpy-discussion] proposal: smaller representation of string arrays

Aldcroft, Thomas aldcroft at head.cfa.harvard.edu
Mon Apr 24 13:51:55 EDT 2017


On Mon, Apr 24, 2017 at 1:04 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
>
>> In this case, we want something compatible with Python's string (i.e.
>>> full Unicode supporting) and I think should be as transparent as possible.
>>> Python's string has made the decision to present a character oriented API
>>> to users (despite what the manifesto says...).
>>>
>>
>> Yes, but NumPy doesn't really implement string operations, so fortunately
>> this is pretty irrelevant to us -- except for our API for specifying dtype
>> size.
>>
>
> Exactly -- the character-orientation of python strings means that people
> are used to thinking that strings have a length that is the number of
> characters in the string. I think there will be cognitive dissonance if
> someone does:
>
> arr[i] = a_string
>
> Which then raises a ValueError, something like:
>
> String too long for a string[12] dtype array.
>
> When len(a_string) <= 12
>
> AND that will only occur if there are non-ascii characters in the string,
> and maybe only if there are more than N non-ascii characters. i.e. it is
> very likely to be a run-time error that may not have shown up in tests.
>
> So folks need to do something like:
>
> len(a_string.encode('utf-8')) to see if their string will fit. If not,
> they need to truncate it, and THAT is non-obvious how to do, too -- you
> don't want to truncate the encoded bytes naively, you could end up with an
> invalid bytestring. But you don't know how many characters to truncate,
> either.
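
As a rough illustration of the fit check and the truncation problem described
above, here is a minimal Python sketch. The helper names fits_utf8 and
truncate_utf8 are hypothetical (they are not part of NumPy or of any proposed
API), and the sketch assumes the field width is counted in UTF-8 bytes.

```python
def fits_utf8(text, nbytes):
    """Return True if text encodes to at most nbytes UTF-8 bytes."""
    return len(text.encode("utf-8")) <= nbytes


def truncate_utf8(text, nbytes):
    """Cut text to at most nbytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:nbytes]
    # The input was valid UTF-8, so only the cut end can be invalid;
    # errors="ignore" drops an incomplete multi-byte sequence left there.
    return encoded.decode("utf-8", errors="ignore")


# 12 characters, 3 of them non-ASCII (2 bytes each in UTF-8)
a_string = "r\u00e9sum\u00e9-na\u00efve"
n_chars = len(a_string)
n_bytes = len(a_string.encode("utf-8"))
fits = fits_utf8(a_string, 12)
shortened = truncate_utf8(a_string, 12)
```

Here the string has 12 characters but does not fit in 12 bytes, and the
byte-level cut lands in the middle of a two-byte character, so that character
is dropped as well. With a very narrow field the same helper can return an
empty string, which is the short-label problem raised further down.
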
>
>
>> We already have strong precedent for dtypes reflecting number of bytes
>> used for storage even when Python doesn't: consider numeric types like
>> int64 and float32 compared to the Python equivalents. It's an intrinsic
>> aspect of NumPy that users need to think about how their data is actually
>> stored.
>>
>
> sure, but a float64 is 64 bits forever and always, and the defaults
> perfectly match what python is doing under the hood -- even if users don't
> think about it. So the default behaviour of numpy matches python's built-in
> types.
>
>
> Storage cost is always going to be a concern. Arguably, it's even more of
>>> a concern today than it used to be, because compute has been improving
>>> faster than storage.
>>>
>>
> sure -- but again, what is the use-case for numpy arrays with a s#$)load
> of text in them? common? I don't think so. And as you pointed out numpy
> doesn't do text processing anyway, so cache performance and all that are
> not important. So having UCS-4 as the default, but allowing folks to select
> a more compact format if they really need it is a good way to go. Just like
> numpy generally defaults to float64 and int64 (or 32, depending on
> platform) -- users can select a smaller size if they have a reason to.
>
> I guess that's my summary -- just like with numeric values, numpy should
> default to Python-like behavior as much as possible for strings, too --
> with an option for a knowledgeable user to do something more performant.
>
>
>> I still don't understand why a latin encoding makes sense as a preferred
>> one-byte-per-char dtype. The world, including Python 3, has standardized on
>> UTF-8, which is also one-byte-per-char for (ASCII) scientific data.
>>
>
> utf-8 is NOT a one-byte per char encoding. IF you want to assure that your
> data are one-byte per char, then you could use ASCII, and it would be
> binary compatible with utf-8, but not sure what the point of that is in
> this context.
>
> latin-1 or latin-9 buys you (over ASCII):
>
> - A bunch of accented characters -- sure it only covers the latin
> languages, but does cover those much better.
>
> - A handful of other characters, including scientifically useful ones. (a
> few greek characters, the degree symbol, etc...)
>
> - round-tripping of binary data (at least with Python's encoding/decoding)
> -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the
> same bytes back. You may get garbage, but you won't get an EncodingError.
>

+1.  The key point is that there is a HUGE amount of legacy science data in
the form of FITS (astronomy-specific binary file format that has been the
primary file format for 20+ years) and HDF5 which uses a character data
type to store data which can be bytes 0-255.  Getting a decoding/encoding
error when trying to deal with these datasets is a non-starter from my
perspective.
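
A small sketch of the round-tripping property, using only Python's built-in
codecs. The byte string here is synthetic (every value 0-255), standing in for
what a legacy character field might contain; it is not read from a real FITS
or HDF5 file.

```python
# Every possible byte value, as a stand-in for arbitrary legacy data
raw = bytes(range(256))

# latin-1 maps each byte to the code point with the same value,
# so decoding never fails and re-encoding returns the original bytes
text = raw.decode("latin-1")
round_trip_ok = text.encode("latin-1") == raw

# Strict UTF-8 decoding rejects byte sequences that are not valid UTF-8
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The decoded text may be garbage if the bytes were not really latin-1, but the
original bytes are always recoverable.
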


>
> For Python use -- a pointer to a Python string would be nice.
>>>
>>
>> Yes, absolutely. If we want to be really fancy, we could consider a
>> parametric object dtype that allows for object arrays of *any* homogeneous
>> Python type. Even if NumPy itself doesn't do anything with that
>> information, there are lots of use cases for that information.
>>
>
> hmm -- that's a nifty idea -- though I think strings could/should be special
> cased.
>
>
>> Then use a native flexible-encoding dtype for everything else.
>>>
>>
>> No opposition here from me. Though again, I think utf-8 alone would also
>> be enough.
>>
>
> maybe so -- the major reason for supporting others is binary data exchange
> with other libraries -- but maybe most of them have gone to utf-8 anyway.
>
> One more note: if a user tries to assign a value to a numpy string array
>>> that doesn't fit, they should get an error:
>>>
>>
>>> EncodingError if it can't be encoded into the defined encoding.
>>>
>>> ValueError if it is too long -- it should not be silently truncated.
>>>
>>
>> I think we all agree here.
>>
>
> I'm actually having second thoughts -- see above -- if the encoding is
> utf-8, then truncating is non-trivial -- maybe it would be better for numpy
> to do it for you. Or set a flag as to which you want?
>
> The current 'S' dtype truncates silently already:
>
> In [6]: arr
>
> Out[6]:
> array(['this', 'that'],
>       dtype='|S4')
>
> In [7]: arr[0] = "a longer string"
>
> In [8]: arr
>
> Out[8]:
> array(['a lo', 'that'],
>       dtype='|S4')
>
> (similarly for the unicode type)
>
> So at least we are used to that.
>
> BTW -- maybe we should keep the pathological use-case in mind: really
> short strings. I think we are all thinking in terms of longer strings,
> maybe a name field, where you might assign 32 bytes or so -- then someone
> has an accented character in their name, and then they get 30 or 31 characters --
> no big deal.
>

I wouldn't call it a pathological use case; it doesn't seem so uncommon to
have large datasets of short strings.  I personally deal with a database of
hundreds of billions of 2 to 5 character ASCII strings.  This has been a
significant blocker to Python 3 adoption in my world.
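
To put rough numbers on the storage cost: a one-byte-per-character layout
(like the 'S' dtype) versus a four-byte-per-character UCS-4 layout (like the
'U' dtype). This is a back-of-the-envelope sketch that uses the standard
library codecs as a stand-in for the two layouts; the function name, the
5-character width, and the item count of one billion are illustrative
assumptions only.

```python
def fixed_width_nbytes(n_items, width, bytes_per_char):
    """Bytes needed for n_items fixed-width strings of `width` characters."""
    return n_items * width * bytes_per_char


sample = "ABCDE"
one_byte = len(sample.encode("latin-1"))     # 1 byte per character
four_byte = len(sample.encode("utf-32-le"))  # 4 bytes per character, no BOM

n = 10**9  # illustrative item count
size_one_byte = fixed_width_nbytes(n, 5, 1)
size_four_byte = fixed_width_nbytes(n, 5, 4)
ratio = size_four_byte // size_one_byte
```

For pure ASCII data the four-byte layout costs four times the storage, which
adds up quickly at this scale.
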

BTW, for those new to the list or with a short memory, this topic has been
discussed fairly extensively at least 3 times before.  Hopefully the
*fourth* time will be the charm!

https://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
https://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
https://mail.scipy.org/pipermail/numpy-discussion/2015-February/072311.html

- Tom


>

>
> But what if you have a simple label or something with 1 or two characters:
> Then you have 2 bytes to store the name in, and someone tries to put an
> "odd" character in there, and you get an empty string. not good.
>

> Also -- if utf-8 is the default -- what do you get when you create an
> array from a python string sequence? Currently with the 'S' and 'U' dtypes,
> the dtype is set to the longest string passed in. Are we going to pad it a
> bit? stick with the exact number of bytes?
>
> It all comes down to this:
>
> Python3 has made a very deliberate (and I think Good) choice to treat text
> as a string of characters, where the user does not need to know or care
> about encoding issues. Numpy's defaults should do the same thing.
>
> -CHB
>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov