[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Mon Apr 24 19:23:37 EDT 2017


On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>
> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <chris.barker at noaa.gov> wrote:
>>>
>>> On the other hand, if this is the use-case, perhaps we really want an
>>> encoding closer to "Python 2" string, i.e., "unknown", to let this be
>>> signaled more explicitly. I would suggest that "text[unknown]" should
>>> support operations like a string if it can be decoded as ASCII, and
>>> otherwise error. But unlike "text[ascii]", it will let you store
>>> arbitrary bytes.
>>
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
>> really is ascii, then it's perfect. If it really is latin-*, then you get
>> some extra useful stuff, and if it's corrupted somehow, you still get the
>> ascii text correct, and the rest won't barf and can be passed on through.
>
> I am totally in agreement with Thomas that "We are living in a messy
> world right now with messy legacy datasets that have character type data
> that are *mostly* ASCII, but not infrequently contain non-ASCII
> characters."
>
> My question: What are those non-ASCII characters? How often are they
> truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I don't know that we can do that accounting in any relevant way. Would we
count the number of such characters per byte of text? The number of files
with such characters out of all existing files?

What I can say with assurance is that every time I have decided, as a
developer, to write code that just hardcodes latin-1 for such cases, I have
regretted it. While it's just personal anecdote, I think it's at least
measuring the right thing. :-)
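
To make the trade-off in the quoted proposals concrete, here is a minimal
sketch in plain Python (not NumPy). It assumes the mystery bytes are really
UTF-8, which is only one of the possibilities raised above, and
decode_unknown is a hypothetical helper that imitates the suggested
"text[unknown]" behaviour; it is not an existing or proposed API.

```python
# Bytes as they might appear in a legacy file. Assumption for this
# sketch: the data is really UTF-8 encoded "café".
raw = "café".encode("utf-8")  # b'caf\xc3\xa9'

# latin-1 maps every byte value 0..255 to a code point, so decoding
# never raises, and re-encoding gives back the original bytes.
as_latin1 = raw.decode("latin-1")
roundtrip = as_latin1.encode("latin-1")

# The ASCII part survives, but the non-ASCII part is silently wrong:
# two characters (U+00C3, U+00A9) instead of the single "é".
ascii_part_ok = as_latin1.startswith("caf")
text_is_wrong = as_latin1 != "café"


def decode_unknown(data: bytes) -> str:
    """Hypothetical stand-in for "text[unknown]": treat the bytes as
    text only if they are pure ASCII, otherwise raise."""
    # bytes.decode("ascii") raises UnicodeDecodeError on any byte >= 0x80.
    return data.decode("ascii")
```

With latin-1 the bytes pass through untouched but the mistake stays hidden
until someone looks at the text; with the ASCII-or-error behaviour,
decode_unknown(raw) raises UnicodeDecodeError at the point of use, while the
raw bytes can still be stored.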

--
Robert Kern