[Numpy-discussion] proposal: smaller representation of string arrays
chris.barker at noaa.gov
Tue Apr 25 12:34:46 EDT 2017
This is essentially my rant about use-case (2):
A compact dtype for mostly-ascii text:
On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <chris.barker at noaa.gov>
>> On the other hand, if this is the use-case, perhaps we really want an
>>> encoding closer to "Python 2" string, i.e, "unknown", to let this be
>>> signaled more explicitly. I would suggest that "text[unknown]" should
>>> support operations like a string if it can be decoded as ASCII, and
>>> otherwise error. But unlike "text[ascii]", it will let you store arbitrary
>> I _think_ that is what using latin-1 (Or latin-9) gets you -- if it
>> really is ascii, then it's perfect. If it really is latin-*, then you get
>> some extra useful stuff, and if it's corrupted somehow, you still get the
>> ascii text correct, and the rest won't barf and can be passed on through.
> I am totally in agreement with Thomas that "We are living in a messy
> world right now with messy legacy datasets that have character type data
> that are *mostly* ASCII, but not infrequently contain non-ASCII characters."
> My question: What are those non-ASCII characters? How often are they truly
> latin-1/9 vs. some other text encoding vs. non-string binary data?
I am totally euro-centric, but as I understand it, that is the whole point
of the desire for a compact one-byte-per-character encoding. If there is a
strong need for other one-byte encodings (shift-JIS, maybe?) then maybe we
should support those too. But this all started with "mostly ascii". My take
on that:
We don't want to use pure ASCII -- that is the hell that python2's default
encoding approach led to -- it is MUCH better to pass garbage through than
to crash out with a UnicodeDecodeError -- data are messy, and people are
really bad at writing comprehensive tests.
So we need something that handles ASCII properly, and can pass through
arbitrary bytes as well without crashing. Options are:
* ASCII with errors='ignore' or 'replace'

I think that is a very bad idea -- it is tossing away information that
_may_ have some use elsewhere::

    s = arr[i]
    arr[i] = s

should put the same bytes back into the array.
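To make the round-trip problem concrete, here is a minimal sketch with
plain Python bytes/str (no numpy needed to see the issue):

```python
# One non-ASCII byte in otherwise-ASCII data:
raw = b"caf\xe9"  # latin-1 encoded "café" -- not valid ASCII

ignored = raw.decode("ascii", errors="ignore")    # 'caf' -- byte silently dropped
replaced = raw.decode("ascii", errors="replace")  # 'caf\ufffd' -- byte replaced

# Neither result can reproduce the original bytes, so
# arr[i] = arr[i] would corrupt the data:
assert ignored.encode("ascii") == b"caf"
assert ignored.encode("ascii") != raw
assert replaced != "caf\xe9"
```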
* ASCII with errors='surrogateescape'

This would preserve the bytes and not crash out, so it meets the key
criteria.

* latin-1

This would also do exactly the correct thing for ASCII, preserve the bytes,
and not crash out. But it would additionally give you symbols useful to
european languages and scientific computing. Seems like a win-win to me.
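Both options really do round-trip every possible byte value -- a quick
sketch:

```python
# latin-1 maps each of the 256 byte values to a code point,
# so decoding never raises and encoding gets the bytes back:
raw = bytes(range(256))
text = raw.decode("latin-1")            # never fails
assert text.encode("latin-1") == raw    # bytes preserved exactly

# surrogateescape achieves the same round-trip on top of ASCII,
# but smuggles the high bytes through as lone surrogates:
text2 = raw.decode("ascii", errors="surrogateescape")
assert text2.encode("ascii", errors="surrogateescape") == raw
```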
As for my use-cases:
- Messy data:
I have had a lot of data sets with european text in them: mostly ASCII with
an occasional non-ASCII accented character or symbol. Most of these come
from legacy systems, and have an ugly, arbitrary combination of MacRoman,
Win-something-or-other, and who knows what -- i.e. mojibake, though at
least mostly ascii.
The only way to deal with it "properly" is to examine each string, try to
figure out which encoding it is in (hoping that at least each individual
string is in a single encoding), and then decode/encode it properly. So
numpy should support that -- which would be handled by a 'bytes' dtype,
just like in Python.
But sometimes that isn't practical, and it still doesn't work 100% -- in
which case, we can go with latin-1, and there will be some weird, incorrect
characters in there, and that is OK -- we fix them later when QA/QC or
users notice them -- really just like a typo.
But stripping the non-ascii characters out would be a worse solution. As
would "replace" -- sometimes it IS the correct symbol! (european encodings
aren't totally incompatible...). And surrogateescape is worse, too -- any
"weird" character is the same to my users, and at least sometimes latin-1
gives the right character -- however a surrogateescape gets printed, it
will never look right. (and can it even be handled by a non-python system?)
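The display problem is easy to demonstrate -- a surrogate-escaped
character can't even be re-encoded to UTF-8, so it can never "look right"
downstream:

```python
raw = b"r\xe9sum\xe9"                       # latin-1 "résumé"

via_latin1 = raw.decode("latin-1")          # 'résumé' -- displays sensibly
via_escape = raw.decode("ascii", errors="surrogateescape")

# Lone surrogates are invalid UTF-8, so printing or handing this string
# to a non-Python system fails right here:
try:
    via_escape.encode("utf-8")
except UnicodeEncodeError:
    pass

# while latin-1 still round-trips the original bytes:
assert via_latin1.encode("latin-1") == raw
```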
File names are one of the key reasons folks struggled with the python3 data
model (particularly on *nix) and why 'surrogateescape' was added. It's
pretty common to store filenames along with our data, and thus in numpy
arrays -- we need to preserve them exactly and display them mostly right.
Again, this is euro-centric, but if you are euro-centric, then latin-1 is a
good choice.

Granted, I should probably simply use a proper unicode type for filenames
anyway, but sometimes the data comes in already encoded as latin-something.
In the end I still see no downside to latin-1 over ascii-only -- only an
upside.
> I don't think that silently (mis)interpreting non-ASCII characters as
> latin-1/9 is a good idea, which is why I think it would be a mistake to use
> 'latin-1' for text data with unknown encoding.
If it's totally unknown, then yes -- but for the totally unknown, bytes is
the only reasonable option -- then run chardet or something over it.

But for "some latin encoding" -- latin-1 is a good choice.
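That workflow -- keep the raw bytes, try the encodings you suspect, fall
back to latin-1 as a last resort -- can be sketched roughly like this (the
function name and encoding list are my own illustration, not a proposed
API):

```python
def best_effort_decode(raw, encodings=("utf-8",)):
    """Try each suspected encoding in turn; fall back to latin-1.

    Callers could add e.g. "mac_roman" or "cp1252" to the candidate
    list; latin-1 never fails, though it may show the wrong glyphs.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1")

assert best_effort_decode(b"caf\xc3\xa9") == "café"  # valid UTF-8 wins
assert best_effort_decode(b"caf\xe9") == "café"      # latin-1 fallback
```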
> I could get behind a data type that compares equal to strings for ASCII
> only and allows for *storing* other characters, but making blind
> assumptions about characters 128-255 seems like a recipe for disaster.
> Imagine text[unknown] as a one character string type, but it supports
> .decode() like bytes and every character in the range 128-255 compares for
> equality with other characters like NaN -- not even equal to itself.
Would this be ascii with surrogateescape? -- almost, though I think the
surrogateescapes would compare equal if they were equal -- which, now that
I think about it, would be what you want -- why preserve the bytes if they
aren't an important part of the data?
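For what it's worth, surrogate-escaped bytes do already compare equal
exactly when the underlying bytes are equal (unlike the NaN-style
semantics suggested above):

```python
# Same unknown byte -> equal strings; different byte -> unequal.
a = b"x\xe9".decode("ascii", errors="surrogateescape")
b = b"x\xe9".decode("ascii", errors="surrogateescape")
c = b"x\xea".decode("ascii", errors="surrogateescape")

assert a == b   # identical escaped bytes compare equal
assert a != c   # different escaped bytes compare unequal
```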
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov