This is essentially my rant about use-case (2):

A compact dtype for mostly-ascii text:

On Mon, Apr 24, 2017 at 4:09 PM, Stephan Hoyer <> wrote:
On Mon, Apr 24, 2017 at 11:13 AM, Chris Barker <> wrote:
On the other hand, if this is the use-case, perhaps we really want an encoding closer to "Python 2" string, i.e, "unknown", to let this be signaled more explicitly. I would suggest that "text[unknown]" should support operations like a string if it can be decoded as ASCII, and otherwise error. But unlike "text[ascii]", it will let you store arbitrary bytes.

I _think_ that is what using latin-1 (or latin-9) gets you -- if it really is ASCII, then it's perfect. If it really is latin-*, then you get some extra useful stuff; and if it's corrupted somehow, you still get the ASCII text correct, and the rest won't barf and can be passed on through.
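
A quick plain-Python illustration of why latin-1 is "safe" in this sense: every byte value 0-255 maps to the code point of the same value, so decoding can never fail, and re-encoding restores the exact bytes:

```python
# Mostly-ASCII bytes with one non-ASCII byte (0xE9 is "é" in latin-1):
raw = b"caf\xe9 menu"

# latin-1 maps each byte 0-255 to the code point with the same value,
# so decoding never raises, and encoding restores the exact bytes:
text = raw.decode("latin-1")
assert text == "caf\xe9 menu"          # ASCII part intact, 0xE9 -> U+00E9
assert text.encode("latin-1") == raw   # lossless round trip
```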

I am totally in agreement with Thomas that "We are living in a messy world right now with messy legacy datasets that have character type data that are *mostly* ASCII, but not infrequently contain non-ASCII characters."

My question: What are those non-ASCII characters? How often are they truly latin-1/9 vs. some other text encoding vs. non-string binary data?

I am totally euro-centric, but as I understand it, that is the whole point of the desire for a compact one-byte-per-character encoding. If there is a strong need for other one-byte encodings (Shift-JIS, maybe?), then maybe we should support those too. But this all started with "mostly ASCII". My take on that is:

We don't want to use pure ASCII -- that is the hell that Python 2's default encoding approach led to -- it is MUCH better to pass garbage through than crash out with a UnicodeDecodeError -- data are messy, and people are really bad at writing comprehensive tests.

So we need something that handles ASCII properly, and can pass through arbitrary bytes as well without crashing. Options are:

* ASCII with errors='ignore' or 'replace'

I think that is a very bad idea -- it is tossing away information that _may_ have some use elsewhere:

  s = arr[i]  
  arr[i] = s

should put the same bytes back into the array.
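
To make the loss concrete (a plain-Python sketch of that round-trip requirement):

```python
raw = b"Jos\xe9"  # "José" in latin-1; 0xE9 is not valid ASCII

# errors='replace' throws the byte away -- the round trip is lossy:
s = raw.decode("ascii", errors="replace")
assert s == "Jos\ufffd"                                # U+FFFD replacement char
assert s.encode("ascii", errors="replace") == b"Jos?"  # original byte is gone

# errors='ignore' is even worse -- the character vanishes entirely:
assert raw.decode("ascii", errors="ignore") == "Jos"
```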

* ASCII with errors='surrogateescape'

This would preserve bytes and not crash out, so meets the key criteria.
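
For reference, this is how that round trip works in plain Python: the un-decodable byte is smuggled through as a lone surrogate code point, and encoding back recovers the exact bytes:

```python
raw = b"Jos\xe9"

# surrogateescape maps the un-decodable byte 0xE9 to the lone
# surrogate U+DCE9, so the original bytes are fully recoverable:
s = raw.decode("ascii", errors="surrogateescape")
assert s == "Jos\udce9"
assert s.encode("ascii", errors="surrogateescape") == raw
```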

* latin-1

This would do exactly the correct thing for ASCII, preserve the bytes, and not crash out. But it would also allow additional symbols useful to European languages and scientific computing. Seems like a win-win to me.
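
For instance, a handful of characters common in European and scientific text each fit in a single latin-1 byte, while pure ASCII rejects all of them:

```python
# Each of these non-ASCII characters fits in one byte under latin-1:
for ch in ["é", "ñ", "ü", "°", "µ", "±", "²", "½"]:
    assert len(ch.encode("latin-1")) == 1

# Pure ASCII, by contrast, can't represent any of them:
try:
    "µ".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert not ascii_ok
```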

As for my use-cases:

 - Messy data:

I have had a lot of data sets with European text in them: mostly ASCII, with the occasional non-ASCII accented character or symbol. Most of these come from legacy systems, and have an ugly, arbitrary combination of MacRoman, Windows-something-or-other, and who knows what -- i.e. mojibake, though at least mostly ASCII.

The only way to deal with it "properly" is to examine each string, try to figure out which encoding it is in (hoping that each string, at least, is in a single encoding), and then decode/encode it properly. So numpy should support that -- which would be handled by a 'bytes' type, just like in Python itself.
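
A minimal sketch of that "figure out the encoding" step in plain Python (the candidate list is just an illustration -- a real tool like chardet does statistical detection instead):

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "latin-1")):
    """Try candidate encodings in order; latin-1 never fails,
    so it serves as the catch-all fallback."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue

# b"Jos\xe9" is neither valid ASCII nor valid UTF-8, so it falls
# through to latin-1 and decodes as "José":
text, enc = guess_decode(b"Jos\xe9")
assert (text, enc) == ("Jos\xe9", "latin-1")
```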

But sometimes that isn't practical, and still doesn't work 100% -- in which case, we can go with latin-1, and there will be some weird, incorrect characters in there, and that is OK -- we fix them later when QA/QC or users notice it -- really just like a typo.

But stripping the non-ASCII characters out would be a worse solution. As would "replace" -- sometimes it IS the correct symbol! (European encodings aren't totally incompatible...). And surrogateescape is worse, too -- any "weird" character is the same to my users, and at least sometimes latin-1 will give the right character -- however a surrogate escape gets displayed, it will never look right. (And can it even be handled by a non-Python system?)
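
That display problem is concrete: a string holding surrogate escapes can't even be encoded to UTF-8 for output, so any downstream path that isn't surrogate-aware will choke on it:

```python
s = b"Jos\xe9".decode("ascii", errors="surrogateescape")

# The lone surrogate U+DCE9 is not valid in UTF-8, so ordinary output
# paths (files, terminals, other systems) raise instead of displaying it:
try:
    s.encode("utf-8")
    displayable = True
except UnicodeEncodeError:
    displayable = False
assert not displayable
```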

 - filenames

File names are one of the key reasons folks struggled with the Python 3 data model (particularly on *nix), and why 'surrogateescape' was added. It's pretty common to store filenames in with our data, and thus in numpy arrays -- we need to preserve them exactly and display them mostly right. Again, this is euro-centric, but if you are euro-centric, then latin-1 is a good choice for this.

Granted, I should probably simply use a proper unicode type for filenames anyway, but sometimes the data comes in already encoded as latin-something.

In the end I still see no downside to latin-1 over ascii-only -- only an upside.

I don't think that silently (mis)interpreting non-ASCII characters as latin-1/9 is a good idea, which is why I think it would be a mistake to use 'latin-1' for text data with unknown encoding.

If it's totally unknown, then yes -- but for totally unknown data, bytes is the only reasonable option -- then run chardet or something over it.

But for "some latin encoding" -- latin-1 is a good choice.

I could get behind a data type that compares equal to strings for ASCII only and allows for *storing* other characters, but making blind assumptions about characters 128-255 seems like a recipe for disaster. Imagine text[unknown] as a one character string type, but it supports .decode() like bytes and every character in the range 128-255 compares for equality with other characters like NaN -- not even equal to itself.

Would this be ASCII with surrogateescape? Almost, though I think the surrogate escapes would compare equal if the underlying bytes were equal -- which, now that I think about it, would be what you want -- why preserve the bytes if they aren't an important part of the data?
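
A toy sketch of the two comparison rules being contrasted (the function names are mine, purely for illustration): Stephan's NaN-like rule versus the "compare the escaped bytes" variant I'm describing:

```python
def nan_like_eq(a: bytes, b: bytes) -> bool:
    # Stephan's rule: any byte >= 128 poisons the comparison,
    # so such strings are not even equal to themselves.
    if any(x >= 128 for x in a) or any(x >= 128 for x in b):
        return False
    return a == b

def escaped_bytes_eq(a: bytes, b: bytes) -> bool:
    # The variant I mean: escaped bytes compare equal when the
    # underlying bytes are equal -- i.e. just compare raw bytes.
    return a == b

assert nan_like_eq(b"abc", b"abc")
assert not nan_like_eq(b"Jos\xe9", b"Jos\xe9")   # not equal to itself
assert escaped_bytes_eq(b"Jos\xe9", b"Jos\xe9")
```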



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception