[Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern robert.kern at gmail.com
Mon Apr 24 19:08:50 EDT 2017


Chris, you've mashed all of my emails together; some of them were in reply
to you, some in reply to others. Unfortunately, this dropped a lot of the
context from each of them, and it appears to be creating some
misunderstandings about what each person is advocating.

On Mon, Apr 24, 2017 at 2:00 PM, Chris Barker <chris.barker at noaa.gov> wrote:
>
> On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern at gmail.com>
wrote:

>> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>
> I agree -- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through Python's encoding/decoding machinery to do it? Do
numpy arrays HAVE to be storing utf-8 natively?

If the point is to have an array that transparently accepts/yields
`unicode/str` scalars while maintaining the in-memory encoding, yes. If
that's not the point, then IMO the status quo is fine, and *no* new dtypes
should be added, just maybe some utility functions to convert between the
bytes-ish arrays and the Unicode-holding arrays (which was one of my
proposals). I am mostly happy to live in a world where I read in data as
bytes-ish arrays, decode into `object` arrays holding `unicode/str`
objects, do my manipulations, then encode the array into a bytes-ish array
to give to the C API or file format.
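
To make that concrete, here is a rough sketch of that workflow using
nothing but today's NumPy (the utf-8 encoding and the 'S8' width are just
placeholders for whatever the C API or file format actually dictates):

import numpy as np

# bytes-ish array, as it might come from a file reader or C API
raw = np.array([b'foo', b'na\xc3\xafve'], dtype='S8')

# decode into an object array holding unicode/str objects
text = np.array([b.decode('utf-8') for b in raw], dtype=object)

# ... do my manipulations on real str objects ...
text = text + '!'

# encode back into a bytes-ish array to hand off again
out = np.array([s.encode('utf-8') for s in text], dtype='S8')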

>> or leave it be until someone else is willing to solve that problem. I
don't think we're at the bikeshedding stage yet; we're still disagreeing
about fundamental requirements.
>
> yeah -- though I've seen projects get stuck in the sorting out what to
do, so nothing gets done stage before -- I don't want Julian to get too
frustrated and end up doing nothing.

I understand, but not all tedious discussions that have not yet achieved
consensus are bikeshedding to be cut short. We couldn't really decide what
to do back in the pre-1.0 days either, so we just did *something*, and that
something is now the very situation that Julian has a problem with.

We have more experience now, especially with the added wrinkles of Python
3; other projects have advanced and matured their Unicode string
array-handling (e.g. pandas and HDF5); now is a great time to have a real
discussion about what we *need* before we make decisions about what we
should *do*.

> So here I'll lay out what I think are the fundamental requirements:
>
> 1) The default behaviour for numpy arrays of strings is compatible with
Python3's string model: i.e. fully unicode-supporting, and with a
character-oriented interface. That is, if you do:
>
> arr = np.array(("this", "that",))
>
> you get an array that can store ANY unicode string with 4 or fewer
characters
>
> and arr[1] will return a native Python string object.
>
> 2) There be some way to store mostly ascii-compatible strings in a
single-byte-per-character array -- so as not to waste space for "typical
european-oriented data".
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> attempting to put in a string that is not compatible with the encoding
would raise an encoding error.
>
> I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the
encoding in this case.
>
> 3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)
>
> 4) a fixed-length bytes dtype -- pretty much what 'S' is now under Python
3 -- settable from a bytes or bytearray object, and returning a bytes
object.
>  - you could use astype() to convert between bytes and a specified
encoding with no change in binary representation.

You'll need to specify what NULL-terminating behavior you want here.
np.string_ has NULL-termination. np.void (which could be made to work
better with `bytes`) does not. Both have use-cases for text encoding
(shakes fist at UTF-16).
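
To illustrate the difference I mean, with plain NumPy as it exists today
(the values are arbitrary):

import numpy as np

a = np.array([b'ab\x00'], dtype='S3')
a[0]          # b'ab' -- np.string_ strips trailing NULs on the way out
a.tobytes()   # b'ab\x00' -- even though the NUL is physically stored

So, for example, UTF-16-encoded text (which legitimately contains NUL
bytes) cannot round-trip through an 'S' array, while an np.void field
would keep every byte.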

> 2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all Python built-in encodings -- though I think
aliases for the common cases would be good -- latin-1, utf-8. If so, the
length would have to be specified in bytes.
>
> 1) could be covered with the existing 'U' type -- the only downside being
some wasted space -- or with a pointer-to-a-python-string dtype -- which
would also waste space, though less for long-ish strings, and maybe give us
some better access to the nifty built-in string features.
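
(For concreteness, this is what that wasted space looks like today, with
Chris's example array from above:

import numpy as np
np.array(("this", "that")).dtype     # dtype('<U4')
np.array(("this", "that")).itemsize  # 16 -- 4 bytes of UCS-4 per character

so ASCII-ish data stored as 'U' takes 4x the space it needs.)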
>
>> > +1.  The key point is that there is a HUGE amount of legacy science
data in the form of FITS (an astronomy-specific binary file format that has
been the primary file format for 20+ years) and HDF5, which use a character
data type to store data which can be bytes 0-255.  Getting a
decoding/encoding error when trying to deal with these datasets is a
non-starter from my perspective.
>
>> That says to me that these are properly represented by `bytes` objects,
not `unicode/str` objects encoding to and decoding from a hardcoded latin-1
encoding.
>
> Well, yes -- BUT: that strictness in Python 3 -- "data is either text or
bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- bit
Python 3 in the butt for a long time. Folks that deal in the messy real
world of binary data that is kinda-mostly text, but may have a bit of
binary data, or be in an unknown encoding, or be corrupted, were very, very
adamant about how this model DID NOT work for them. Very influential people
were seriously critical of Python 3. Eventually, py3 added bytes string
formatting, surrogateescape, and other features that facilitate working
with messy, almost-text data.

Walk me through a problem that you've encountered with such textish data in
arrays. I know the problems in Web protocol-land, but they are not really
relevant to us. What are *your* problems? Why didn't those ameliorations
that were added for the Web world address your problems? I really want to
get at specific use cases that interact with numpy, not handwaving at
problems other people have had in other contexts.

> Practicality beats purity -- if you have one-byte-per-char data that is
mostly european, then latin-1 or latin-9 lets you work with it, have it
mostly work, and never crash out with an encoding error.
>
>> > - round-tripping of binary data (at least with Python's
encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
re-encoded to get the same bytes back. You may get garbage, but you won't
get an EncodingError.
>> But what if the format I'm working with specifies another encoding? Am I
supposed to encode all of my Unicode strings in the specified encoding,
then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
really important use case for me.
>
> latin-1 would be only for the special case of mostly-ascii (or true
latin) one-byte-per-char encodings (which is a common use-case in
scientific data sets). I think it has only upside over ascii. It would be a
fine idea to support any one-byte-per-char encoding, too.

In my experience, it has both upside and downside. Silently creating
mojibake is a problem. The process that you described, decoding ANY strings
of bytes as latin-1, can create mojibake. The inverse, encoding then
decoding, may not, but of course the encoding step there does not accept
arbitrary Unicode strings.
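
For example (the particular bytes are just an illustration): decoding UTF-8
data as latin-1 round-trips the bytes but hands you nonsense to look at:

raw = b'na\xc3\xafve'           # the UTF-8 encoding of 'naïve'
s = raw.decode('latin-1')       # 'naÃ¯ve' -- mojibake, but no exception
s.encode('latin-1') == raw      # True -- the original bytes round-trip exactly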

> As for external data in utf-8 -- yes, that should be dealt with properly
-- either by truly supporting utf-8 internally, or by properly
encoding/decoding when putting it into and moving it out of an array.
>
> utf-8 is a very important encoding -- I just think it's the wrong one for
the default interplay with python strings.
>
>>  Doing in-memory assignments to a fixed-encoding, fixed-width string
dtype will always have this kind of problem. You should only put up with it
if you have a requirement to write to a format that specifies the width and
the encoding. That specified encoding is frequently not latin-1!
>
> of course not -- if you are writing to a format that specifies a width
and the encoding, you want to use bytes :-) -- or a dtype that is properly
encoding-aware. I was not suggesting that latin-1 be used for arbitrary
bytes -- that is what bytes are for.

Ah, your message was responding to Stephan who questioned why latin-1
should be the default encoding for the `unicode/str`-aware string dtype. It
seemed like you were affirming that latin-1 ought to be that default. It
seems like that is not your position, but you are defending the existence
of a latin-1 dtype for specific uses.

>> I'm happy to consider a latin-1-specific dtype as a second,
workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake
option.
>
> well, it wouldn't create mojibake -- anything that went from a python
string to a latin-1 array would be properly encoded in latin-1 -- unless it
came from already-corrupted data. But when you have corrupted data, your
only choices are to:
>
>  - raise an error
>  - alter the data (error-"replace")
>  - pass the corrupted data on through.
>
> but it could deal with mojibake -- that's the whole point :-)

You are right that assigning a `unicode/str` object into my latin-1-dtype
array would not create mojibake, but that's not the only way to fill a
numpy array.

In the context of my email, I was responding to a use case being floated
for the latin-1 dtype: reading existing FITS files that have fields that
are text-ish -- plain octets according to the file format standard, but in
practice mostly ASCII with a few sparse high-bit characters, typically from
some unspecified iso-8859-* encoding. If that unspecified encoding wasn't
latin-1, then I'm getting mojibake when I read the file (unless, happy
days, the author of the file was also using latin-1).

I understand that you are proposing a latin-1 dtype in a context with other
dtypes and tools that might make that use of the latin-1 dtype obsolete.
However, there are others who have been proposing just a latin-1 dtype for
this purpose.

Let me make a counter-proposal for your latin-1 dtype (your #2) that might
address your, Thomas's, and Julian's use cases:

2) We want a single-byte-per-character, NULL-terminated string dtype that
can be used to represent mostly-ASCII textish data that may have some
high-bit characters from some 8-bit encoding. It should be able to read
arbitrary bytes (that is, up to the NULL-termination) and write them back
out as the same bytes if unmodified. This lets us read this text from files
where the encoding is unspecified (or is lying about the encoding) into
`unicode/str` objects. The encoding is specified as `ascii` but the
decoding/encoding is done with the `surrogateescape` option so that
high-bit characters are faithfully represented in the `unicode/str` string
but are not erroneously reinterpreted as other characters from an arbitrary
encoding.
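
In other words, the dtype would do (in effect) what plain Python already
does here -- this is just standard-library behavior, not new machinery:

raw = b'mostly ascii \xe9\xff'                # arbitrary high-bit bytes, encoding unknown
s = raw.decode('ascii', 'surrogateescape')    # 'mostly ascii \udce9\udcff'
s.encode('ascii', 'surrogateescape') == raw   # True -- writes back out as the same bytes
# and, unlike latin-1, '\xe9' is never silently presented as 'é'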

I'd even be happy if Julian or someone wants to go ahead and implement this
right now and leave the UTF-8 dtype for a later time.

As long as this ASCII-surrogateescape dtype is not called np.realstring
(it's *really* important to me that the bikeshed not be this color). ;-)

--
Robert Kern