[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Mon Apr 24 17:00:13 EDT 2017


On Mon, Apr 24, 2017 at 11:36 AM, Robert Kern <robert.kern at gmail.com> wrote:

> > I agree -- it is a VERY common case for scientific data sets. But a
> one-byte-per-char encoding would handle it nicely, or UCS-4 if you want
> Unicode. The wasted space is not that big a deal with short strings...
>
> Unless if you have hundreds of billions of them.
>

Which is why a one-byte-per-char encoding is a good idea.

> Solve the HDF5 problem (i.e. fixed-length UTF-8 strings)
>

I agree -- binary compatibility with utf-8 is a core use case -- though is
it so bad to go through Python's encoding/decoding machinery to do it? Do
numpy arrays HAVE to store utf-8 natively?


> or leave it be until someone else is willing to solve that problem. I
> don't think we're at the bikeshedding stage yet; we're still disagreeing
> about fundamental requirements.
>

yeah -- though I've seen projects get stuck in the "sorting out what to do,
so nothing gets done" stage before -- I don't want Julian to get so
frustrated that he ends up doing nothing.

So here I'll lay out what I think are the fundamental requirements:

1) The default behaviour for numpy arrays of strings is compatible with
Python 3's string model: i.e. fully Unicode-supporting, with a
character-oriented interface. i.e. if you do:

arr = np.array(("this", "that",))

you get an array that can store ANY Unicode string with 4 or fewer
characters,

and arr[1] will return a native Python string object.
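That is, in fact, what the current 'U' dtype already does (at the cost of 4
bytes per character); a quick check of the behavior described above:

```python
import numpy as np

# The existing 'U' dtype already gives requirement 1: any Unicode
# string of up to 4 characters fits, and indexing returns a str.
arr = np.array(("this", "that"))
print(arr.dtype)          # sized in characters, stored as UCS-4
arr[0] = "ωξπφ"           # any 4-character Unicode string is accepted
print(type(arr[1]))       # np.str_, a subclass of the native Python str
```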

2) There be some way to store mostly-ascii-compatible strings in a
single-byte-per-character array -- so as not to waste space for "typical
european-oriented data".

arr = np.array(("this", "that",), dtype=np.single_byte_string)

(name TBD)

and arr[1] would return a python string.

Attempting to put in a string that is not compatible with the encoding
would raise an EncodingError.

I highly recommend that ISO 8859-15 (latin-9) or latin-1 be the encoding in
this case.
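A hypothetical np.single_byte_string dtype doesn't exist yet, but Python's
codec machinery shows the assignment behavior being proposed (encode on the
way in, reject what doesn't fit):

```python
# Sketch of the proposed assignment semantics using plain Python codecs
# (the dtype itself is hypothetical -- this is just the encode step).
ok = "café".encode("latin-1")       # every char fits in one byte
print(ok)                           # b'caf\xe9'

try:
    "日本語".encode("latin-1")       # not representable in latin-1
except UnicodeEncodeError:
    print("raises an encoding error, as requirement 2 specifies")
```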

3) There be a dtype that could store strings in null-terminated utf-8
binary format -- for interchange with other systems (netcdf, HDF, others???)

4) A fixed-length bytes dtype -- pretty much what 'S' is now under Python 3
-- settable from a bytes or bytearray object, and returning a bytes
object.
 - you could use astype() to convert between bytes and a specified encoding
with no change in binary representation.
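Today the closest thing is the 'S' dtype plus np.char.decode/np.char.encode,
which convert with an explicit codec (these copy the data, whereas the
astype() described above would ideally reinterpret the same bytes):

```python
import numpy as np

# 'S' holds fixed-length bytes; np.char.decode/encode convert between
# bytes and text with an explicit codec (by copying -- not in place).
raw = np.array([b"caf\xe9", b"th\xe9"], dtype="S4")
text = np.char.decode(raw, "latin-1")    # -> Unicode array
back = np.char.encode(text, "latin-1")   # -> bytes again, losslessly
print(text)    # ['café' 'thé']
```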

2) and 3) could be fully covered by a dtype with a settable encoding that
might as well support all Python built-in encodings -- though I think
aliases for the common cases would be good -- latin-1, utf-8. If so, the
length would have to be specified in bytes.

1) could be covered with the existing 'U' dtype -- the only downside being
some wasted space -- or with a pointer-to-a-Python-string dtype -- which
would also waste space, though less so for long-ish strings, and maybe give
us better access to the nifty built-in string features.

> +1.  The key point is that there is a HUGE amount of legacy science data
> in the form of FITS (astronomy-specific binary file format that has been
> the primary file format for 20+ years) and HDF5 which uses a character data
> type to store data which can be bytes 0-255.  Getting a decoding/encoding
> error when trying to deal with these datasets is a non-starter from my
> perspective.


> That says to me that these are properly represented by `bytes` objects,
> not
> `unicode/str` objects encoding to and decoding from a hardcoded latin-1
> encoding.


Well, yes -- BUT: that strictness in Python 3 -- "data is either text or
bytes, and text in an unknown (or invalid) encoding HAS to be bytes" -- was
a sore point with Python 3 for a long time. Folks who deal in the messy
real world of binary data that is kinda-mostly text, but may have a bit of
binary data, or be in an unknown encoding, or be corrupted, were very, very
adamant that this model DID NOT work for them. Very influential people were
seriously critical of Python 3. Eventually, py3 added bytes string
formatting, surrogateescape, and other features that facilitate working
with messy almost-text.

Practicality beats purity -- if you have one-byte-per-char data that is
mostly European, then latin-1 or latin-9 lets you work with it, have it
mostly work, and never crash out with an encoding error.
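That "never crash" property is easy to verify: every byte value 0-255 maps
to a code point under latin-1, so any binary blob round-trips:

```python
# latin-1 assigns a code point to every byte value, so arbitrary
# binary data decodes without error and re-encodes to the same bytes.
blob = bytes(range(256))               # all 256 possible byte values
text = blob.decode("latin-1")          # never raises
assert text.encode("latin-1") == blob  # lossless round-trip
print(len(text))                       # 256
```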

> - round-tripping of binary data (at least with Python's
> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
> re-encoded to get the same bytes back. You may get garbage, but you won't
> get an EncodingError.
> But what if the format I'm working with specifies another encoding? Am I
> supposed to encode all of my Unicode strings in the specified encoding,
> then decode as latin-1 to assign into my array? HDF5's UTF-8 arrays are a
> really important use case for me.


latin-1 would be only for the special case of mostly-ascii (or true latin)
one-byte-per-char encodings (which is a common use-case in scientific data
sets). I think it has only upside over ascii. It would be a fine idea to
support any one-byte-per-char encoding, too.

As for external data in utf-8 -- yes, that should be dealt with properly --
either by truly supporting utf-8 internally, or by properly
encoding/decoding when moving it into and out of an array.
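One sketch of that encode/decode-at-the-boundary approach, using the
existing 'S' dtype as the fixed-width UTF-8 container (note the width must
be given in bytes, since UTF-8 characters vary in size):

```python
import numpy as np

# Move Python strings into a fixed-width UTF-8 byte buffer and back.
# Note the width is in *bytes*: "naïve" is 5 characters but 6 bytes
# in UTF-8, so the field must be sized for the encoded form.
strings = ["this", "naïve"]
encoded = [s.encode("utf-8") for s in strings]
width = max(len(b) for b in encoded)        # 6 bytes here
buf = np.array(encoded, dtype=f"S{width}")  # what would go to HDF5/netcdf
decoded = np.char.decode(buf, "utf-8")
print(decoded)    # ['this' 'naïve']
```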

utf-8 is a very important encoding -- I just think it's the wrong one for
the default interplay with python strings.

 Doing in-memory assignments to a fixed-encoding, fixed-width string dtype
> will always have this kind of problem. You should only put up with it if
> you have a requirement to write to a format that specifies the width and
> the encoding. That specified encoding is frequently not latin-1!
>

of course not -- if you are writing to a format that specifies a width and
an encoding, you want to use bytes :-) -- or a dtype that is properly
encoding-aware. I was not suggesting that latin-1 be used for arbitrary
bytes -- that is what bytes are for.

> - round-tripping of binary data (at least with Python's
> encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and
> re-encoded to get the same bytes back. You may get garbage, but you won't
> get an EncodingError.
>


> But what if the format I'm working with specifies another encoding? Am I
> supposed to encode all of my Unicode strings in the specified encoding,
> then decode as latin-1 to assign into my array?


of course not -- see above.

I'm happy to consider a latin-1-specific dtype as a second,
> workaround-for-specific-applications-only-you-have-been-
> warned-you're-gonna-get-mojibake option.


well, it wouldn't create mojibake -- anything that went from a Python
string to a latin-1 array would be properly encoded in latin-1 -- unless it
came from already-corrupted data. But when you have corrupted data, your
only choices are to:

 - raise an error
 - alter the data (error-"replace")
 - pass the corrupted data on through.

but it could deal with mojibake -- that's the whole point :-)
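Those three choices map directly onto Python's codec error handlers
(surrogateescape being the "pass it through" option):

```python
# The three choices, via Python's error handlers on a byte string
# that is invalid as UTF-8:
bad = b"caf\xe9"   # latin-1-encoded bytes, broken if read as UTF-8

# 1) raise an error (errors="strict" is the default)
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("raised")

# 2) alter the data
print(bad.decode("utf-8", errors="replace"))   # 'caf\ufffd'

# 3) pass the corrupted data through, recoverably
s = bad.decode("utf-8", errors="surrogateescape")
assert s.encode("utf-8", errors="surrogateescape") == bad
```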


> It should not be *the* Unicode string dtype (i.e. named np.realstring or
> np.unicode as in the original proposal).


God no -- sorry if it looked like I was suggesting that. I only suggest
that it might be *the* one-byte-per-char string type.

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov

