[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 13:02:17 EDT 2017

Now my proposal for the other use cases:

> 2) There should be some way to store mostly ascii-compatible strings in a
> single byte-per-character array -- so as not to waste space on "typical
> european-language-oriented data". Note: this should ALSO be compatible with
> Python's character-oriented string model, i.e. a Python string of length
> N will fit into a dtype of size N.
>
> arr = np.array(("this", "that",), dtype=np.single_byte_string)
>
> (name TBD)
>
> and arr[1] would return a python string.
>
> Attempting to store a string that is not compatible with the encoding
> would raise an EncodingError.
>
> This is also a use case primarily for "casual" users -- but ones who are
> concerned with the size of the data storage and know that they are using
> european text.
>

More detail elsewhere -- but either ascii with surrogateescape or latin-1
would be a good option here. I prefer latin-1 (I really see no downside),
but others disagree...
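
To make the two options concrete, here is a minimal sketch using only the
stdlib codecs (no numpy involved; the variable names are just for
illustration). Both approaches round-trip arbitrary bytes and both keep one
character per byte, which is the "length N fits in size N" property from
the proposal:

```python
# Every possible byte value, to exercise both options.
raw = bytes(range(256))

# latin-1 maps each byte 0-255 to the code point with the same number,
# so any byte string decodes and the round trip is lossless.
as_latin1 = raw.decode("latin-1")
latin1_back = as_latin1.encode("latin-1")

# ascii + surrogateescape: bytes >= 0x80 become lone surrogates
# (U+DC80..U+DCFF), which encode back to the original bytes.
as_escaped = raw.decode("ascii", errors="surrogateescape")
escaped_back = as_escaped.encode("ascii", errors="surrogateescape")

# One character per byte, so a string of length N fits in N bytes.
text = "caf\u00e9"
n_chars = len(text)
n_bytes = len(text.encode("latin-1"))
```

The difference is in what the user sees: with latin-1, byte 0xE9 reads back
as a real character; with surrogateescape it reads back as a lone surrogate
that can only be encoded again with the same error handler.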

But then we get to:

> 3) dtypes that support storage in particular encodings:
>

We need utf-8. We may need others. We may need a compact 1-byte-per-char
encoding for text that ascii and latin-1 are not close enough to be useful
for (say, shift-jis). And I don't think we are going to come to a consensus
on what "single" encoding to use for 1-byte-per-char.

So really -- going back to Julian's earlier proposal:

 - a dtype with an encoding specified

 - a "size" in bytes

Once defined, numpy would encode/decode to/from python strings "correctly".

We might need "null-terminated utf-8" as a special case.

That would support all the other use cases.
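
A rough sketch of what "an encoding plus a size in bytes" means in practice,
in plain Python rather than as a real dtype. The pack() and unpack() helpers
are hypothetical names made up for this example, and null padding is an
assumption about how the unused bytes would be filled:

```python
def pack(text, encoding, size):
    """Encode text and null-pad it to exactly `size` bytes (hypothetical helper)."""
    data = text.encode(encoding)
    if len(data) > size:
        raise ValueError(f"{len(data)} bytes do not fit in {size}")
    return data.ljust(size, b"\x00")


def unpack(buf, encoding):
    """Strip the trailing null padding and decode back to a str."""
    return buf.rstrip(b"\x00").decode(encoding)


word = "na\u00efve"  # 5 characters

# The same 5 characters need a different number of bytes per encoding.
sizes = {enc: len(word.encode(enc)) for enc in ("latin-1", "utf-8", "utf-32-le")}

stored = pack(word, "utf-8", 8)
restored = unpack(stored, "utf-8")
```

Note that stripping null padding like this only works for encodings that
never produce a zero byte for real text (utf-8 and latin-1 are fine;
utf-16 and utf-32 are not), and that a string which legitimately ends in
a NUL character would lose it.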

Even the one-byte-per-char encoding. I'd like to see a clean alias to a
latin-1 encoding, but that's not a big deal.

That leaves a few decisions:

 - error out or truncate if the passed-in string is too long?

 - error out or surrogateescape if there are invalid bytes in the data?

 - error out or something else if there are characters that can't be
encoded in the specified encoding?
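
Here is how each of those three choices looks with today's stdlib codecs.
This is only a sketch of the options, not a proposal for which one numpy
should pick; truncate_utf8() is a hypothetical helper written for this
example:

```python
def truncate_utf8(text, size):
    """Cut the utf-8 encoding of text to at most `size` bytes without
    leaving a partial multi-byte character at the end (hypothetical helper)."""
    data = text.encode("utf-8")[:size]
    # The input was valid utf-8, so only the tail can be broken;
    # errors="ignore" drops that incomplete trailing sequence, if any.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


# 1) Too long: this string is 5 bytes in utf-8; cutting at 4 would split
#    the last character, so the whole character is dropped instead.
cut = truncate_utf8("caf\u00e9", 4)

# 2) Invalid bytes in the data: strict raises, surrogateescape keeps them.
bad = b"caf\xe9"  # latin-1 bytes, not valid utf-8
try:
    bad.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
escaped = bad.decode("utf-8", errors="surrogateescape")

# 3) Unencodable characters: strict raises, "replace" substitutes "?".
try:
    "\u20ac5".encode("latin-1")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
replaced = "\u20ac5".encode("latin-1", errors="replace")
```

Truncating on a byte boundary is the thing to watch with utf-8: a naive
slice can leave half a character in the array, which then fails to decode.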

And we still need a proper bytes type:

> 4) a fixed length bytes dtype -- pretty much what 'S' is now under Python
> 3 -- settable from a bytes or bytearray object (or other memoryview?),
> and returns a bytes object.
>
> You could use astype() to convert between bytes and a specified encoding
> with no change in binary representation. This could be used to store any
> binary data, including encoded text or anything else. This should map
> directly to the Python bytes model -- thus NOT null-terminated.
>
> This is a little different than 'S' behaviour on py3 -- it appears that
> with 'S', if ALL the trailing bytes are null, then they are truncated, but
> if there is a null byte in the middle, then it is preserved. I suspect that
> this is a legacy from Py2's use of "strings" as both text and binary data.
> But in py3, a "bytes" type should be about bytes, and not text, and thus
> null-valued bytes are simply another value a byte can hold.
>

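The current 'S' behaviour described above is easy to see with a small
example (existing numpy API only; the array contents are made up for
illustration):

```python
import numpy as np

arr = np.array([b"ab\x00\x00", b"a\x00b\x00"], dtype="S4")

first = arr[0]    # trailing nulls are dropped on the way out
second = arr[1]   # the embedded null survives, the trailing one does not

# The underlying buffer still holds all 8 bytes unchanged.
raw = arr.tobytes()
```

So the data is all there in memory, but a value that really does end in
null bytes can't be read back intact through indexing, which is why a
separate "pure bytes" dtype would be useful.
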
-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov