[Numpy-discussion] proposal: smaller representation of string arrays

Anne Archibald peridot.faceted at gmail.com
Tue Apr 25 13:12:54 EDT 2017


On Tue, Apr 25, 2017 at 6:05 PM Chris Barker <chris.barker at noaa.gov> wrote:

> Anyway, I think I made the mistake of mingling possible solutions in with
> the use-cases, so I'm not sure if there is any consensus on the use cases
> -- which I think we really do need to nail down first -- as Robert has made
> clear.
>

I would make my use-cases more user-specific:

1) User wants an array with numpy indexing tricks that can hold python
strings but doesn't care about the underlying representation.
-> Solvable with object arrays, or Robert's string-specific object arrays;
underlying representation is python objects on the heap. Sadly UCS-4, so
zillions are going to be a memory problem.
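
A minimal sketch of case 1 with a plain object array, which is the existing mechanism rather than Robert's proposed string-specific variant. The sample strings are made up for illustration.

```python
import numpy as np

# Object array: each element is a reference to an ordinary Python str.
names = np.array(["alpha", "β-decay", "a much longer string"], dtype=object)

# numpy indexing tricks still work on it.
mask = np.array([True, False, True])
picked = names[mask]           # boolean indexing
reversed_names = names[::-1]   # slicing

# Elements are real Python strings, so str methods are available.
upper = [s.upper() for s in picked]

# The array buffer holds only pointers; the string data lives on the heap.
pointer_bytes = names.itemsize
```

The array itself costs one pointer per element; the memory concern is the per-string objects that those pointers refer to.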

2) User has to deal with fixed-width binary data from an external
program/library and wants to see it as python strings. This may be
systematically encoded in a known encoding (e.g. HDF5's
fixed-storage-length zero-padded UTF-8 strings, spec-observing FITS'
zero-padded ASCII) or
ASCII-with-exceptions-and-the-user-is-supposed-to-know (e.g. spec-violating
FITS files with zero-padded latin-9, koi8-r, cp1251, or whatever). Length
may be signaled by null termination, null padding, or space padding.
-> Solvable with a fixed-storage-size encoded-string dtype, as long as it
has a parameter for how length is signaled. Python tricks for dealing with
wrong or unknown encodings can make bogus data manageable.
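
To make the "parameter for how length is signaled" concrete, here is a rough sketch in today's numpy. The helper decode_fixed and its length argument are hypothetical names invented for this illustration, not an existing or proposed numpy API; the sample field contents are also made up.

```python
import numpy as np

def decode_fixed(raw, width, encoding, length="null_padded", errors="strict"):
    """Split raw bytes into fixed-width fields and decode each one to str.

    `length` says how the end of each string is signaled (hypothetical
    parameter): "null_padded", "null_terminated" or "space_padded".
    """
    # Use a void dtype so numpy hands back every byte of each field.
    fields = np.frombuffer(raw, dtype=f"V{width}")
    out = np.empty(len(fields), dtype=object)
    for i, field in enumerate(fields):
        data = field.tobytes()
        if length == "null_terminated":
            data = data.split(b"\x00", 1)[0]   # ignore anything after first NUL
        elif length == "null_padded":
            data = data.rstrip(b"\x00")
        elif length == "space_padded":
            data = data.rstrip(b" ")
        else:
            raise ValueError(f"unknown length convention: {length!r}")
        out[i] = data.decode(encoding, errors=errors)
    return out

# Zero-padded UTF-8 in 8-byte fields.
utf8_raw = "naïve".encode("utf-8").ljust(8, b"\x00") + b"abc".ljust(8, b"\x00")
utf8_strings = decode_fixed(utf8_raw, 8, "utf-8")

# Space-padded ASCII in 8-byte fields.
ascii_strings = decode_fixed(b"M31     NGC 224 ", 8, "ascii", length="space_padded")

# Null-terminated field with junk after the terminator.
terminated = decode_fixed(b"hi\x00junk!", 8, "ascii", length="null_terminated")

# Field that is really latin-1: guessing UTF-8 with errors="replace" keeps
# the data viewable, while the right encoding recovers the text.
bogus = decode_fixed(b"caf\xe9\x00\x00\x00\x00", 8, "utf-8", errors="replace")
right = decode_fixed(b"caf\xe9\x00\x00\x00\x00", 8, "latin-1")
```

A real dtype would carry the encoding and the length convention in its parameters instead of passing them on every call; the point here is only that the same bytes need both pieces of information to be read correctly.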

3) User has to deal with fixed-width binary data from an external
program/library that really is binary bytes.
-> Solvable with a dtype that returns fixed-length byte strings.
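
For illustration, the existing void dtype already behaves this way, whereas the S dtype drops trailing NUL bytes on element access and so is unsafe for real binary data. The sample bytes below are made up.

```python
import numpy as np

raw = b"\x01\x00\x02\x00\x00\x00"   # three 2-byte binary records

as_strings = np.frombuffer(raw, dtype="S2")   # treated as NUL-padded strings
as_bytes = np.frombuffer(raw, dtype="V2")     # treated as opaque bytes

# S2 silently loses the trailing zero bytes of each record.
stripped = as_strings.tolist()

# V2 returns every byte of every record.
intact = [record.tobytes() for record in as_bytes]
```

Here stripped is [b'\x01', b'\x02', b''] while intact keeps all six bytes.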

4) User has a stupendous number (billions) of short strings which are
mostly but not entirely ASCII and wants to manipulate them as strings.
-> Not sure how to solve this. Maybe an object array with byte strings for
storage and encoding information in the dtype, allowing transparent
decoding? Or a fixed-storage-size array with a one-byte encoding that can
cope with all the characters the user will ever want to use?
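
As a rough illustration of the second option, this compares the current UCS-4 storage with a one-byte encoding using existing numpy functions. The word list and the choice of latin-1 are assumptions for the example; the right encoding depends on which characters the user actually needs.

```python
import numpy as np

words = ["spam", "eggs", "café", "naïve"] * 1000

ucs4 = np.array(words)                      # U5: 4 bytes per character
latin1 = np.char.encode(ucs4, "latin-1")    # S5: 1 byte per character

# Decoding gives back the original strings.
roundtrip = np.char.decode(latin1, "latin-1")

# A one-byte encoding only works if it covers every character in use:
# latin-1 has no euro sign, latin-9 (ISO-8859-15) does.
try:
    "€".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
euro_latin9 = "€".encode("iso8859-15")
```

With these inputs ucs4.nbytes is 80000 and latin1.nbytes is 20000, a factor of four, but the user has to decode explicitly before working with the strings.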

5) User has a bunch of mystery-encoding strings(?) and wants to store them
in a numpy array.
-> If they're python strings already, no further harm is done by treating
this as case 1 when in python-land. If they need to be in fixed-width
fields for communication with an external program or library, this puts us
in case 2, unknown encoding variety; user will have to pick an encoding
that the external program is likely to be able to cope with; this may be
the one that originated the mystery strings in the first place.

6) User has python strings and wants to store them in non-object numpy
arrays for some reason but doesn't care about the actual memory layout.
-> Solvable with the current setup; fixed-width UCS-4 fields, padded with
Unicode NULL. Happily, this comes for free from arbitrary-encoding
fixed-storage-size dtypes, though a friendlier interface might be nice.
Also allows people to use UCS-2 or ASCII if they know their strings fit.
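
The current layout is easy to inspect by viewing a U array as 32-bit integers. This is only a sketch of what numpy does today; the sample strings are made up.

```python
import numpy as np

arr = np.array(["ab", "xyz"])   # dtype U3: three UCS-4 code units per element

bytes_per_element = arr.dtype.itemsize   # 3 characters * 4 bytes

# Reinterpret the buffer as code points, one row per string.
code_points = arr.view(np.uint32).reshape(len(arr), -1)
```

Here code_points is [[97, 98, 0], [120, 121, 122]]: the short string is padded with a zero code point, and every character costs four bytes regardless of its value.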

7) User has data in one binary format and it needs to go into another, with
perhaps casual inspection while in python-land. Such data is mostly ASCII
but might contain mystery characters; presenting gobbledygook to the user
is okay as long as the characters are output intact.
-> Reading and writing as a fixed-width one-byte encoding, preferably one
resembling the one the data is actually in, should work here. UTF-8 is
likely to mangle the data; ASCII-with-surrogateescape might do okay. The
key thing here is that both input and output files will have their own ways
of specifying string length and their own storage specifiers; user must
know these, and someone has to know and specify what to do with strings
that don't fit. Simple truncation can cut a multi-byte UTF-8 sequence in
half if the data is not known to be UTF-8, but there's maybe not much that
can be done about that.
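
A small sketch of the points in case 7, in plain Python. The sample bytes are made up and stand for data whose real encoding (latin-1 here) is unknown to the person reading it.

```python
raw = b"caf\xe9 ok"   # actually latin-1, but the reader does not know that

# ASCII with surrogateescape: the odd byte becomes a lone surrogate that
# looks like gobbledygook but survives the trip back to bytes.
text = raw.decode("ascii", errors="surrogateescape")
restored = text.encode("ascii", errors="surrogateescape")

# UTF-8 with errors="replace": the odd byte is replaced by U+FFFD and the
# original value is lost on output.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

# Byte-wise truncation of real UTF-8 can split a multi-byte character.
utf8 = "café".encode("utf-8")   # 5 bytes: the 'é' takes two
truncated = utf8[:4]
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False
```

Here restored equals raw, lossy does not, and truncation_ok is False because the cut falls in the middle of the two-byte sequence for 'é'.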

I guess my point is that a use case should specify:
* Where does the data come from (i.e. in what format)?
* Are there memory constraints in the storage format?
* How should access look to the user? In particular, what should misencoded
data look like?
* Where does the data need to go?

Anne