[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 13:54:39 EDT 2017

On Tue, Apr 25, 2017 at 6:36 PM Chris Barker <chris.barker at noaa.gov> wrote:

>
> This is essentially my rant about use-case (2):
>
> A compact dtype for mostly-ascii text:
>

I'm a little confused about exactly what you're trying to do. Do you need
your in-memory format for this data to be compatible with anything in
particular?

If you're not reading or writing files in this format, then it's just a
matter of storing a whole bunch of things that are already Python strings
in memory. Could you use an object array? Or do you have so many of them
that you need a more compact, fixed-stride memory layout?
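
To make the trade-off concrete, here is a rough sketch (my own toy data,
not anything from Chris's use case) comparing an object array with the
existing fixed-width "S" and "U" dtypes. The word list and variable names
are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: short, pure-ASCII strings.
words = ["alpha", "beta", "gamma"]

# Object array: each element is a reference to an ordinary Python str,
# so the array itself only holds pointers.
obj = np.array(words, dtype=object)

# Fixed-width bytes: 5 bytes per element, stored contiguously.
fixed = np.array([w.encode("ascii") for w in words], dtype="S5")

# Fixed-width unicode: 4 bytes per character, so 20 bytes per element.
uni = np.array(words)

print(obj.dtype, obj.itemsize)      # object dtype, pointer-sized items
print(fixed.dtype, fixed.itemsize)  # 5 bytes per item
print(uni.dtype, uni.itemsize)      # 20 bytes per item
```

The object array holds the original Python strings untouched, while the
"S5" array is the kind of compact, fixed-stride layout in question, at a
quarter of the per-element size of the "U5" array.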

Presumably you're getting byte strings (with no NULs) from somewhere and
need to store them in this memory structure in a way that makes them as
usable as possible in spite of their unknown encoding. The thing to do,
then, is probably to copy them in there as-is and use .astype to arrange
for Python to decode them when accessed. So this is precisely the problem
of "how should I decode random byte strings?" that Python has been
struggling with. My impression is that the community has established that
there's no one solution that makes everyone happy, but that most people
can cope with some combination of picking a one-byte encoding,
ascii-with-surrogateescapes, zapping bogus characters, and giving wrong
results. But I think that all the standard Python alternatives are needed,
both in general and for interpreting numpy arrays full of bytes.
Clearly your preferred solution is .astype("string[latin-9]"), but just as
clearly that's not going to work for everyone.
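
For reference, the standard alternatives I have in mind look roughly like
this in plain Python. This is only a sketch: the sample bytes are invented,
and "string[latin-9]" above is the proposed dtype spelling, not something
that exists, so the example sticks to bytes.decode.

```python
# Hypothetical input: bytes whose encoding we do not know.
raw = b"caf\xe9 \xa4 100"

# 1. Pick a one-byte encoding. Decoding never fails, but the result
#    may be wrong: byte 0xA4 differs between Latin-1 and Latin-9.
latin1 = raw.decode("latin-1")
latin9 = raw.decode("iso8859-15")   # ISO 8859-15, also known as Latin-9

# 2. ASCII with surrogateescape. Non-ASCII bytes become lone
#    surrogates, and the original bytes can be recovered exactly.
escaped = raw.decode("ascii", errors="surrogateescape")
roundtrip = escaped.encode("ascii", errors="surrogateescape")

# 3. Zap bogus characters, either by replacing or by dropping them.
#    Both are lossy.
replaced = raw.decode("ascii", errors="replace")
ignored = raw.decode("ascii", errors="ignore")

print(repr(latin1))
print(repr(latin9))
print(roundtrip == raw)
print(repr(replaced))
print(repr(ignored))
```

Only the surrogateescape route preserves the original bytes; the one-byte
encodings always produce something, which is exactly how you end up giving
wrong results without noticing.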

If your question is "what should numpy's default string dtype be?", well,
maybe default to object arrays; anyone who just has a bunch of Python
strings to store is unlikely to be surprised by this. Someone with more
specific needs will choose a more specific (that is, non-default) string
data type.

Anne