[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker - NOAA Federal chris.barker at noaa.gov
Tue Apr 25 18:47:46 EDT 2017

A compact dtype for mostly-ascii text:

I'm a little confused about exactly what you're trying to do.

Actually, *I* am not trying to do anything here -- I'm the one who said
computers are so big and fast now that we shouldn't whine about 4 bytes for
a character... but this whole conversation started with that request, and
I have sympathy: no one likes to waste memory. After all, numpy supports
small numeric dtypes, too.
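
To put a number on the "4 bytes for a character" point, here is a minimal sketch comparing numpy's existing unicode ('U') and bytes ('S') dtypes. The sample words and variable names are made up for illustration; this is not the proposed new dtype, just what numpy offers today.

```python
import numpy as np

words = ["alpha", "beta", "gamma"]  # illustrative data only

# The unicode dtype stores UCS-4: 4 bytes per character.
as_unicode = np.array(words, dtype="U5")

# The bytes dtype stores 1 byte per character.
as_bytes = np.array([w.encode("ascii") for w in words], dtype="S5")

print(as_unicode.itemsize, as_unicode.nbytes)  # 20 60
print(as_bytes.itemsize, as_bytes.nbytes)      # 5 15
```

For pure-ascii data the two arrays hold the same information, so the factor of four is all overhead.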

Do you need your in-memory format for this data to be compatible with
anything in particular?

Not for this requirement -- binary interchange is another requirement.

If you're not reading or writing files in this format, then it's just a
matter of storing a whole bunch of things that are already python strings
in memory. Could you use an object array? Or do you have an enormous number
so that you need a more compact, fixed-stride memory layout?

That's the whole point, yes. Object arrays would be a good solution to the
full Unicode problem, not to the "why am I wasting so much space when all my
data are ascii?" problem.

Presumably you're getting byte strings (with unknown encoding).

No -- this is for creating and using mostly ascii string data with python
and numpy.

Unknown encoding bytes belong in byte arrays -- they are not text.

I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with
a few extra characters" data. With all the sloppiness over the years, there
are way too many files like that.

Note: the primary use-case I have in mind is working with ascii text in
numpy arrays efficiently -- folks have called for that. All I'm saying is:
use Latin-1 instead of ascii -- that buys you some useful extra characters.
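
A rough sketch of what "Latin-1 instead of ascii" buys, using only what numpy has now: encode by hand into an 'S' array and decode with np.char.decode when text is needed. The sample names are made up, and the manual encode/decode step is exactly the friction a dedicated one-byte text dtype would presumably remove.

```python
import numpy as np

names = ["café", "naïve", "plain"]  # illustrative data only

# Latin-1 maps each of these characters to a single byte,
# so a 5-byte field is enough for every entry.
packed = np.array([s.encode("latin-1") for s in names], dtype="S5")

# Decode back to a unicode array when text operations are needed.
unpacked = np.char.decode(packed, "latin-1")

# Plain ascii cannot represent the accented characters at all.
try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(packed.itemsize, unpacked.tolist(), ascii_ok)
```

The storage cost is the same as ascii, one byte per character, but the accented characters survive the round trip.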

If your question is "what should numpy's default string dtype be?", well,
maybe default to object arrays;

Or UCS-4.

I think object arrays would be more problematic for npz storage and raw
"tostring" dumping. (And pickle?) Not sure how important that is.
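
For what it's worth, here is a small sketch of the storage difference, assuming np.save behaves the same way as the npz path for this purpose. A fixed-width string array saves without pickle; an object array does not. The arrays and variable names are illustrative, and tobytes() is the current spelling of the older tostring().

```python
import io
import numpy as np

fixed = np.array(["alpha", "beta"], dtype="U5")
obj = np.array(["alpha", "beta"], dtype=object)

# Fixed-width strings are plain data, so no pickling is required.
buf = io.BytesIO()
np.save(buf, fixed, allow_pickle=False)

# Object arrays can only be stored by pickling their elements.
try:
    np.save(io.BytesIO(), obj, allow_pickle=False)
    object_saved = True
except ValueError:
    object_saved = False

# A raw dump of the fixed-width array is just its character data.
raw = fixed.tobytes()
print(object_saved, len(raw))  # False 40
```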

And it would be good to have something that plays well with recarrays.
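
As a sketch of why a fixed-width string type fits recarrays, here is a record with a one-byte-per-character name field next to a float. The field names and values are hypothetical; the point is only that each record has a fixed, compact size.

```python
import numpy as np

# Hypothetical record layout: an 8-byte name plus a float64.
rec_dtype = np.dtype([("name", "S8"), ("value", "f8")])

records = np.array(
    [(b"alpha", 1.0), (b"beta", 2.0)], dtype=rec_dtype
).view(np.recarray)

# The same layout with a UCS-4 name field, for comparison.
wide_dtype = np.dtype([("name", "U8"), ("value", "f8")])

print(rec_dtype.itemsize, wide_dtype.itemsize)  # 16 40
print(records.name.tolist(), records.value.tolist())
```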

anyone who just has a bunch of python strings to store is unlikely to be
surprised by this. Someone with more specific needs will choose a more
specific - that is, not default - string data type.

