I'm a little confused about exactly what you're trying to do.
Actually, *I* am not trying to do anything here -- I'm the one who said computers are so big and fast now that we shouldn't whine about 4 bytes for a character... but this whole conversation started with that request, and I have sympathy: no one likes to waste memory. After all, numpy supports small numeric dtypes, too.
Do you need your in-memory format for this data to be compatible with anything in particular?
Not for this requirement -- binary interchange is another requirement.
If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number so that you need a more compact, fixed-stride memory layout?
That's the whole point, yes. Object arrays would be a good solution to the full Unicode problem, not to the "why am I wasting so much space when all my data are ascii?" problem.
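To put rough numbers on that, here is a minimal sketch comparing the per-element storage of numpy's existing string dtypes for ASCII-only data. The sample words are made up for illustration, and the object array's figure counts only the pointers, not the Python string objects they refer to.

```python
import numpy as np

# Hypothetical ASCII-only sample data, longest word is 5 characters.
words = ["numpy", "ascii", "text", "array"] * 1000

as_ucs4 = np.array(words)                  # 'U5': 4 bytes per character
as_bytes = np.array(words, dtype="S5")     # 'S5': 1 byte per character
as_object = np.array(words, dtype=object)  # one pointer per element

print("U5 itemsize:", as_ucs4.itemsize, "total:", as_ucs4.nbytes)
print("S5 itemsize:", as_bytes.itemsize, "total:", as_bytes.nbytes)
# For the object array, nbytes covers the pointers only; the str
# objects themselves live elsewhere on the heap.
print("object itemsize:", as_object.itemsize)
```

The fixed-width byte dtype is four times smaller than the UCS-4 one for the same text, which is the waste being complained about.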
Presumably you're getting byte strings (with unknown encoding).
No -- this is for creating and using mostly ascii string data with python and numpy.
Unknown encoding bytes belong in byte arrays -- they are not text.
I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
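As an illustration of what Latin-1 buys over ascii, here is a small sketch using today's numpy: the strings are encoded to Latin-1 by hand and stored in a one-byte-per-character `S` array. This is only an approximation of the idea, not a proposed dtype, and the sample names are invented.

```python
import numpy as np

# Hypothetical "mostly ascii, with a few extra characters" data.
names = ["café", "naïve", "plain"]

# Latin-1 maps each of these characters to exactly one byte,
# so the array still uses one byte per character.
encoded = np.array([s.encode("latin-1") for s in names])
decoded = np.char.decode(encoded, "latin-1")

# Plain ascii cannot represent the accented characters at all.
try:
    "café".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(encoded.dtype.itemsize, decoded.tolist(), ascii_ok)
```

The storage cost is identical to an ascii array; the only difference is that bytes 128-255 now have a defined meaning.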
If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays;
Or UCS-4.
I think object arrays would be more problematic for npz storage and raw "tostring" dumping. (And pickle?) Not sure how important that is.
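A sketch of the storage concern, assuming a reasonably recent numpy where `np.load` accepts `allow_pickle` and `tobytes` is the current name for the older `tostring`. The helper `roundtrip` is made up for this example.

```python
import io
import numpy as np

fixed = np.array(["abc", "de"])                # fixed-width 'U3'
boxed = np.array(["abc", "de"], dtype=object)  # pointers to str objects

def roundtrip(arr):
    """Save to an in-memory .npz and load it back without pickle."""
    buf = io.BytesIO()
    np.savez(buf, data=arr)
    buf.seek(0)
    with np.load(buf, allow_pickle=False) as npz:
        return npz["data"]

print(roundtrip(fixed).tolist())

# Object arrays are stored via pickle, so a pickle-free load refuses them.
try:
    roundtrip(boxed)
    object_loaded = True
except ValueError:
    object_loaded = False
print("object array loaded without pickle:", object_loaded)

# A raw dump of the fixed-width array is the character data itself;
# for the object array it would only be memory addresses.
raw = fixed.tobytes()
print(len(raw))
```

So the fixed-width array survives both paths as plain data, while the object array depends on pickle for anything persistent.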
And it would be good to have something that plays well with recarrays.
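For the recarray case, a minimal sketch of what "plays well" looks like with the existing fixed-width byte dtype. The field names and values are invented for illustration.

```python
import numpy as np

# Hypothetical record layout: an 8-byte text field plus a float64.
dt = np.dtype([("name", "S8"), ("value", "f8")])

table = np.array([(b"alpha", 1.5), (b"beta", 2.5)], dtype=dt).view(np.recarray)

# Each record has a fixed size, so the whole table is one flat buffer.
print(dt.itemsize)
print(table.name[0], table.value[1])
```

A fixed-stride string type slots into a record like any numeric field, which an object field does not do nearly as cleanly.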
anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific - that is, not default - string data type.
Exactly.
-CHB