[Numpy-discussion] proposal: smaller representation of string arrays

Tue Apr 25 12:57:02 EDT 2017

2017-04-25 12:34 GMT-04:00 Chris Barker <chris.barker at noaa.gov>:
> I am totally euro-centric, but as I understand it, that is the whole point
> of the desire for a compact one-byte-per character encoding. If there is a
> strong need for other 1-byte encodings (shift-JIS, maybe?) then maybe we
> should support that. But this all started with "mostly ascii". My take on
> that is:

But Shift-JIS is not a one-byte encoding; it's two-byte (unless you
allow only half-width characters, i.e. ASCII and half-width katakana,
and nothing else). :-) In fact legacy CJK encodings are all nominally
two-byte (so that the width of a character's internal representation
matches that of its visual representation).
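
A minimal sketch of that point, assuming Python 3 and its built-in
"shift_jis" codec. The sample characters and the helper name
sjis_width are only illustrative:

```python
# Sample characters chosen for illustration (not from the thread).
samples = {
    "A": "ASCII letter",
    "ｱ": "half-width katakana",
    "ア": "full-width katakana",
    "漢": "kanji",
}


def sjis_width(ch):
    """Return how many bytes `ch` occupies when encoded as Shift-JIS."""
    return len(ch.encode("shift_jis"))


# Only the half-width characters fit in one byte; the rest take two.
widths = {ch: sjis_width(ch) for ch in samples}

for ch, label in samples.items():
    print(f"{label:22s} {ch!r}: {widths[ch]} byte(s)")
```

So a fixed one-byte-per-character dtype could hold Shift-JIS text only
if it were restricted to the half-width subset.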

>  - filenames
>
> File names are one of the key reasons folks struggled with the python3 data
> model (particularly on *nix) and why 'surrogateescape' was added. It's
> pretty common to store filenames in with our data, and thus in numpy arrays
> -- we need to preserve them exactly and display them mostly right. Again,
> euro-centric, but if you are euro-centric, then latin-1 is a good choice for
> this.

This I don't understand. As far as I can tell, non-Western-European
filenames are not unusual. If filenames are a reason, then even if
you're euro-centric (think Eastern Europe, say) I don't see how latin-1
is a good choice.
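
To make the concern concrete, here is a rough sketch in Python 3. The
filenames are made up for illustration and fits_latin1 is a
hypothetical helper, not anything in numpy:

```python
def fits_latin1(name):
    """Return True if every character of `name` exists in latin-1."""
    try:
        name.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical filenames: French, Polish, Russian.
names = ["café.txt", "łódź.txt", "отчёт.txt"]
results = {name: fits_latin1(name) for name in names}

# latin-1 maps all 256 byte values, so raw bytes do survive a
# decode/encode round trip exactly ...
raw = "łódź.txt".encode("utf-8")
as_latin1 = raw.decode("latin-1")
round_trips = as_latin1.encode("latin-1") == raw

# ... but the decoded text is not the name the user would recognise.
displays_correctly = as_latin1 == "łódź.txt"

print(results, round_trips, displays_correctly)
```

In other words, latin-1 can preserve the bytes of such a filename, but
it cannot represent or display the Polish or Russian name as text.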

Lurker here, and I haven't touched numpy in ages. So I might be
blurting out nonsense.

-- 
Ambrose Li // http://o.gniw.ca / http://gniw.ca
If you saw this on CE-L: You do not need my permission to quote
me, only proper attribution. Always cite your sources, even if
you have to anonymize and/or cite it as "personal communication".