[Numpy-discussion] proposal: smaller representation of string arrays
chris.barker at noaa.gov
Thu Apr 20 13:28:14 EDT 2017
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted at gmail.com> wrote:
> Is there any reason not to support all Unicode encodings that python does,
> with the same names and semantics? This would surely be the simplest to
I think it should support all fixed-length encodings, but not the non-fixed
length ones -- they just don't fit well into the numpy data model.
> Also, if latin1 is going to be the only practical 8-bit encoding, maybe
> check with some non-Western users to make sure it's not going to wreck
> their lives? I'd have selected ASCII as an encoding to treat specially, if
> any, because Unicode already does that and the consequences are familiar.
> (I'm used to writing and reading French without accents because it's passed
> through ASCII, for example.)
latin-1 (or latin-9) only makes things better than ASCII -- it buys most of
the accented characters for the European languages and some symbols that are
nice to have (I use the degree symbol a lot...). And it is ASCII compatible
-- so there is NO reason to choose ASCII over Latin-*
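A minimal sketch of the latin-1 point, using only Python's built-in codecs (the sample strings are made up for illustration): every character takes exactly one byte, the degree symbol and accented letters are representable, and pure-ASCII text encodes to the same bytes either way.

```python
# Illustrative strings only; any latin-1 representable text behaves the same.
text = "45°N café"
latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")

# latin-1 is strictly one byte per character.
one_byte_per_char = len(latin1) == len(text)

# UTF-8 needs two bytes each for '°' and 'é'.
extra_utf8_bytes = len(utf8) - len(text)

# ASCII-only text yields identical bytes under both encodings.
ascii_only = "station_42"
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("latin-1")
```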
Which does no good for non-latin languages -- so we need to hear from the
community -- is there a substantial demand for a non-latin one-byte per
> Variable-length encodings, of which UTF-8 is obviously the one that makes
> good handling essential, are indeed more complicated. But is it strictly
> necessary that string arrays hold fixed-length *strings*, or can the
> encoding length be fixed instead? That is, currently if you try to assign a
> longer string than will fit, the string is truncated to the number of
> characters in the data type.
we could do that, yes, but an improperly truncated "string" becomes invalid
-- just seems like a recipe for bugs that won't be found in testing.
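To make the failure mode concrete, here is a small sketch assuming a hypothetical 3-byte field: cutting a UTF-8 byte string at a fixed byte count can land in the middle of a multi-byte character, and the result no longer decodes.

```python
# Hypothetical 3-byte field; the name is a made-up example.
FIELD_BYTES = 3

name = "Zoë"
encoded = name.encode("utf-8")       # 'ë' takes two bytes, so 4 bytes total
truncated = encoded[:FIELD_BYTES]    # cuts 'ë' in half

try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The truncated bytes are not valid UTF-8 any more.
    decoded_ok = False
```

An ASCII-only name of the same length would have survived the same cut, which is why this kind of bug tends to slip past tests.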
memory is cheap, compressing is fast -- we really shouldn't get hung up on
Note: if you are storing a LOT of text (in which case I have no idea why you
would use numpy anyway), then the memory size might matter, but then
semi-arbitrary truncation would probably matter, too.
I expect most text storage in numpy arrays is things like names of
datasets, ids, etc, etc -- not massive amounts of text -- so storage space
really isn't critical. But having an id or something unexpectedly truncated
could be bad.
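For reference, this is what silent truncation looks like with today's fixed-width unicode dtype. The ids are invented for illustration; the behaviour shown is numpy's existing assignment semantics, not anything from the proposal.

```python
import numpy as np

# Array whose dtype holds at most 7 characters per element.
ids = np.array(["stn-001", "stn-002"], dtype="U7")

# Assigning a longer string raises no error; the tail is dropped.
ids[1] = "stn-002-backup"
stored = str(ids[1])
```

After the assignment, the stored value is indistinguishable from the original id, with no warning that anything was lost.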
I think practical experience has shown us that people do not handle "mostly
fixed length but once in a while not" text well -- see the nightmare of
UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors
are far more likely to be found in tests (why would you use utf-8 if all
your data are in ascii???). but still -- why invite hard to test for errors?
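The UTF-16 case can be sketched in a few lines: almost every character is one 16-bit code unit, but characters outside the Basic Multilingual Plane take two, so code that assumes "one unit per character" works until it meets one. The helper name below is hypothetical.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

common = "é"            # inside the Basic Multilingual Plane
rare = "\U0001F600"     # outside it, needs a surrogate pair

common_units = utf16_units(common)
rare_units = utf16_units(rare)
```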
Final point -- as Julian suggests, one reason to support utf-8 is for
interoperability with other systems -- but that makes errors more of an
issue -- if it doesn't pass through the numpy truncation machinery, invalid
data could easily get put in a numpy array.
> it would allow UTF-8 to be used just the way it usually is - as an
> encoding that's almost 8-bit.
ouch! that perception is the route to way too many errors! it is by no
means almost 8-bit, unless your data are almost ascii -- in which case, use
latin-1 for pity's sake!
This highlights my point though -- if we support UTF-8, people WILL use it,
and only test it with mostly-ascii text, and not find the bugs that will
crop up later.
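A rough illustration of how a mostly-ascii test can pass while real data fails, assuming a hypothetical 8-byte field and a made-up helper `fits`:

```python
# Hypothetical fixed byte budget per element.
FIELD_BYTES = 8

def fits(text, nbytes):
    """True if text encodes to at most nbytes bytes of UTF-8."""
    return len(text.encode("utf-8")) <= nbytes

# The kind of value a quick test is likely to use: 7 characters, 7 bytes.
ascii_case = fits("station", FIELD_BYTES)

# Six Greek characters, but each needs more than one byte in UTF-8.
greek_case = fits("Ελλάδα", FIELD_BYTES)
```

The shorter string is the one that does not fit, which is exactly the kind of surprise a character-count intuition hides.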
> All this said, it seems to me that the important use cases for string
> arrays involve interaction with existing binary formats, so people who have
> to deal with such data should have the final say. (My own closest approach
> to this is the FITS format, which is restricted by the standard to ASCII.)
yup -- not sure we'll get much guidance here though -- netcdf does not solve
this problem well, either.
But if you are pulling, say, a utf-8 encoded string out of a netcdf file --
it's probably better to pull it out as bytes and pass it through the python
decoding/encoding machinery than pasting the bytes straight to a numpy
array and hoping that the encoding and truncation are correct.
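A sketch of that route, with a literal byte string standing in for whatever a file-format library would hand back (no netcdf API is used or implied here):

```python
import numpy as np

# Stand-in for raw bytes read from a file; not a real netcdf call.
raw = "Zoë Müller".encode("utf-8")

# Decoding in Python validates the bytes: bad UTF-8 raises
# UnicodeDecodeError here instead of ending up inside an array.
text = raw.decode("utf-8")

# Let numpy size the unicode dtype from the decoded string,
# so nothing is truncated.
arr = np.array([text])
n_chars = arr.dtype.itemsize // 4   # numpy stores 4 bytes per character
```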
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov