I'm no unicode expert, but can't we truncate unicode strings so that only valid characters are included?
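
For what it's worth, here is a rough Python sketch of what I mean, assuming each element has a fixed byte budget. The helper name `truncate_utf8` is made up for illustration; it is not an existing numpy function.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode text as UTF-8, cut to at most max_bytes, never splitting a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # For valid input the only possible damage is an incomplete multi-byte
    # sequence at the very end; errors="ignore" drops those trailing bytes.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


# "caf\u00e9" is 5 bytes in UTF-8 because the accented e takes two bytes.
print(truncate_utf8("caf\u00e9", 4))
print(truncate_utf8("caf\u00e9", 5))
```

The cost is that the number of characters that fit depends on the content, so two strings of equal length may be truncated differently.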

On Thu, Apr 20, 2017 at 1:32 PM Chris Barker <chris.barker@noaa.gov> wrote:
On Thu, Apr 20, 2017 at 9:47 AM, Anne Archibald <peridot.faceted@gmail.com> wrote:
Is there any reason not to support all Unicode encodings that python does, with the same names and semantics? This would surely be the simplest to understand. 

I think it should support all fixed-length encodings, but not the variable-length ones -- they just don't fit well into the numpy data model.

Also, if latin1 is going to be the only practical 8-bit encoding, maybe check with some non-Western users to make sure it's not going to wreck their lives? I'd have selected ASCII as an encoding to treat specially, if any, because Unicode already does that and the consequences are familiar. (I'm used to writing and reading French without accents because it's passed through ASCII, for example.) 

latin-1 (or latin-9) only makes things better than ASCII -- it buys most of the accented characters for the European languages and some symbols that are nice to have (I use the degree symbol a lot...). And it is ASCII compatible -- so there is NO reason to choose ASCII over Latin-*.
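
A quick way to see both properties with Python's standard codecs (the sample strings are just illustrations):

```python
text = "25\u00b0C"  # contains the degree sign, U+00B0

latin1 = text.encode("latin-1")
print(len(text), len(latin1))  # one byte per character

# ASCII-only text produces identical bytes under both encodings.
same_bytes = "numpy".encode("latin-1") == "numpy".encode("ascii")
print(same_bytes)

# ASCII cannot represent the degree sign at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```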

Which does no good for non-latin languages -- so we need to hear from the community -- is there a substantial demand for a non-latin one-byte-per-character encoding? 

Variable-length encodings, of which UTF-8 is obviously the one that makes good handling essential, are indeed more complicated. But is it strictly necessary that string arrays hold fixed-length *strings*, or can the encoding length be fixed instead? That is, currently if you try to assign a longer string than will fit, the string is truncated to the number of characters in the data type.
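
For reference, the current truncation behaviour with numpy's fixed-width unicode dtype looks like this (a minimal sketch; the array contents are arbitrary):

```python
import numpy as np

names = np.array(["abc", "de"], dtype="U3")  # room for 3 characters per element
names[0] = "abcdef"                            # longer than the dtype allows

print(names[0])        # cut to the first 3 characters, with no error raised
print(names.itemsize)  # bytes per element: 4 bytes per character
```

Because the `U` dtype stores 4 bytes per character, the cut always lands on a character boundary.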

we could do that, yes, but an improperly truncated "string" becomes invalid -- just seems like a recipe for bugs that won't be found in testing.
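
To make the failure mode concrete, here is a small sketch of cutting UTF-8 at a fixed byte count. The helper `survives_byte_cut` is hypothetical, purely for illustration:

```python
def survives_byte_cut(text: str, nbytes: int) -> bool:
    """Return True if the UTF-8 bytes of text are still valid after a cut at nbytes."""
    cut = text.encode("utf-8")[:nbytes]
    try:
        cut.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII-only test data passes, so a test suite would not notice anything.
print(survives_byte_cut("naive", 3))

# The same cut splits the two-byte i-diaeresis in half and leaves invalid data.
print(survives_byte_cut("na\u00efve", 3))
```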

memory is cheap, compressing is fast -- we really shouldn't get hung up on this!

Note: if you are storing a LOT of text (for which I have no idea why you would use numpy anyway), then the memory size might matter, but then semi-arbitrary truncation would probably matter, too.

I expect most text storage in numpy arrays is things like names of datasets, ids, etc. -- not massive amounts of text -- so storage space really isn't critical. But having an id or something unexpectedly truncated could be bad.

I think practical experience has shown us that people do not handle "mostly fixed length but once in a while not" text well -- see the nightmare of UTF-16 on Windows. Granted, utf-8 is multi-byte far more often, so errors are far more likely to be found in tests (why would you use utf-8 if all your data are in ascii???). But still -- why invite hard-to-test-for errors?
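
A short sketch of why UTF-16 bugs hide so well: most characters take one 16-bit code unit, and only characters outside the Basic Multilingual Plane need two. The sample characters are arbitrary choices.

```python
samples = {
    "a": "a",
    "e-acute": "\u00e9",
    "CJK": "\u4e2d",
    "emoji": "\U0001F600",
}

units = {}
for label, ch in samples.items():
    utf16_units = len(ch.encode("utf-16-le")) // 2  # count of 16-bit code units
    utf8_bytes = len(ch.encode("utf-8"))
    units[label] = (utf16_units, utf8_bytes)
    print(f"{label:8} UTF-16 units: {utf16_units}  UTF-8 bytes: {utf8_bytes}")
```

Only the last sample needs a second UTF-16 unit, while UTF-8 already goes multi-byte at the second one.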

Final point -- as Julian suggests, one reason to support utf-8 is for interoperability with other systems -- but that makes errors more of an issue -- if it doesn't pass through the numpy truncation machinery, invalid data could easily get put in a numpy array.


 it would allow UTF-8 to be used just the way it usually is - as an encoding that's almost 8-bit. 

ouch! that perception is the route to way too many errors! it is by no means almost 8-bit, unless your data are almost ascii -- in which case, use latin-1 for pity's sake!

This highlights my point though -- if we support UTF-8, people WILL use it, and only test it with mostly-ascii text, and not find the bugs that will crop up later.

All this said, it seems to me that the important use cases for string arrays involve interaction with existing binary formats, so people who have to deal with such data should have the final say. (My own closest approach to this is the FITS format, which is restricted by the standard to ASCII.)

yup -- not sure we'll get much guidance here though -- netcdf does not solve this problem well, either.

But if you are pulling, say, a utf-8 encoded string out of a netcdf file -- it's probably better to pull it out as bytes and pass it through the python decoding/encoding machinery than to paste the bytes straight into a numpy array and hope that the encoding and truncation are correct.
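
Roughly what I have in mind, as a sketch only. The list `raw_values` is a stand-in for bytes read from a file; no actual netcdf call is shown.

```python
import numpy as np

# Stand-in for byte strings pulled out of a file (assumed to be UTF-8).
raw_values = ["S\u00e3o Paulo".encode("utf-8"), b"Seattle"]

# Decode with Python's codec machinery first: invalid bytes raise
# UnicodeDecodeError here instead of ending up inside the array.
decoded = [value.decode("utf-8") for value in raw_values]

# With no explicit dtype, numpy sizes the unicode dtype to the longest
# string, so nothing is truncated.
arr = np.array(decoded)
print(arr.dtype, arr)
```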



Christopher Barker, Ph.D.

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

NumPy-Discussion mailing list