[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker chris.barker at noaa.gov
Tue Apr 25 12:45:20 EDT 2017


On Mon, Apr 24, 2017 at 4:23 PM, Robert Kern <robert.kern at gmail.com> wrote:

> > My question: What are those non-ASCII characters? How often are they
> truly latin-1/9 vs. some other text encoding vs. non-string binary data?
>
> I don't know that we can reasonably make that accounting relevant. Number
> of such characters per byte of text? Number of files with such characters
> out of all existing files?
>

I have a lot of mostly-English text -- usually not strictly latin-1, but
usually mostly latin-1. The non-ASCII characters are a handful of accented
characters (usually from Spanish, some French), then a few "scientific"
characters: the degree symbol, the "micro" symbol.

I suspect that this is not an unusual pattern for mostly-english scientific
text.
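To make that concrete, here is a small sketch (the sample characters and the
fits_latin1 helper are my own illustration, not anything in numpy) showing
that these characters take one byte each in latin-1, versus two in utf-8 and
four in a UCS-4 representation like numpy's 'U' dtype:

```python
# Illustrative samples: accented letters plus the "scientific" characters.
samples = {
    "é": "e with acute",
    "ñ": "n with tilde",
    "°": "degree sign",
    "µ": "micro sign (U+00B5)",
}

for ch, name in samples.items():
    latin1_len = len(ch.encode("latin-1"))
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-le"))  # 4 bytes per character
    print(f"{name}: latin-1={latin1_len}, utf-8={utf8_len}, utf-32={utf32_len}")


def fits_latin1(s: str) -> bool:
    """Hypothetical helper: True if every character is representable in latin-1."""
    try:
        s.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True
```

Note that the Greek letter mu (U+03BC), which looks the same as the micro
sign, is not in latin-1, so fits_latin1 would return False for it.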

If it's non-string binary data, I know it -- and I'd use a bytes type.

I have two options -- try to detect the encoding properly, or use
_something_ and fix it up later. latin-1 is a great choice for the latter
option -- most of the text displays fine, and the wrong stuff is untouched,
so I can figure it out.
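A minimal sketch of that "decode now, fix up later" approach, assuming the
data turns out to be utf-8 (the function names and the sample string are
made up for illustration):

```python
def provisional_decode(raw: bytes) -> str:
    """Decode bytes of unknown encoding as latin-1; this never raises,
    because every byte value 0x00-0xFF maps to exactly one code point."""
    return raw.decode("latin-1")


def fix_up(text: str, real_encoding: str) -> str:
    """Undo the provisional decode once the real encoding is known."""
    return text.encode("latin-1").decode(real_encoding)


# Hypothetical input: utf-8 bytes that arrived without an encoding label.
raw = "Niño 25 °C".encode("utf-8")

text = provisional_decode(raw)
print(text)                    # ASCII parts readable, the rest is mojibake
print(fix_up(text, "utf-8"))   # original text recovered
```

The key property is that text.encode("latin-1") always gives back exactly
the original bytes, so nothing is lost by guessing wrong the first time.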

> What I can say with assurance is that every time I have decided, as a
> developer, to write code that just hardcodes latin-1 for such cases, I have
> regretted it. While it's just personal anecdote, I think it's at least
> measuring the right thing. :-)
>

I've had the opposite experience -- so that's two anecdotes :-)

If it were, say, shift-jis, then yes, using latin-1 would be a bad idea, but
not really much worse than any other option other than properly decoding
it. In a way, using latin-1 is like the old py2 string -- it can be used as
text, even if it has arbitrary non-text garbage in it...
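A rough sketch of what I mean, using a made-up shift-jis sample: decoding
it as latin-1 gives unreadable text but keeps every byte, whereas a lossy
fallback such as ascii with errors="replace" throws the data away:

```python
# Hypothetical sample: Japanese text encoded as shift-jis.
original = "日本語"
raw = original.encode("shift_jis")

# latin-1: unreadable, but one character per byte and fully reversible.
garbled = raw.decode("latin-1")
round_tripped = garbled.encode("latin-1")

# A lossy alternative: undecodable bytes become U+FFFD and cannot be recovered.
lossy = raw.decode("ascii", errors="replace")

print(len(raw), len(garbled))                # same length: one char per byte
print(round_tripped == raw)                  # bytes survive the round trip
print(round_tripped.decode("shift_jis") == original)
print("\ufffd" in lossy)                     # replacement characters present
```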

-CHB

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov