[Numpy-discussion] proposal: smaller representation of string arrays

Chris Barker - NOAA Federal chris.barker at noaa.gov
Wed Apr 26 12:28:48 EDT 2017


> > I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.
>
> That sloppiness that you mention is precisely the "unknown encoding" problem.

Exactly -- but from a practicality-beats-purity perspective, there is
a difference between "I have no idea whatsoever" and "I know it is
mostly ascii, and European, but there are some extra characters in
there".

Latin-1 has proven very useful for that case.

I suppose in most cases ascii with errors='replace' would be a good
choice, but I'd still rather not throw out potentially useful
information.
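
To make that concrete (plain Python, nothing numpy-specific): Latin-1
maps every byte value 0-255 to a character, so decoding never fails
and encode/decode round-trips the original bytes losslessly, whereas
ascii with errors='replace' destroys the non-ascii bytes:

    raw = b"na\xefve"                      # "naïve" encoded as Latin-1
    raw.decode("ascii", errors="replace")  # 'na\ufffdve' -- byte lost
    raw.decode("latin-1")                  # 'naïve' -- byte preserved
    assert raw.decode("latin-1").encode("latin-1") == raw  # round trip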

> Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)

Yeah, I've been very unfocused in this discussion -- sorry about that.

> > Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
>
> For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-)

UTF-8 does not match the character-oriented Python text model. Plenty
of people argue that that isn't the "correct" model for Unicode text
-- maybe so, but it is the model Python 3 has chosen. I wrote a much
longer rant about that earlier.

So I think the easy-to-access numpy string dtypes -- and particularly
the defaults -- should match it.
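
To spell out the mismatch: Python's str is indexed by character, but
UTF-8 is a variable-width encoding, so byte offsets and character
offsets disagree as soon as you leave ascii -- which is exactly what a
fixed-byte-width dtype has to care about:

    s = "née"                 # three characters in Python's text model
    b = s.encode("utf-8")     # four bytes -- 'é' takes two bytes in UTF-8
    len(s), len(b)            # (3, 4)
    b[:2].decode("utf-8")     # UnicodeDecodeError: 'é' split mid-sequence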

It's become clear in this discussion that there is a strong desire to
support a numpy dtype that stores text in particular binary formats
(i.e. encodings). Rather than choose one or two, we might as well
support every encoding Python supports.

In that case, we'll have utf-8 for those that know they want that, and
we'll have latin-1 for those that incorrectly think they want that :-)

So what remains to decide is implementation, syntax, and defaults.
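
As a strawman for the syntax (to be clear: none of this exists in
numpy today, it's just a sketch of what "any encoding Python supports"
might look like), the stdlib codecs machinery already gives us codec
lookup and alias normalization for free:

    import codecs
    codecs.lookup("latin-1").name   # 'iso8859-1' -- aliases normalized
    codecs.lookup("utf8").name      # 'utf-8'

    # hypothetical dtype spelling, roughly:
    #   np.dtype('text[utf-8, 16]')   # 16-byte field, UTF-8 encoded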

Let's keep in mind that most of us on this list, and in this
discussion, are the folks that write interface code and the like. But
most numpy users are not as tuned in to the internals. So defaults
should be set to best support the more "naive" user.

> If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.

If we add every encoding known to man, someone is going to use Latin-1
to read unknown encodings. Indeed, as we've all pointed out, there is
no correct encoding with which to read unknown encodings.

Frankly, if we have UTF-8 under the hood, I think people are even MORE
likely to use it inappropriately -- it's quite scary how many people
think UTF-8 == Unicode, and that all you need to do is "use utf-8"
without changing any of the rest of your code. Oh, and once you've
done that, you can keep your existing ASCII-only tests and think you
have a working application :-)
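
The classic failure mode looks something like this (made-up helper,
but the pattern is everywhere):

    def load(field):
        return field.decode("utf-8")   # "we use utf-8 now"

    load(b"plain ascii")   # fine -- so every ASCII-only test passes
    load(b"caf\xe9")       # UnicodeDecodeError -- legacy Latin-1
                           # bytes are not valid UTF-8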

-CHB

