[Numpy-discussion] proposal: smaller representation of string arrays
robert.kern at gmail.com
Tue Apr 25 19:50:05 EDT 2017
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <
chris.barker at noaa.gov> wrote:
>> Presumably you're getting byte strings (with unknown encoding).
> No -- this is for creating and using mostly ascii string data with python
> Unknown encoding bytes belong in byte arrays -- they are not text.
You are welcome to try to convince Thomas of that. That is the status quo
for him, but he is finding that difficult to work with.
> I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii,
> with a few extra characters" data. With all the sloppiness over the years,
> there are way too many files like that.
The sloppiness you mention is precisely the "unknown encoding"
problem. Your previous advocacy has also touched on using latin-1 to decode
existing files with unknown encodings. If you want to advocate for
using latin-1 only for the creation of new data, maybe stop talking about
existing files? :-)
> Note: the primary use-case I have in mind is working with ascii text in
> numpy arrays efficiently -- folks have called for that. All I'm saying is
> use Latin-1 instead of ascii -- that buys you some useful extra characters.
For that use case, the alternative in play isn't ASCII, it's UTF-8, which
buys you a whole bunch of useful extra characters. ;-)
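
To make the trade-off concrete, here is a minimal sketch using only Python's built-in codecs. The sample strings and the helper name `encoded_size` are made up for illustration and are not part of any proposed dtype. It reports how many bytes each string needs under ASCII, Latin-1 and UTF-8, or None when the encoding cannot represent it at all.

```python
# Hypothetical sample strings, chosen only to illustrate the three encodings.
samples = ["naive", "naïve", "µm", "€", "日本"]


def encoded_size(text, encoding):
    """Return the encoded length in bytes, or None if the text cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for s in samples:
    sizes = {enc: encoded_size(s, enc) for enc in ("ascii", "latin-1", "utf-8")}
    print(repr(s), sizes)
```

Latin-1 stays at one byte per character but stops at U+00FF, so the euro sign and the CJK text are out of reach. UTF-8 can encode all of them, at the cost of a variable width per character.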
There are several use cases being brought forth here. Some involve file
reading, some involve file writing, and some involve in-memory
manipulation. Whatever change we make is going to impinge somehow on all of
the use cases. If all we do is add a latin-1 dtype for people to use to
create new in-memory data, then someone is going to use it to read existing
data in unknown or ambiguous encodings.
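
A small sketch of why that matters, again with plain Python codecs rather than any actual numpy dtype (the sample word is arbitrary). Every possible byte value is a valid Latin-1 character, so decoding as Latin-1 never fails. Data in some other encoding is silently misread instead of raising an error.

```python
# Every one of the 256 byte values decodes under Latin-1, so it cannot reject input.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("latin-1")

# Arbitrary sample text, written out as UTF-8 by some other program.
original = "café"
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 "succeeds" but yields the wrong text.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))

# The bytes themselves survive a Latin-1 round trip unchanged.
round_tripped = misread.encode("latin-1")

# By contrast, UTF-8 rejects bytes that were actually written as Latin-1.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True
```

The round trip is what makes Latin-1 attractive for pass-through of unknown bytes, and the silent misread is the ambiguity problem: nothing in the resulting array records that the text was never really Latin-1.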