On Tue, Apr 25, 2017 at 9:27 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Apr 25, 2017 at 5:50 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Tue, Apr 25, 2017 at 3:47 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
Presumably you're getting byte strings (with unknown encoding).
No -- this is for creating and using mostly ascii string data with python and numpy.
Unknown encoding bytes belong in byte arrays -- they are not text.
You are welcome to try to convince Thomas of that. That is the status quo for him, but he is finding that difficult to work with.
I DO recommend Latin-1 as a default encoding ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.
That sloppiness that you mention is precisely the "unknown encoding" problem. Your previous advocacy has also touched on using latin-1 to decode existing files with unknown encodings as well. If you want to advocate for using latin-1 only for the creation of new data, maybe stop talking about existing files? :-)
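To illustrate why latin-1 keeps getting pulled into the "unknown encoding" problem, here is a minimal sketch using only Python's built-in codecs. The sample string is an assumption chosen for illustration. Latin-1 assigns a character to every byte value, so decoding never fails and the bytes round-trip, but the resulting text is only right if the data really was Latin-1.

```python
# Bytes that were actually written as UTF-8; pretend the encoding is unknown.
raw = "café".encode("utf-8")

# Latin-1 maps every byte value 0-255 to a code point, so this never raises.
text = raw.decode("latin-1")

# The original bytes survive a round trip unchanged.
roundtrip = text.encode("latin-1")

print(repr(text))        # the accented character comes out as two wrong characters
print(roundtrip == raw)  # the bytes themselves are preserved
```

The round trip is lossless at the byte level, which is the attraction; the displayed text is wrong, which is the objection.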
Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
For that use case, the alternative in play isn't ASCII, it's UTF-8, which buys you a whole bunch of useful extra characters. ;-)
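The trade-off between the three encodings can be sketched by measuring encoded sizes. The helper name `encoded_sizes` and the sample strings are hypothetical, picked only for illustration; `None` means the codec cannot represent the string at all.

```python
def encoded_sizes(text):
    """Return the encoded byte length per codec, or None if not encodable."""
    sizes = {}
    for codec in ("ascii", "latin-1", "utf-8"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes

for sample in ("temperature", "25 °C", "π"):
    print(sample, encoded_sizes(sample))
```

Latin-1 stays at one byte per character but stops at code point 255; UTF-8 covers everything at the cost of a variable width.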
There are several use cases being brought forth here. Some involve file reading, some involve file writing, and some involve in-memory manipulation. Whatever change we make is going to impinge somehow on all of the use cases. If all we do is add a latin-1 dtype for people to use to create new in-memory data, then someone is going to use it to read existing data in unknown or ambiguous encodings.
The maximum length of a UTF-8 character is 4 bytes, so we could use that to size arrays by character length. The advantage over UTF-32 is that it is easily compressible, probably by a factor of 4 in many cases. That doesn't solve the in-memory problem, but does have some advantages on disk as well as making for easy display. We could compress it ourselves after encoding by truncation.
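One possible reading of the sizing and truncation idea, as a rough sketch: reserve 4 bytes per character as the worst case, and cut encoded data to a byte budget without splitting a character. The function names `utf8_worst_case` and `utf8_truncate` are hypothetical, not anything NumPy provides.

```python
def utf8_worst_case(n_chars):
    """Bytes needed to hold n_chars characters if each takes the 4-byte maximum."""
    return 4 * n_chars

def utf8_truncate(text, max_bytes):
    """Encode as UTF-8 and cut to at most max_bytes on a character boundary."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # Continuation bytes have the form 0b10xxxxxx. If the first dropped byte
    # is one, the cut falls inside a character, so back up to its lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

text = "µm"
print(len(text.encode("utf-8")))      # actual UTF-8 size
print(len(text.encode("utf-32-le")))  # fixed 4 bytes per character, no BOM
print(utf8_worst_case(len(text)))     # worst-case UTF-8 allocation
```

For mostly-ASCII data the actual UTF-8 size sits close to one byte per character, which is where the rough factor of 4 over UTF-32 comes from.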
Note that for terminal display we will want something supported by the system, which is another problem altogether. Let me break the problem down into four categories:

Storage -- hdf5, .npy, fits, etc.
Display -- ?
Modification -- editing
Parsing -- fits, etc.
There is probably no one solution that is optimal for all of those.
Chuck
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
quoting Julian

'''
I probably have formulated my goal with the proposal a bit better, I am not very interested in a repetition of which encoding to use debate. In the end what will be done allows any encoding via a dtype with metadata like datetime. This allows any codec (including truncated utf8) to be added easily (if python supports it) and allows sidestepping the debate. My main concern is whether it should be a new dtype or modifying the unicode dtype. Though the backward compatibility argument is strongly in favour of adding a new dtype that makes the np.unicode type redundant.
'''

I don't quite understand why this discussion goes in the direction of an either one XOR the other dtype. I thought the parameterized 1-byte encoding that Julian mentioned initially sounded useful. (I'm not sure I will use it much, but I also don't use float16.)

Josef
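To make the "dtype with metadata" idea concrete, here is a pure-Python sketch of a fixed-width string slot parameterized by codec, loosely analogous to how datetime dtypes carry a unit. `EncodedStringSpec` and its methods are hypothetical and are not an existing or proposed NumPy API; NUL padding is an assumption borrowed from how fixed-width byte strings are commonly stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodedStringSpec:
    """Hypothetical stand-in for a string dtype carrying its codec as metadata."""
    encoding: str   # any codec name Python knows, e.g. "latin-1" or "utf-8"
    itemsize: int   # fixed number of bytes reserved per element

    def pack(self, text):
        """Encode text and pad it with NUL bytes to exactly itemsize bytes."""
        data = text.encode(self.encoding)
        if len(data) > self.itemsize:
            raise ValueError("encoded text does not fit in itemsize bytes")
        return data.ljust(self.itemsize, b"\x00")

    def unpack(self, raw):
        """Strip NUL padding and decode with the stored codec."""
        return raw.rstrip(b"\x00").decode(self.encoding)

latin1_8 = EncodedStringSpec("latin-1", 8)
utf8_8 = EncodedStringSpec("utf-8", 8)
print(latin1_8.pack("25 °C"))
print(utf8_8.pack("25 °C"))
```

The same element width holds different byte patterns depending on the codec, so the codec has to travel with the data. A side effect of NUL padding is that trailing NUL characters in the text itself are not preserved.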