Re: [Numpy-discussion] A one-byte string dtype?
Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?
DG
On 21 Jan 2014 17:28, "David Goldsmith" <d.l.goldsmith@gmail.com> wrote:
Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?
Sounds plausible, perhaps you could write up such a page?
-n
On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith <d.l.goldsmith@gmail.com> wrote:
Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?
Or maybe a NEP?
https://github.com/numpy/numpy/tree/master/doc/neps
sorry -- really swamped this week, so I won't be writing it...
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
A lot of good discussion here -- too much to comment on individually, but it seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text in scientific computing. It is analogous to a lower-precision numeric type -- i.e. it could not store arbitrary unicode strings -- only the subset that is compatible with the chosen encoding.

Suggested encoding: latin-1
Other options:
- ascii only.
- settable to any one-byte-per-char encoding supported by python
I like this IFF it's pretty easy, but it may add significant complications (and overhead) for comparisons, etc....

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go back to the py2 mojibake hell" -- the goal here is to very clearly have this be text data, and have a clearly defined encoding. Which is why we can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way to conveniently and efficiently use numpy for text that is ansi compatible.

2) a utf-8 dtype:

NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte-per-char encoding, so it would not fit snugly into the numpy data model. It would give compact memory use for mostly-ascii data, so that would be nice.

3) a fully python-3-like (PEP 393) flexible unicode dtype.

This would get us the advantages of the new py3 unicode model -- compact and efficient when it can be, but also supporting all of unicode. Honestly, this seems like more work than it's worth to me, at least given the current numpy dtype model -- maybe a nice addition to dynd. You can, after all, simply use an object array with py3 strings in it. Though perhaps using the py3 unicode type, but having a dtype that specifically links to that, rather than a generic python object, would be a good compromise.

Hmm -- I guess despite what I said, I just wrote the starting point for a NEP... (or two, actually...)
-Chris
On Tue, Jan 21, 2014 at 9:46 AM, Chris Barker <chris.barker@noaa.gov> wrote:
Or maybe a NEP?
https://github.com/numpy/numpy/tree/master/doc/neps
sorry -- really swamped this week, so I won't be writing it...
-Chris
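For illustration, the storage trade-offs among the three proposals in the message above can be checked with plain Python string encoding. This is only a sketch: the sample strings and the helper `encoded_size` are hypothetical, and `utf-32-le` is used here as a stand-in for a fixed four-bytes-per-character layout (the layout numpy's existing 'U' dtype uses).

```python
# Compare how many bytes each encoding needs for a few sample strings.
# The samples and the helper name are made up for this illustration.
samples = ["temperature", "Ångström", "naïve café", "日本語"]


def encoded_size(text, encoding):
    """Return the encoded length in bytes, or None if the text cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        # The text contains characters outside the encoding's repertoire.
        return None


for text in samples:
    sizes = {
        enc: encoded_size(text, enc)
        for enc in ("ascii", "latin-1", "utf-8", "utf-32-le")
    }
    print(f"{text!r}: {len(text)} chars -> {sizes}")
```

With latin-1 the byte count always equals the character count when the text is encodable at all, which is what makes it fit a fixed-width dtype. With utf-8 the byte count depends on the content, which is why proposal (2) cannot simply replace proposal (1).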
On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker <chris.barker@noaa.gov> wrote:
Hmm -- I guess despite what I said, I just wrote the starting point for a NEP...
Should also mention the reasons for adding a new data type.
<snip>
Chuck
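One reason raised in the thread is memory use with the existing dtypes. The following is a minimal sketch of that point, assuming NumPy is installed; the array names are made up, and the latin-1 round trip through an 'S' array is only a stand-in for what a one-byte text dtype might store, not a proposed API.

```python
import numpy as np

words = ["alpha", "beta", "gamma"]

# Existing unicode dtype: fixed width, four bytes per character.
u = np.array(words, dtype="U5")

# Existing bytes dtype: one byte per character, but the array itself
# carries no record of which encoding produced those bytes.
s = np.char.encode(u, "latin-1")

# Object array: each element is a reference to an ordinary Python str.
o = np.array(words, dtype=object)

print("U:", u.dtype, "itemsize", u.itemsize)
print("S:", s.dtype, "itemsize", s.itemsize)
print("O:", o.dtype, "itemsize", o.itemsize, "(reference only, not the string)")

# Getting text back from 'S' requires the caller to remember the encoding.
roundtrip = np.char.decode(s, "latin-1")
print((roundtrip == u).all())
```

Here the 'U5' array spends 20 bytes per element and the 'S5' array spends 5, but only the former is unambiguously text. A one-byte text dtype as described in proposal (1) would aim for the smaller footprint while keeping the encoding explicit.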
participants (4)
- Charles R Harris
- Chris Barker
- David Goldsmith
- Nathaniel Smith