Mailman 3 Re: [Numpy-discussion] A one-byte string dtype? - NumPy-Discussion


Re: [Numpy-discussion] A one-byte string dtype?


David Goldsmith

21 Jan 2014

5:28 p.m.

Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

DG


Nathaniel Smith

21 Jan 2014

5:35 p.m.

New subject: A one-byte string dtype?

On 21 Jan 2014 17:28, "David Goldsmith" wrote:

...

Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

Sounds plausible, perhaps you could write up such a page?

-n

Chris Barker

5:46 p.m.

New subject: A one-byte string dtype?

On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith wrote:

...

Am I the only one who feels that this (very important--I'm being sincere, not sarcastic) thread has matured and specialized enough to warrant its own home on the Wiki?

Or maybe a NEP?

https://github.com/numpy/numpy/tree/master/doc/neps

Sorry -- really swamped this week, so I won't be writing it...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division, NOAA/NOS/OR&R
7600 Sand Point Way NE, Seattle, WA 98115
(206) 526-6959 voice
(206) 526-6329 fax
(206) 526-6317 main reception
Chris.Barker@noaa.gov

Chris Barker

6 p.m.

New subject: A one-byte string dtype?

A lot of good discussion here -- too much to comment on individually, but it seems we can boil it down to a few somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for common text in scientific computing. It is analogous to a lower-precision numeric type -- i.e. it could not store arbitrary unicode strings -- only the subset that is compatible with the suggested encoding.

Suggested encoding: latin-1

Other options:
- ascii only.
- settable to any one-byte-per-char encoding supported by python

I like this IFF it's pretty easy, but it may add significant complications (and overhead) for comparisons, etc....

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go back to the py2 mojibake hell" -- the goal here is to very clearly have this be text data, and have a clearly defined encoding. Which is why we can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way to conveniently and efficiently use numpy for text that is ansi compatible.

2) a utf-8 dtype:

NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte-per-char encoding, so it would not fit snugly into the numpy data model, where every element of an array has the same fixed size. It would give compact memory use for mostly-ascii data, so that would be nice.

3) a fully python-3 like (PEP 393) flexible unicode dtype.

This would get us the advantages of the new py3 unicode model -- compact and efficient when it can be, but also supporting all of unicode. Honestly, this seems like more work than it's worth to me, at least given the current numpy dtype model -- maybe a nice addition to dynd. You can, after all, simply use an object array with py3 strings in it. Though perhaps using the py3 unicode type, but having a dtype that specifically links to that, rather than a generic python object, would be a good compromise.

Hmm -- I guess despite what I said, I just wrote the starting point for a NEP... (or two, actually...)

-Chris
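To make the difference between proposals (1) and (2) a bit more concrete, here is a minimal sketch in plain Python (no numpy involved). It only illustrates the encodings themselves, not how any proposed dtype would be implemented. The helper name `fits_one_byte` and the sample strings are made up for illustration.

```python
# Sketch only: shows which strings a one-byte-per-char encoding can hold,
# and how the utf-8 byte length varies per string.
# `fits_one_byte` and the samples are hypothetical, not part of numpy.

def fits_one_byte(text, encoding="latin-1"):
    """Return True if every character of `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = [
    "temperature",           # pure ascii
    "caf\u00e9",             # e-acute is in latin-1, not in ascii
    "\u00c5ngstr\u00f6m",    # both accented letters are in latin-1
    "\u65e5\u672c\u8a9e",    # CJK characters, outside latin-1
]

for s in samples:
    n_chars = len(s)
    n_utf8 = len(s.encode("utf-8"))  # varies with content, not just length
    print(repr(s), n_chars, fits_one_byte(s), fits_one_byte(s, "ascii"), n_utf8)
```

With latin-1, an encodable string always takes exactly one byte per character, so a fixed-width array element can hold a known number of characters. With utf-8 the byte count depends on the characters: the four-character string "caf\u00e9" takes 4 bytes in latin-1 but 5 in utf-8, which is why a fixed number of bytes does not correspond to a fixed number of characters.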

Charles R Harris

6:14 p.m.

New subject: A one-byte string dtype?

On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker wrote:

...

Hmm -- I guess despite what I said, I just wrote the starting point for a NEP...

Should also mention the reasons for adding a new data type.

<snip>

Chuck
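One candidate reason, offered here as an illustration rather than as something stated in this message, is storage size. The existing 'U' dtype stores four bytes per character, while 'S' stores one byte per character but holds bytes rather than text. The sketch below assumes numpy is installed; the sample words are made up.

```python
import numpy as np

# Sketch only: compare the per-element storage of the existing dtypes.
words = ["spam", "eggs", "ham"]

# 'S': one byte per character, but elements come back as bytes, not text.
as_bytes = np.array([w.encode("latin-1") for w in words])

# 'U': proper text, stored as four bytes per character.
as_unicode = np.array(words)

# object: each element is a reference to an ordinary Python str.
as_object = np.array(words, dtype=object)

for arr in (as_bytes, as_unicode, as_object):
    print(arr.dtype, arr.itemsize, type(arr[0]).__name__)
```

For these four-character-wide arrays the 'S' elements take 4 bytes each and the 'U' elements take 16. The object array stores only a pointer-sized reference per element, with the string data living in separate Python objects.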
