[Numpy-discussion] A one-byte string dtype?

Chris Barker chris.barker at noaa.gov
Tue Jan 21 13:00:19 EST 2014


A lot of good discussion here -- too much to comment on individually, but it
seems we can boil it down to a couple of somewhat distinct proposals:

1) a one-byte-per-char dtype:

This would provide compact, high-efficiency storage for the text commonly
used in scientific computing. It is analogous to a lower-precision numeric type
-- i.e. it could not store arbitrary unicode strings -- only the subset that is
compatible with the suggested encoding.
 Suggested encoding: latin-1
 Other options:
     - ascii only.
     - settable to any one-byte-per-char encoding supported by python.
        I like this IFF it's pretty easy, but it may add significant
        complications (and overhead) for comparisons, etc....
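
To make the storage trade-off in (1) concrete, here is a rough sketch that
emulates the idea by hand with today's dtypes. It is not the proposed dtype
itself (which does not exist): the sample words and the use of 'S8' as a
stand-in container are my own assumptions for illustration.

```python
import numpy as np

words = ["temp", "café", "Ångström"]  # hypothetical sample data

# Existing unicode dtype: 4 bytes per character.
u = np.array(words, dtype="U8")

# Hand-rolled stand-in for a one-byte-per-char text dtype:
# encode each string to latin-1 and store the raw bytes.
s = np.array([w.encode("latin-1") for w in words], dtype="S8")

print(u.itemsize, s.itemsize)  # bytes per element for each layout

# Getting text back needs an explicit decode step today.
decoded = [b.decode("latin-1") for b in s]

# Text outside latin-1 cannot be stored at all, much like a value
# that does not fit in a lower-precision numeric type.
try:
    "日本語".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

The manual encode/decode steps are exactly the bookkeeping a dedicated dtype
would take over.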

NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
back to the py2 mojibake hell" -- the goal here is to very clearly have
this be text data, and have a clearly defined encoding. Which is why we
can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
to conveniently and efficiently use numpy for text that is ansi compatible.
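
A small sketch of why 'S' on its own does not count as text under py3: the
elements come back as bytes, with no encoding attached. The array contents
are made up for illustration, and the latin-1 decode is my assumption about
how the data was encoded; nothing in the array records it.

```python
import numpy as np

a = np.array([b"abc", b"de"], dtype="S3")
item = a[0]

# Elements of an 'S' array are bytes objects, not text.
is_bytes = isinstance(item, bytes)
is_text = isinstance(item, str)

# The caller has to know (or guess) the encoding to get text back.
text = item.decode("latin-1")
```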

2) a utf-8 dtype:
    NOTE: this CANNOT be used in place of (1) above. It is not a
one-byte-per-char encoding, so it would not fit snugly into the numpy data
model. It would give compact memory use for mostly-ascii data, so that would
be nice.
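
A quick plain-python sketch of both points in (2): utf-8 uses a variable
number of bytes per character, so a fixed-size element cannot promise a fixed
number of characters, yet ascii-only text stays compact. The sample
characters and the word "temperature" are arbitrary choices of mine.

```python
# Bytes needed per character in utf-8 for a few sample characters.
samples = ["a", "é", "€", "😀"]
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

# Mostly-ascii text: utf-8 vs. a 4-byte-per-char layout.
word = "temperature"
utf8_size = len(word.encode("utf-8"))
ucs4_size = len(word.encode("utf-32-le"))  # "-le" avoids the BOM

print(widths, utf8_size, ucs4_size)
```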

3) a fully python-3-like (PEP 393) flexible unicode dtype.
  This would get us the advantages of the new py3 unicode model -- compact
and efficient when it can be, but also supporting all of unicode. Honestly,
this seems like more work than it's worth to me, at least given the current
numpy dtype model -- maybe a nice addition to dynd. You can, after
all, simply use an object array with py3 strings in it. Perhaps a good
compromise would be to use the py3 unicode type, but with a dtype that
specifically links to it rather than to a generic python object.
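
For reference, a minimal sketch of the object-array fallback mentioned in
(3), and of the gap a str-specific dtype would close. The sample strings are
made up for illustration.

```python
import numpy as np

strings = ["ascii", "café", "日本語", "😀"]  # hypothetical sample data
obj = np.array(strings, dtype=object)

# Each element is an ordinary py3 str, so all of unicode is supported.
all_str = all(isinstance(x, str) for x in obj)
lengths = [len(x) for x in obj]

# But the dtype is a generic python object: nothing stops a
# non-string from being stored in the same array.
obj[0] = 42
```

A dtype tied to the py3 unicode type could reject that last assignment, which
a plain object array cannot do.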


Hmm -- I guess despite what I said, I just wrote the starting point for a
NEP...

(or two, actually...)

-Chris

On Tue, Jan 21, 2014 at 9:46 AM, Chris Barker <chris.barker at noaa.gov> wrote:

> On Tue, Jan 21, 2014 at 9:28 AM, David Goldsmith <d.l.goldsmith at gmail.com> wrote:
>
>>
>> Am I the only one who feels that this (very important--I'm being sincere,
>> not sarcastic) thread has matured and specialized enough to warrant its
>> own home on the Wiki?
>>
>
> Or maybe a NEP?
>
> https://github.com/numpy/numpy/tree/master/doc/neps
>
> sorry -- really swamped this week, so I won't be writing it...
>
> -Chris
>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>


