[Numpy-discussion] A one-byte string dtype?
Charles R Harris
charlesr.harris at gmail.com
Tue Jan 21 13:14:28 EST 2014
On Tue, Jan 21, 2014 at 11:00 AM, Chris Barker <chris.barker at noaa.gov> wrote:
> A lot of good discussion here -- too much to comment on individually, but it
> seems we can boil it down to a couple of somewhat distinct proposals:
> 1) a one-byte-per-char dtype:
> This would provide compact, high efficiency storage for common text
> for scientific computing. It is analogous to a lower-precision numeric type
> -- i.e. it could not store arbitrary unicode strings -- only the subset that is
> compatible with the suggested encoding.
> Suggested encoding: latin-1
> Other options:
> - ascii only.
> - settable to any one-byte-per-char encoding supported by Python
> I like this IFF it's pretty easy, but it may
> add significant complications (and overhead) for comparisons, etc....
> NOTE: This is NOT a way to conflate bytes and text, and not a way to "go
> back to the py2 mojibake hell" -- the goal here is to very clearly have
> this be text data, and have a clearly defined encoding. Which is why we
> can't just use 'S' -- or adapt 'S' to do this. Rather, it is a way
> to conveniently and efficiently use numpy for text that is ANSI-compatible.
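To make the storage argument in (1) concrete, here is a minimal sketch using what numpy already offers, not the proposed dtype itself. It assumes the text is latin-1 compatible and uses the existing 'S' dtype only as a stand-in for one-byte-per-char storage; the variable names are illustrative.

```python
import numpy as np

# Existing unicode dtype: 4 bytes per character (UCS-4).
words = np.array(["caf\u00e9", "na\u00efve", "abc"])   # dtype '<U5'
unicode_itemsize = words.dtype.itemsize

# Stand-in for a one-byte text dtype: encode to latin-1 and store as 'S'.
as_bytes = np.char.encode(words, "latin-1")            # dtype 'S5'
bytes_itemsize = as_bytes.dtype.itemsize

# The encoding is lossless for text inside the latin-1 subset.
roundtrip = np.char.decode(as_bytes, "latin-1")

# Text outside the subset cannot be stored, like an out-of-range
# value in a lower-precision numeric type.
try:
    "\u20ac".encode("latin-1")   # the euro sign is not in latin-1
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
```

With 'S' the encoding has to be tracked by hand, which is the gap a dedicated text dtype with a defined encoding would close.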
> 2) a utf-8 dtype:
> NOTE: this CAN NOT be used in place of (1) above. It is not a one-byte
> per char encoding, so would not fit snugly into the numpy data model.
> It would give compact memory use for mostly-ascii data, so that would
> be nice.
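The point in (2) that utf-8 is not a one-byte-per-char encoding can be checked with plain Python. This is only an illustration with made-up sample strings: the encoded length depends on the characters, so a fixed-width item cannot be sized from the character count alone.

```python
# Sample strings chosen for illustration only.
samples = ["abc", "caf\u00e9", "\u20ac10", "\u65e5\u672c"]

# Number of characters in each string.
char_counts = [len(s) for s in samples]

# Number of bytes each string needs when encoded as utf-8.
utf8_lengths = [len(s.encode("utf-8")) for s in samples]

for s, n_chars, n_bytes in zip(samples, char_counts, utf8_lengths):
    print(repr(s), n_chars, "chars ->", n_bytes, "bytes")
```

Pure-ascii text stays at one byte per character, which is where the compact memory use for mostly-ascii data comes from.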
> 3) a fully python-3 like ( PEP 393 ) flexible unicode dtype.
> This would get us the advantages of the new py3 unicode model -- compact
> and efficient when it can be, but also supporting all of unicode. Honestly,
> this seems like more work than it's worth to me, at least given the current
> numpy dtype model -- maybe a nice addition to dynd. You can, after
> all, simply use an object array with py3 strings in it. Though perhaps
> using the py3 unicode type, but having a dtype that specifically links to
> that, rather than a generic python object would be a good compromise.
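The object-array fallback mentioned in (3) looks roughly like this. It is a sketch of what works today, not of a dedicated dtype; the sample strings are illustrative.

```python
import numpy as np

# An object array holding ordinary py3 str instances. Each element keeps
# Python's own flexible (PEP 393) representation, so all of unicode is allowed.
names = np.array(["abc", "caf\u00e9", "\u65e5\u672c\u8a9e"], dtype=object)

# The array only stores references; the elements are plain str objects.
all_str = all(isinstance(s, str) for s in names)
lengths = [len(s) for s in names]

# Nothing in the dtype says "text": any Python object is accepted,
# which is the gap a str-specific dtype would close.
names[0] = 42
still_all_str = all(isinstance(s, str) for s in names)
```

The last two lines show why a generic object array is a weak guarantee: the dtype does not stop a non-string from being stored.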
> Hmm -- I guess despite what I said, I just wrote the starting point for a
Should also mention the reasons for adding a new data type.