On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
I think we may want something like PEP 393. The S datatype may be the wrong place to look; we might want a modification of U instead, so as to transparently get the benefit of python strings.
The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque.
Since str is immutable the maximum code point in the string can be determined once when the string is created before anything else can get a pointer to the string buffer.
Since it is opaque no one can rightly expect it to expose a particular binary format so it is free to choose without compromising any expected semantics.
If someone can call buffer on an array then the FSR is a semantic change.
If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one byte per char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa.
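A minimal sketch of the stale-view problem described above. NumPy has no FSR-backed dtype, so plain uint8 and uint32 arrays of code points stand in for the narrow and widened buffers; that stand-in is an assumption made only for illustration.

```python
import numpy as np

# Stand-in for an ASCII-only string stored at one byte per character.
narrow = np.array([ord(c) for c in "abc"], dtype=np.uint8)

# A view taken before any widening shares the narrow buffer.
view = narrow[:2]

# Storing U+20AC does not fit in one byte, so a wider buffer has to be
# allocated and the existing data copied across.
wide = narrow.astype(np.uint32)
wide[0] = 0x20AC

# The old view still points at the original one-byte buffer.
shares = np.shares_memory(wide, view)
view_first = int(view[0])
print(shares, hex(view_first), hex(int(wide[0])))
```

The view keeps reporting the old code point, and later writes through either object would not be visible through the other.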
I don't think that this can be done transparently since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.
I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want the transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
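A rough sketch of what "raising errors if the string doesn't fit in the allotted size" could look like on top of the existing 'S' dtype. The helper name `encode_fixed` and its signature are hypothetical, not part of NumPy; the proposed 's' dtype does not exist, so an explicit encode step stands in for it here.

```python
import numpy as np

def encode_fixed(text, nbytes, encoding="utf-8"):
    """Hypothetical helper: encode text, refusing anything over nbytes."""
    raw = text.encode(encoding)
    if len(raw) > nbytes:
        raise ValueError(
            f"{text!r} needs {len(raw)} bytes, only {nbytes} allotted"
        )
    return raw

# Fixed-width byte strings, four bytes per element.
arr = np.zeros(2, dtype="S4")
arr[0] = encode_fixed("abcd", 4)

try:
    # "abc" plus U+20AC is 3 + 3 bytes in UTF-8, too wide for S4.
    arr[1] = encode_fixed("abc\u20ac", 4)
    overflowed = False
except ValueError:
    overflowed = True

# Reading back goes through the same encoding.
roundtrip = arr[0].decode("utf-8")
```

The point of the sketch is that the element width is counted in bytes, so the number of characters that fit depends on the text once anything beyond ASCII shows up.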
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
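For reference, a small sketch of the dtype="O" workaround mentioned above next to a fixed-width 'U' array. It only uses existing NumPy behaviour; the variable names are illustrative.

```python
import numpy as np

# Object array: each element is a pointer to an ordinary Python str,
# so lengths and code point ranges can vary per element.
names = np.array(["a", "a much longer string", "\u20ac"], dtype=object)
lengths = [len(s) for s in names]

# Replacing an element with a longer string needs no reallocation
# of the array itself.
names[0] = "now longer than before"

# Fixed-width 'U2' array: assigning a longer string is silently truncated.
fixed = np.array(["ab", "cd"], dtype="U2")
fixed[0] = "abcdef"
truncated = str(fixed[0])
```

The object array pays for this flexibility with one Python object per element, which is roughly the cost a dedicated variable-length string type with a lookaside buffer would aim to reduce.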
The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.
Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that? - Tom
<snip>
Chuck
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion