On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:
I think we may want something like PEP 393. The S datatype may be the wrong place to look; we might want a modification of U instead, so as to transparently get the benefit of python strings.
The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays for two reasons: str is immutable and opaque.
Since str is immutable the maximum code point in the string can be determined once when the string is created before anything else can get a pointer to the string buffer.
Since it is opaque no one can rightly expect it to expose a particular binary format so it is free to choose without compromising any expected semantics.
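For reference, the width-selection behaviour of the FSR that Oscar describes can be observed directly from CPython: the per-character width is fixed when the (immutable) str object is created, based on its highest code point. A quick check (the exact byte counts are CPython implementation details, so only the ordering is asserted):

```python
import sys

ascii_s  = 'a' * 100           # max code point < 128    -> 1 byte/char
bmp_s    = '\u0101' * 100      # max code point > 255    -> 2 bytes/char
astral_s = '\U00010000' * 100  # max code point > 0xFFFF -> 4 bytes/char

# The width is chosen once at creation and never changes afterwards:
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```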
If someone can call buffer on an array then the FSR is a semantic change.
If a numpy 'U' array used the FSR and consisted only of ASCII characters then it would have a one byte per char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views and vice versa.
I don't think that this can be done transparently since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.
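Oscar's "dtype with an encoding" idea can be sketched in pure Python over today's 'S' dtype. This is a hypothetical illustration only (the class name and layout are assumptions, not a proposed numpy API): the buffer stays binary, while item access is text, so both interfaces are consistent:

```python
import numpy as np

class EncodedText:
    """Hypothetical sketch of an array whose buffer holds encoded
    bytes (the binary interface) but whose item access is str (the
    text interface). Illustrative only, not a real numpy dtype."""
    def __init__(self, data, itemsize, encoding='utf-8'):
        self.encoding = encoding
        self.raw = np.zeros(len(data), dtype='S%d' % itemsize)
        for i, s in enumerate(data):
            self[i] = s

    def __getitem__(self, i):      # text interface: decode on read
        return self.raw[i].decode(self.encoding)

    def __setitem__(self, i, s):   # text interface: encode on write
        b = s.encode(self.encoding)
        if len(b) > self.raw.dtype.itemsize:
            raise ValueError('encoded value does not fit in item')
        self.raw[i] = b

a = EncodedText(['abc', 'xyz'], itemsize=6)
assert a[0] == 'abc'          # text out
assert a.raw[0] == b'abc'     # binary buffer remains well defined
```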
I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
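The pandas-style dtype="O" workaround mentioned above stores references to ordinary Python str objects, so elements are variable-length, unlike a fixed-width 'U' array, which silently truncates on assignment:

```python
import numpy as np

# object dtype: each element is a reference to a Python str
a = np.array(['a', 'bbbb', 'cc'], dtype=object)
a[0] = 'a much longer string'     # no fixed-width truncation
assert a[0] == 'a much longer string'

# fixed-width 'U' array: width inferred as 4, assignment truncates
u = np.array(['a', 'bbbb', 'cc'])  # dtype '<U4'
u[0] = 'a much longer string'
assert u[0] == 'a mu'
```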
Though, IMO any new dtype here would need a cleanup of the dtype code first so that it doesn't require yet more massive special cases all over umath.so.
Worth thinking about. As another alternative, what is the minimum we need to make a restricted encoding, say latin-1, appear transparently as a unicode string to python? I know the python folks don't like this much, but I suspect something along that line will eventually be required for the http folks.

Chuck
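Latin-1 is attractive for this because it maps bytes 0-255 one-to-one onto Unicode code points 0-255, so a one-byte-per-char buffer round-trips losslessly and transcoding is trivial. A quick sanity check of that property:

```python
# latin-1 is the identity mapping between byte values and the first
# 256 Unicode code points, so decode/encode is a lossless round trip:
raw = bytes(range(256))
s = raw.decode('latin-1')
assert s.encode('latin-1') == raw
assert all(ord(c) == i for i, c in enumerate(s))
```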