On Tue, 2014-01-21 at 07:48 -0700, Charles R Harris wrote:
On Tue, Jan 21, 2014 at 7:37 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <aldcroft@head.cfa.harvard.edu> wrote:
On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:

On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:

On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:

On Jan 20, 2014 8:35 PM, "Charles R Harris" <charlesr.harris@gmail.com> wrote:

I think we may want something like PEP 393. The S datatype may be the wrong place to look; we might want a modification of U instead, so as to transparently get the benefit of python strings.

The approach taken in PEP 393 (the FSR) makes more sense for str than it does for numpy arrays, for two reasons: str is immutable and opaque.

Since str is immutable, the maximum code point in the string can be determined once, when the string is created, before anything else can get a pointer to the string buffer.

Since it is opaque, no one can rightly expect it to expose a particular binary format, so it is free to choose without compromising any expected semantics.

If someone can call buffer on an array, then the FSR is a semantic change.

If a numpy 'U' array used the FSR and consisted only of ASCII characters, then it would have a one-byte-per-char buffer. What then happens if you put a higher code point in? The buffer needs to be resized and the data copied over. But then what happens to any buffer objects or array views? They would be pointing at the old buffer from before the resize. Subsequent modifications to the resized array would not show up in other views, and vice versa.

I don't think that this can be done transparently, since users of a numpy array need to know about the binary representation. That's why I suggest a dtype that has an encoding. Only in that way can it consistently have both a binary and a text interface.
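The view-invalidation worry can be illustrated with current numpy behavior: 'U' arrays are fixed-width UCS-4, so views share one buffer with a stable layout, and in-place writes are visible through every view. An FSR-style representation that resized the buffer on a higher code point would break exactly this contract. A minimal sketch:

```python
import numpy as np

# 'U' arrays store fixed-width UCS-4 (4 bytes per code point), so the
# buffer layout never changes when the content changes.
a = np.array(['abc', 'xyz'], dtype='U3')
v = a[:1]                       # a view into the same buffer
assert a.dtype.itemsize == 12   # 3 chars * 4 bytes each

a[0] = 'def'                    # in-place write, no reallocation
assert v[0] == 'def'            # the view sees the modification
```

Under a representation that reallocated the buffer when the maximum code point grew, `v` would be left pointing at the stale pre-resize buffer and the final assertion could fail.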
I didn't say we should change the S type, but that we should have something, say 's', that appeared to python as a string. I think if we want transparent string interoperability with python together with a compressed representation (and I think we need both), we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ASCII data that is already out there, not a substitute for 'U'.
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
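The dtype="O" workaround mentioned above already gives variable-length strings today, at the cost of storing a Python object pointer per element rather than inline data. A quick sketch of why it sidesteps the fixed-length problem:

```python
import numpy as np

# An object array holds references to ordinary Python str objects, so
# each element may have a different length, with no fixed-width limit.
s = np.array(['a', 'hello', 'a much longer string'], dtype=object)

s[0] = s[0] + '!'      # grows freely; a fixed 'U1' array would truncate
assert s[0] == 'a!'
assert all(isinstance(x, str) for x in s)
```

The trade-off is the one driving this thread: each element is a separate heap object, so the memory overhead and cache behavior are much worse than a compact fixed-width or lookaside-buffer representation would be.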
The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.
Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?
Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.
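The compactness/transparency tension comes down to encoding widths: latin-1 is exactly one byte per character but covers only 256 code points, while utf-8 covers everything but is variable-width, so a fixed byte budget can silently run out. A small sketch of both failure modes:

```python
# latin-1 encodes each of its 256 code points in exactly one byte;
# utf-8 is variable-width, so byte length != character length.
text = 'café'
assert len(text) == 4
assert len(text.encode('latin-1')) == 4   # one byte per char
assert len(text.encode('utf-8')) == 5     # 'é' takes two bytes

# Code points above U+00FF simply cannot be stored as latin-1.
raised = False
try:
    '€'.encode('latin-1')                 # U+20AC is out of range
except UnicodeEncodeError:
    raised = True
assert raised
```

This is why a utf-8-backed fixed-width dtype would have to raise errors when a string doesn't fit its allotted size, whereas a latin-1 dtype errors only on out-of-range characters.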
Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability. This would solve my use-case (giga-rows of short fixed length strings), and presumably allow things like memory mapping of large data files (like for FITS files in astropy.io.fits).
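The memory-mapping point is worth spelling out: a fixed-width dtype has a predictable on-disk layout, so a file can be mapped directly without parsing. A minimal sketch with today's 'S' dtype (the file path is just illustrative):

```python
import os
import tempfile
import numpy as np

# Fixed-width byte strings have a predictable on-disk layout, which is
# what makes memory mapping large files practical.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
np.array([b'abc', b'de'], dtype='S3').tofile(path)

m = np.memmap(path, dtype='S3', mode='r')  # shape inferred from file size
assert m[0] == b'abc'
assert m[1] == b'de'      # short entries are null-padded on disk
```

A one-byte-per-char latin-1 's' dtype would keep exactly this property, which a variable-width representation (utf-8 or a lookaside buffer) gives up.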
I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings. Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4?
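For reference, the reason 'U' "just works" from the user's perspective is that indexing yields a numpy scalar that subclasses str in Python 3, at the cost of 4 bytes per character:

```python
import numpy as np

# Indexing a 'U' array yields a numpy scalar (np.str_) that subclasses
# str, so it interoperates transparently with Python 3 string code.
a = np.array(['hello'], dtype='U5')
assert isinstance(a[0], str)
assert a[0].upper() == 'HELLO'

# The cost: UCS-4 storage, 4 bytes per character.
assert a.dtype.itemsize == 20
```

The question in the thread is whether an 's' dtype could do the same subclass trick while decoding from a one-byte latin-1 buffer instead of UCS-4.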
I think there is a technical problem. We may be able to masquerade latin-1 as utf-8 for some subset of characters, or fool python 3 in some other way. But in any case, I think it needs some research to see what the possibilities are.
I am not quite sure, but shouldn't it even be possible to tag a possible encoding onto the metadata of the string dtype, and allow it to be set to any 1-byte-wide encoding that python understands? If the metadata is not None, all entry points to and from the array (object->string and string->object conversions) would then de- or encode using the usual python string decode and encode. Of course it would still be a lot of work, since the string comparisons would need to know about comparing different encodings, dtype equivalence would be wrong, and all the conversions would need to be carefully checked. Most string tools probably don't care about the encoding as long as it is a fixed 1-byte width, though one would have to check that they don't lose the encoding information by creating a new "S" array... - Sebastian
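numpy's dtype constructor does accept a `metadata` dict, so the tagging half of this idea can at least be sketched today; everything else (comparisons, automatic de/encoding at entry points) would be new machinery. A sketch, with the caveat that nothing in numpy currently acts on this metadata and the decoding step here is done by hand:

```python
import numpy as np

# Sketch only: tag an encoding onto an 'S' dtype via dtype metadata.
# numpy stores but does not interpret this; the decode is manual.
dt = np.dtype('S4', metadata={'encoding': 'latin-1'})
a = np.array([b'caf\xe9'], dtype=dt)

enc = a.dtype.metadata['encoding']
assert a[0].decode(enc) == 'café'
```

For the scheme described above, the conversion would instead happen automatically whenever an element crosses the array boundary, and comparison ufuncs would need to consult the metadata of both operands.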
Chuck
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion