<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jan 21, 2014 at 8:55 AM, Charles R Harris <span dir="ltr"><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div class="h5">On Tue, Jan 21, 2014 at 5:54 AM, Aldcroft, Thomas <span dir="ltr"><<a href="mailto:aldcroft@head.cfa.harvard.edu" target="_blank">aldcroft@head.cfa.harvard.edu</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div>On Mon, Jan 20, 2014 at 6:12 PM, Charles R Harris <span dir="ltr"><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div>On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <span dir="ltr"><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div>On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <span dir="ltr"><<a href="mailto:njs@pobox.com" target="_blank">njs@pobox.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris<br>

<div><div><<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>> wrote:<br>

><br>

><br>

><br>

> On Mon, Jan 20, 2014 at 2:27 PM, Oscar Benjamin <<a href="mailto:oscar.j.benjamin@gmail.com" target="_blank">oscar.j.benjamin@gmail.com</a>><br>

> wrote:<br>

>><br>

>><br>

>> On Jan 20, 2014 8:35 PM, "Charles R Harris" <<a href="mailto:charlesr.harris@gmail.com" target="_blank">charlesr.harris@gmail.com</a>><br>

>> wrote:<br>

>> ><br>

>> > I think we may want something like PEP 393. The S datatype may be the<br>

>> > wrong place to look, we might want a modification of U instead so as to<br>

>> > transparently get the benefit of python strings.<br>

>><br>

>> The approach taken in PEP 393 (the FSR) makes more sense for str than it<br>

>> does for numpy arrays for two reasons: str is immutable and opaque.<br>

>><br>

>> Since str is immutable the maximum code point in the string can be<br>

>> determined once when the string is created before anything else can get a<br>

>> pointer to the string buffer.<br>

>><br>

>> Since it is opaque no one can rightly expect it to expose a particular<br>

>> binary format so it is free to choose without compromising any expected<br>

>> semantics.<br>

>><br>

>> If someone can call buffer on an array then the FSR is a semantic change.<br>

>><br>

>> If a numpy 'U' array used the FSR and consisted only of ASCII characters<br>

>> then it would have a one byte per char buffer. What then happens if you put<br>

>> a higher code point in? The buffer needs to be resized and the data copied<br>

>> over. But then what happens to any buffer objects or array views? They would<br>

>> be pointing at the old buffer from before the resize. Subsequent<br>

>> modifications to the resized array would not show up in other views and vice<br>

>> versa.<br>

>><br>

>> I don't think that this can be done transparently since users of a numpy<br>

>> array need to know about the binary representation. That's why I suggest a<br>

>> dtype that has an encoding. Only in that way can it consistently have both a<br>

>> binary and a text interface.<br>

><br>

><br>

> I didn't say we should change the S type, but that we should have something,<br>

> say 's', that appeared to python as a string. I think if we want transparent<br>

> string interoperability with python together with a compressed<br>

> representation, and I think we need both, we are going to have to deal with<br>

> the difficulties of utf-8. That means raising errors if the string doesn't<br>

> fit in the allotted size, etc. Mind, this is a workaround for the mass of<br>

> ascii data that is already out there, not a substitute for 'U'.<br>

<br>

</div></div>If we're going to be taking that much trouble, I'd suggest going ahead<br>

and adding a variable-length string type (where the array itself<br>

contains a pointer to a lookaside buffer, maybe with an optimization<br>

for stashing short strings directly). The fixed-length requirement is<br>

pretty onerous for lots of applications (e.g., pandas always uses<br>

dtype="O" for strings -- and that might be a good workaround for some<br>

people in this thread for now). The use of a lookaside buffer would<br>

also make it practical to resize the buffer when the maximum code<br>

point changed, for that matter...<br></blockquote></div></div></div></div></div></blockquote><div><br></div></div></div><div>The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only as you suggest.<br>
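<div><br></div><div>As a rough illustration of the dtype="O" workaround mentioned above (a minimal sketch, not a proposal for the new type; the sample values are made up): a fixed-width 'S' array silently truncates on assignment, while an object array holds one pointer per element, so element lengths can vary.</div>
<pre>
```python
import numpy as np

# Fixed-width 'S3' silently truncates anything longer than 3 bytes.
fixed = np.array([b"abc", b"de"], dtype="S3")
fixed[1] = b"longer value"
truncated = fixed[1]

# dtype=object stores a pointer per element, so lengths can differ freely.
ragged = np.array(["abc", "de"], dtype=object)
ragged[1] = "longer value"
kept = ragged[1]
print(truncated, kept)
```
</pre>
<div>The price is one Python object per element, which is the memory overhead a dedicated string-only specialization would presumably try to reduce.</div>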


</div></div></div></div></blockquote><div><br></div></div></div><div>Is this approach intended to be in *addition to* the latin-1 "s" type originally proposed by Chris, or *instead of* that?</div><div><br></div>


</div></div></div></blockquote><div><br></div></div></div><div>Well, that's open for discussion. The problem is to have something that is both compact (latin-1) and interoperates transparently with python 3 strings (utf-8). A latin-1 type would be easier to implement and would probably be a better choice for something available in both python 2 and python 3, but unless the python 3 developers come up with something clever I don't see how to make it behave transparently as a string in python 3. OTOH, it's not clear to me how to make utf-8 operate transparently with python 2 strings, especially as the unicode representation choices in python 2 are ucs-2 or ucs-4 and the python 3 work adding utf-16 and utf-8 is unlikely to be backported. The problem may be unsolvable in a completely satisfactory way.<br>


</div></div></div></div></blockquote><div><br></div><div>Since it's open for discussion, I'll put in my vote for implementing the easier latin-1 version in the short term to facilitate Python 2 / 3 interoperability.  This would solve my use case (giga-rows of short fixed-length strings), and presumably allow things like memory mapping of large data files (such as FITS files read through astropy.io.fits).</div>
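<div><br></div><div>To make the memory-mapping point concrete, here is a minimal sketch using the existing 'S' dtype on a small temporary file. The file name, record width, and contents are assumptions for illustration only; a real FITS table has its own layout.</div>
<pre>
```python
import os
import tempfile
import numpy as np

# Hypothetical data: fixed-width 4-byte records written back to back.
records = [b"ab  ", b"cd  ", b"ef  "]
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as fh:
    fh.write(b"".join(records))

# Map the file as an 'S4' array without copying it into memory.
col = np.memmap(path, dtype="S4", mode="r")

# Each element is a bytes scalar; decoding is left to the caller.
first = col[0].decode("latin-1")

# Storage needed for the same column as one-byte vs. UCS-4 characters.
nbytes_as_S = col.size * col.dtype.itemsize
nbytes_as_U = col.size * np.dtype("U4").itemsize
print(first, nbytes_as_S, nbytes_as_U)
```
</pre>
<div>The on-disk bytes can be mapped directly only because each character takes one byte; the same column held as 'U' needs four times the space and a conversion pass.</div>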


<div><br></div><div>I don't have a clue how the current 'U' dtype works under the hood, but from my user perspective it seems to work just fine in terms of interacting with Python 3 strings.  Is there a technical problem with doing basically the same thing for an 's' dtype, but using latin-1 instead of UCS-4?</div>
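<div><br></div><div>For reference, a small sketch of what I mean, using only dtypes that exist today: 'U' stores four bytes per character, and a latin-1 encoded 'S' array stores one. The word list is made up, and the explicit encode/decode step is exactly what a latin-1 's' dtype would presumably hide.</div>
<pre>
```python
import numpy as np

words = ["alpha", "beta", "caf\u00e9"]

# 'U5' is fixed-width UCS-4: 4 bytes per character.
u = np.array(words, dtype="U5")

# Encoding to latin-1 gives a fixed-width bytes array: 1 byte per character.
s = np.char.encode(u, "latin-1")

u_bytes_per_char = u.dtype.itemsize // 5
s_bytes_per_char = s.dtype.itemsize // 5

# Decoding the latin-1 bytes recovers the original text.
roundtrip = np.char.decode(s, "latin-1")
same = bool((roundtrip == u).all())
print(u_bytes_per_char, s_bytes_per_char, same)
```
</pre>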


<div><br></div><div>Thanks,</div><div>Tom</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">


<div>

<br></div><div>Chuck  <br></div></div></div></div>

<br>_______________________________________________<br>

NumPy-Discussion mailing list<br>

<a href="mailto:NumPy-Discussion@scipy.org">NumPy-Discussion@scipy.org</a><br>

<a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

<br></blockquote></div><br></div></div>