<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 17, 2014 at 5:30 PM, Chris Barker <span dir="ltr">&lt;<a href="mailto:chris.barker@noaa.gov" target="_blank">chris.barker@noaa.gov</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Folks,<div><br></div><div>I've been blathering away on the related threads a lot -- sorry if it's too much. It's gotten a bit tangled up, so I thought I'd start a new one to address this one question (i.e. don't bring up genfromtxt here):</div>


<div><br></div><div>Would it be a good thing for numpy to have a one-byte-per-character string type?</div><div><br></div><div>We did have that with the 'S' type in py2, but the changes in py3 have made it not quite the right thing. And it appears that enough people use 'S' in py3 to mean 'bytes', so that we can't change that now.</div>


<div><br></div><div>The only difference may be that 'S' currently auto translates to a bytes object, resulting in things like:<br></div><div><br></div><div><div>np.array(['some text',],  dtype='S')[0] == 'some text'</div>


<div><br></div><div>yielding False on Py3. And you can't do all the usual text stuff with the resulting bytes object, either. (and it probably used the default encoding to generate the bytes, so will barf on some inputs, though that may be unavoidable.) So you need to decode the bytes that are given back, and now that I think about it, I have no idea what encoding you'd need to use in the general case.</div>
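<div><br></div><div>A minimal sketch of that behaviour on Python 3, assuming pure-ASCII input. The variable names are illustrative only, and the scalar is converted to a plain bytes object first so the comparison is between built-in types:</div>

```python
import numpy as np

# Storing text in an 'S' array on Python 3 gives back bytes, not str.
arr = np.array(['some text'], dtype='S')
item = bytes(arr[0])                      # plain bytes copy of the NumPy scalar

matches_text = (item == 'some text')      # bytes never compare equal to str
matches_bytes = (item == b'some text')

# Getting text back needs an explicit decode; ASCII is assumed here.
text = item.decode('ascii')
```

<div>The decode step is where the encoding question bites: 'ascii' works for this input, but nothing in the array records which encoding produced the bytes.</div>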


<div><br></div><div>So the correct solution is (particularly on py3) to use the 'U' (unicode) dtype for text in numpy arrays.</div><div><br></div><div>However, the 'U' dtype is 4 bytes per character, and that may be "too big" for some use-cases. And there is a lot of text in scientific data sets that is pure ASCII, or at least in some 1-byte-per-character encoding.</div>
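<div><br></div><div>To put numbers on the size difference, here is a small sketch comparing the two existing dtypes on the same ASCII input (the sample words are made up):</div>

```python
import numpy as np

words = ['some text', 'more']
as_unicode = np.array(words, dtype='U')   # 4 bytes per character
as_bytes = np.array(words, dtype='S')     # 1 byte per character

# The longest entry sets the fixed item width for the whole array.
chars = max(len(w) for w in words)

unicode_item_bytes = as_unicode.itemsize
bytes_item_bytes = as_bytes.itemsize
```

<div>For large arrays of short ASCII labels, that factor of four applies to every element, including the padding on shorter entries.</div>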


<div><br></div><div>So, in the spirit of having multiple numeric types that use different amounts of memory, and can hold different ranges of values, a one-byte-per character dtype would be nice:</div><div>


<br></div><div>(note, this opens the door for a 2-byte-per-character (UCS-2) dtype too; I personally don't think that's worth it, but maybe that's because I'm an English speaker...)</div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


<div dir="ltr"><div><div><br></div>

<div>

It could use the 's' (lower-case s) type identifier.</div><div><br></div><div>For passing to/from python built-in objects, it would</div><div><br></div><div>* Allow either Python bytes objects or Python unicode objects as input</div>


<div>     a) bytes objects would be passed through as-is<br></div><div>     b) unicode objects would be encoded as latin-1</div><div> </div><div>[note: I'm not entirely sure that bytes objects should be allowed, but it would provide a nice efficiency in a fairly common case]</div>


<div><br></div><div>* It would create python unicode text objects, decoded as latin-1.</div><div><br></div><div>Could we have a way to specify another encoding? I'm not sure how that would fit into the dtype system.</div>
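<div><br></div><div>The input and output rules above can be emulated today on top of the existing 'S' dtype. This is only a sketch: to_one_byte and from_one_byte are hypothetical helper names, not NumPy functions, and the proposed 's' dtype itself does not exist.</div>

```python
import numpy as np

def to_one_byte(values):
    """Hypothetical helper: build an 'S' array following the proposed input rules."""
    # bytes pass through as-is; str objects are encoded as latin-1
    encoded = [v if isinstance(v, bytes) else v.encode('latin-1') for v in values]
    return np.array(encoded, dtype='S')

def from_one_byte(arr):
    """Hypothetical helper: return Python str objects decoded as latin-1."""
    return [bytes(item).decode('latin-1') for item in arr]

stored = to_one_byte(['caf\xe9', b'raw\xff'])   # one str input, one bytes input
texts = from_one_byte(stored)
```

<div>A real dtype would do this conversion at the element level rather than through Python lists; the helpers only show the intended semantics.</div>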


<div><br></div><div>I've explained the latin-1 thing on other threads, but the short version is:</div><div><br></div><div> - It will work perfectly for ascii text</div><div> - It will work perfectly for latin-1 text (natch)</div>


<div> - It will never give you a UnicodeDecodeError, regardless of what arbitrary bytes you pass in.</div><div> - It will preserve those arbitrary bytes through a decoding/encoding round trip.</div><div>

<br>
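<div>The latin-1 properties listed above can be checked with plain Python, no numpy needed. A short sketch, with UTF-8 included only as a contrast:</div>

```python
# Every possible byte value decodes under latin-1 and survives the round trip.
every_byte = bytes(range(256))
decoded = every_byte.decode('latin-1')
round_trip = decoded.encode('latin-1')

# By contrast, UTF-8 rejects some byte sequences.
try:
    b'\xff'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The one-byte limit: code points above U+00FF cannot be encoded as latin-1.
try:
    '\u20ac'.encode('latin-1')
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
```

<div>So decoding is always safe, while encoding can still fail for text that does not fit in one byte per character.</div>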

</div><div>(it still wouldn't allow you to store arbitrary unicode -- but that's the limitation of one-byte per character...)</div><div><br></div><div>So:</div><div><br></div><div>Bad idea all around: shut up already!</div>


<div><br></div><div>or</div><div><br></div><div>Fine idea, but who's going to write the code? not me!</div><div><br></div><div>or</div><div><br></div><div>We really should do this.</div></div></div></blockquote><div>


<br></div><div>As evident from what I said in the previous thread, YES, this should really be done!</div><div><br></div><div>One important feature would be changing the dtype from 'S' to 's' without any memory copies, so that conversion would be very cheap.  Maybe this would essentially come for free with something like astype('s', copy=False).</div>
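<div><br></div><div>Since the 's' dtype is only a proposal, the closest illustration with today's numpy is a sketch of the two existing no-copy mechanisms: astype with copy=False when the dtype already matches, and view to reinterpret the same buffer. The uint8 view below is a stand-in, not the proposed conversion.</div>

```python
import numpy as np

raw = np.array([b'abc', b'de'], dtype='S3')

# astype with copy=False hands back the same array when nothing needs converting.
same = raw.astype('S3', copy=False)

# A view reinterprets the existing buffer; here each item becomes 3 uint8 values.
as_codes = raw.view(np.uint8).reshape(len(raw), raw.itemsize)

shares = np.shares_memory(raw, as_codes)
```

<div>If 's' and 'S' shared the same memory layout, a view-style conversion like this is what would make the switch essentially free.</div>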


<div><br></div><div>- Tom</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>


<div><br></div><div>(of course, with the options of amending the above not-very-fleshed out proposal)</div><span class="HOEnZb"><font color="#888888"><div><br></div><div>-Chris</div><div><br></div></font></span></div><span class="HOEnZb"><font color="#888888"><div>


-- <br><br>Christopher Barker, Ph.D.<br>


Oceanographer<br><br>Emergency Response Division<br>NOAA/NOS/OR&R            <a href="tel:%28206%29%20526-6959" value="+12065266959" target="_blank">(206) 526-6959</a>   voice<br>7600 Sand Point Way NE   <a href="tel:%28206%29%20526-6329" value="+12065266329" target="_blank">(206) 526-6329</a>   fax<br>


Seattle, WA  98115       <a href="tel:%28206%29%20526-6317" value="+12065266317" target="_blank">(206) 526-6317</a>   main reception<br><br>

<a href="mailto:Chris.Barker@noaa.gov" target="_blank">Chris.Barker@noaa.gov</a>

</div></font></span></div>

<br>_______________________________________________<br>

NumPy-Discussion mailing list<br>

<a href="mailto:NumPy-Discussion@scipy.org">NumPy-Discussion@scipy.org</a><br>

<a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

<br></blockquote></div><br></div></div>