<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Apr 21, 2017 at 2:34 PM, Stephan Hoyer <span dir="ltr"><<a href="mailto:shoyer@gmail.com" target="_blank">shoyer@gmail.com</a>></span> wrote:<br><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra">In this case, we want something compatible with Python's string (i.e. full Unicode supporting) and I think should be as transparent as possible. Python's string has made the decision to present a character oriented API to users (despite what the manifesto says...).</div></div></blockquote><div><br></div></span><div>Yes, but NumPy doesn't really implement string operations, so fortunately this is pretty irrelevant to us -- except for our API for specifying dtype size.</div></div></div></div></blockquote><div><br></div><div>Exactly -- the character-orientation of Python strings means that people are used to thinking that a string's length is the number of characters in it. I think there will be cognitive dissonance if someone does:</div><div><br></div><div>arr[i] = a_string</div><div><br></div><div>Which then raises a ValueError, something like:</div><div><br></div><div>String too long for a string[12] dtype array.</div><div><br></div><div>When len(a_string) <= 12</div><div><br></div><div>AND that will only occur if there are non-ascii characters in the string, and maybe only if there are more than N non-ascii characters, i.e. it is very likely to be a run-time error that may not have shown up in tests.</div><div><br></div><div>So folks need to do something like:</div><div><br></div><div>len(a_string.encode('utf-8')) to see if their string will fit. 
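<div><br></div><div>A minimal sketch of that check in plain Python, along with one possible way to shorten a string that fails it. The helper names are made up for illustration, and it assumes the dtype size counts utf-8 bytes:</div>
<pre>
```python
def utf8_fits(text: str, nbytes: int) -> bool:
    """Return True if text needs at most nbytes bytes when encoded as utf-8."""
    return len(text.encode("utf-8")) <= nbytes


def truncate_utf8(text: str, nbytes: int) -> str:
    """Shorten text so that its utf-8 encoding fits in nbytes bytes."""
    clipped = text.encode("utf-8")[:nbytes]
    # The slice may cut the last character in half; errors="ignore"
    # drops that incomplete tail instead of raising UnicodeDecodeError.
    return clipped.decode("utf-8", errors="ignore")


label = "Ångström öl"                      # 11 characters, 14 bytes in utf-8
print(len(label), len(label.encode("utf-8")))
print(utf8_fits(label, 12))                # False, despite len(label) <= 12
print(repr(truncate_utf8(label, 12)))      # 'Ångström ' (the split ö is dropped)
```
</pre>
<div>Here the 11-character string needs 14 bytes, so it would not fit a 12-byte field even though len() says it should.</div><div><br></div>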
If not, they need to truncate it, and THAT is non-obvious, too -- you don't want to truncate the encoded bytes naively, because you could end up with an invalid bytestring, but you don't know how many characters to truncate, either.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>We already have strong precedence for dtypes reflecting number of bytes used for storage even when Python doesn't: consider numeric types like int64 and float32 compared to the Python equivalents. It's an intrinsic aspect of NumPy that users need to think about how their data is actually stored.</div></div></div></div></blockquote><div><br></div><div>sure, but a float64 is 64 bits forever and always, and the defaults perfectly match what python is doing under its hood -- even if users don't think about it. So the default behaviour of numpy matches python's built-in types.</div><div> </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra">Storage cost is always going to be a concern. Arguably, it's even more of a concern today than it used to be, because compute has been improving faster than storage.</div></div></blockquote></span></div></div></div></blockquote><div><br></div><div>sure -- but again, what is the use-case for numpy arrays with a s#$)load of text in them? Is it common? I don't think so. And as you pointed out, numpy doesn't do text processing anyway, so cache performance and all that are not important. 
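<div><br></div><div>For a rough sense of the storage trade-off, here is a small sketch in plain Python. The helper name is made up for illustration, and utf-32-le is used as a stand-in for how a fixed-width UCS-4 field would be laid out:</div>
<pre>
```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in two encodings."""
    return {
        "chars": len(text),
        # utf-32-le: fixed 4 bytes per character, no byte-order mark
        "ucs4": len(text.encode("utf-32-le")),
        # utf-8: 1 byte for ASCII, 2 to 4 bytes for everything else
        "utf8": len(text.encode("utf-8")),
    }


for sample in ["temperature", "Ångström", "日本語"]:
    print(sample, encoded_sizes(sample))
```
</pre>
<div>UCS-4 costs four bytes per character regardless of content, while the utf-8 size depends on which characters show up.</div><div><br></div>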
So having UCS-4 as the default, but allowing folks to select a more compact format if they really need it, is a good way to go. Just like numpy generally defaults to float64 and int64 (or 32, depending on platform) -- users can select a smaller size if they have a reason to.</div><div><br></div><div>I guess that's my summary -- just like with numeric values, numpy should default to Python-like behavior as much as possible for strings, too -- with an option for a knowledgeable user to do something more performant.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>I still don't understand why a latin encoding makes sense as a preferred one-byte-per-char dtype. The world, including Python 3, has standardized on UTF-8, which is also one-byte-per-char for (ASCII) scientific data.</div></div></div></div></blockquote><div><br></div><div>utf-8 is NOT a one-byte-per-char encoding. If you want to ensure that your data are one-byte-per-char, then you could use ASCII, and it would be binary compatible with utf-8, but I'm not sure what the point of that is in this context.</div><div><br></div><div>latin-1 or latin-9 buys you (over ASCII):</div><div><br></div><div>- A bunch of accented characters -- sure it only covers the latin languages, but does cover those much better.</div><div><br></div><div>- A handful of other characters, including scientifically useful ones (the micro sign, the degree symbol, etc.).</div><div><br></div><div>- round-tripping of binary data (at least with Python's encoding/decoding) -- ANY string of bytes can be decoded as latin-1 and re-encoded to get the same bytes back. 
You may get garbage, but you won't get a UnicodeDecodeError.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_extra">For Python use -- a pointer to a Python string would be nice.</div></div></div></blockquote><div><br></div></span><div>Yes, absolutely. If we want to be really fancy, we could consider a parametric object dtype that allows for object arrays of *any* homogeneous Python type. Even if NumPy itself doesn't do anything with that information, there are lots of use cases for that information.</div></div></div></div></blockquote><div><br></div><div>hmm -- that's a nifty idea -- though I think strings could/should be special cased.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_extra">Then use a native flexible-encoding dtype for everything else.<br></div></div></div></blockquote><div><br></div></span><div>No opposition here from me. 
Though again, I think utf-8 alone would also be enough.</div></div></div></div></blockquote><div><br></div><div>maybe so -- the major reason for supporting others is binary data exchange with other libraries -- but maybe most of them have gone to utf-8 anyway.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_extra">One more note: if a user tries to assign a value to a numpy string array that doesn't fit, they should get an error:<br></div></div></div></blockquote></span><span class="gmail-"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_extra"><br></div><div class="gmail_extra">EncodingError if it can't be encoded into the defined encoding.</div><div class="gmail_extra"><br></div><div class="gmail_extra">ValueError if it is too long -- it should not be silently truncated.</div></div></div></blockquote><div><br></div></span><div>I think we all agree here.</div></div></div></div>

</blockquote></div><div class="gmail_extra"><br></div>I'm actually having second thoughts -- see above -- if the encoding is utf-8, then truncating is non-trivial -- maybe it would be better for numpy to do it for you. Or set a flag as to which you want?</div><div class="gmail_extra"><br></div><div class="gmail_extra">The current 'S' dtype truncates silently already:</div><div class="gmail_extra"><br></div><font face="monospace, monospace">In [6]: arr</font><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Out[6]: <br>array(['this', 'that'], <br>      dtype='|S4')<br><br></font><div><font face="monospace, monospace">In [7]: arr[0] = "a longer string"<br><br></font></div><div><font face="monospace, monospace">In [8]: arr<br><br></font></div><div><font face="monospace, monospace">Out[8]: <br>array(['a lo', 'that'], <br>      dtype='|S4')</font><div class="gmail_extra"><br></div><div class="gmail_extra">(similarly for the unicode type)<br><br>So at least we are used to that.</div><div class="gmail_extra"><br></div><div class="gmail_extra">BTW -- maybe we should keep the pathological use-case in mind: really short strings. I think we are all thinking in terms of longer strings, maybe a name field, where you might assign 32 bytes or so -- then someone has an accented character in their name, and they get 30 or 31 characters -- no big deal.</div><div class="gmail_extra"><br></div><div class="gmail_extra">But what if you have a simple label or something with one or two characters: then you have 2 bytes to store the name in, and someone tries to put an "odd" character in there, and you get an empty string. Not good.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Also -- if utf-8 is the default -- what do you get when you create an array from a python string sequence? Currently with the 'S' and 'U' dtypes, the dtype size is set to the length of the longest string passed in. Are we going to pad it a bit? 
stick with the exact number of bytes? </div><div class="gmail_extra"><br></div><div class="gmail_extra">It all comes down to this:</div><div class="gmail_extra"><br></div><div class="gmail_extra">Python3 has made a very deliberate (and I think Good) choice to treat text as a string of characters, where the user does not need to know or care about encoding issues. Numpy's defaults should do the same thing.</div><div class="gmail_extra"><br></div><div class="gmail_extra">-CHB</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra"><div><br></div>-- <br><div class="gmail_signature"><br>Christopher Barker, Ph.D.<br>Oceanographer<br><br>Emergency Response Division<br>NOAA/NOS/OR&R            (206) 526-6959   voice<br>7600 Sand Point Way NE   (206) 526-6329   fax<br>Seattle, WA  98115       (206) 526-6317   main reception<br><br><a href="mailto:Chris.Barker@noaa.gov" target="_blank">Chris.Barker@noaa.gov</a></div>

</div></div></div></div>