<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Jan 20, 2014 at 8:00 AM, Aldcroft, Thomas <span dir="ltr"><<a href="mailto:aldcroft@head.cfa.harvard.edu" target="_blank">aldcroft@head.cfa.harvard.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div><div class="h5">On Mon, Jan 20, 2014 at 5:11 AM, Oscar Benjamin <span dir="ltr"><<a href="mailto:oscar.j.benjamin@gmail.com" target="_blank">oscar.j.benjamin@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>On Fri, Jan 17, 2014 at 02:30:19PM -0800, Chris Barker wrote:<br>

> Folks,<br>

><br>

> I've been blathering away on the related threads a lot -- sorry if it's too<br>

> much. It's gotten a bit tangled up, so I thought I'd start a new one to<br>

> address this one question (i.e. don't bring up genfromtxt here):<br>

><br>

> Would it be a good thing for numpy to have a one-byte-per-character string<br>

> type?<br>

<br>

</div>If you mean a string type that can only hold latin-1 characters then I think<br>

that this is a step backwards.<br>

<br>

If you mean a dtype that holds bytes in a known, specifiable encoding and<br>

automatically decodes them to unicode strings when you call .item() and has a<br>

friendly repr() then that may be a good idea.<br>

<br>

So for example you could have dtype='S:utf-8' which would store strings<br>

encoded as utf-8 e.g.:<br>

<br>

>>> text = array(['foo', 'bar'], dtype='S:utf-8')<br>

>>> text<br>

array(['foo', 'bar'], dtype='|S3:utf-8')<br>

>>> print(text)<br>

['foo', 'bar']<br>

>>> text[0]<br>

'foo'<br>

>>> text.nbytes<br>

<div>6<br>
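The 'S:utf-8' dtype above is a proposal, not something numpy provides. A rough approximation with existing calls, assuming the encoding is tracked by hand rather than carried in the dtype, might look like this:<br>

```python
import numpy as np

# Text as it arrives from Python: a unicode ('U') array, 4 bytes per character.
text = np.array(['foo', 'bar'])

# Encode explicitly to UTF-8; the result is a plain bytes ('S') array.
# The encoding name is NOT stored in the dtype, so the caller must remember it.
encoded = np.char.encode(text, 'utf-8')
print(encoded.dtype, encoded.nbytes)

# Decode explicitly, with the same encoding, to get text back.
decoded = np.char.decode(encoded, 'utf-8')
print(decoded[0])
```

The proposed dtype would do the decode step automatically on item access; here it stays a manual step.<br>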

<br>

> We did have that with the 'S' type in py2, but the changes in py3 have made<br>

> it not quite the right thing. And it appears that enough people use 'S' in<br>

> py3 to mean 'bytes', so that we can't change that now.<br>

<br>

</div>It wasn't really the right thing before either. That's why Python 3 has<br>

changed all of this.<br>

<div><br>

> The only difference may be that 'S' currently auto translates to a bytes<br>

> object, resulting in things like:<br>

><br>

> np.array(['some text',],  dtype='S')[0] == 'some text'<br>

><br>

> yielding False on Py3. And you can't do all the usual text stuff with the<br>

> resulting bytes object, either. (and it probably used the default encoding<br>

> to generate the bytes, so will barf on some inputs, though that may be<br>

> unavoidable.) So you need to decode the bytes that are given back, and now<br>

> that I think about it, I have no idea what encoding you'd need to use in<br>

> the general case.<br>

<br>

</div>You should let the user specify the encoding or otherwise require them to use<br>

the 'U' dtype.<br>
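A small sketch of the comparison problem described in the quoted text, run on Python 3 with the existing 'S' dtype. The choice of 'ascii' for decoding is an assumption that only holds because this sample text is pure ASCII:<br>

```python
import numpy as np

arr = np.array(['some text'], dtype='S')
item = arr[0]

# Indexing an 'S' array yields a bytes object, not text.
print(type(item))

# Bytes never compare equal to str on Python 3.
print(item == 'some text')

# The caller has to decode, and has to know which encoding to use.
print(item.decode('ascii') == 'some text')
```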

<div><br>

> So the correct solution is (particularly on py3) to use the 'U' (unicode)<br>

> dtype for text in numpy arrays.<br>

<br>

</div>Absolutely. Embrace the Python 3 text model. Once you understand the how, what<br>

and why of it you'll see that it really is a good thing!<br>

<div><br>

> However, the 'U' dtype is 4 bytes per character, and that may be "too big"<br>

> for some use-cases. And there is a lot of text in scientific data sets that<br>

> are pure ascii, or at least some 1-byte-per-character encoding.<br>

><br>

> So, in the spirit of having multiple numeric types that use different<br>

> amounts of memory, and can hold different ranges of values, a one-byte-per<br>

> character dtype would be nice:<br>

><br>

> (note, this opens the door for a 2-byte per (UCS-2) dtype too, I personally<br>

> don't think that's worth it, but maybe that's because I'm an English<br>

> speaker...)<br>

<br>

</div>You could just use a 2-byte encoding with the S dtype e.g.<br>

dtype='S:utf-16-le'.<br>
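To put numbers on the memory question, here is a sketch comparing the existing 'U' and 'S' dtypes and the size of a 2-byte encoding. The 'S:utf-16-le' dtype is hypothetical, so the last part only illustrates what the current 'S' dtype does with such bytes:<br>

```python
import numpy as np

words = ['foo', 'bar']

as_unicode = np.array(words)           # 'U3': 4 bytes per character
as_bytes = np.array(words, dtype='S')  # 'S3': 1 byte per character
print(as_unicode.nbytes, as_bytes.nbytes)

# A 2-byte encoding sits in between for this text.
utf16 = [w.encode('utf-16-le') for w in words]
print([len(b) for b in utf16])

# Caveat with today's 'S' dtype: trailing null bytes are stripped on access,
# so UTF-16 encoded ASCII text does not survive a round trip unchanged.
stored = np.array(['a'.encode('utf-16-le')])
print(stored[0])
```

An encoding-aware dtype would need to handle that null stripping differently from the current 'S' behaviour.<br>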

<div><br>

> It could use the 's' (lower-case s) type identifier.<br>

><br>

> For passing to/from python built-in objects, it would<br>

><br>

> * Allow either Python bytes objects or Python unicode objects as input<br>

>      a) bytes objects would be passed through as-is<br>

>      b) unicode objects would be encoded as latin-1<br>

><br>

> [note: I'm not entirely sure that bytes objects should be allowed, but it<br>

> would provide a nice efficiency in a fairly common case]<br>

<br>

</div>I think it would be a bad idea to accept bytes here. There are good reasons<br>

that Python 3 creates a barrier between the two worlds of text and bytes.<br>

Allowing implicit mixing of bytes and text is a recipe for mojibake. The<br>

TypeErrors in Python 3 are used to guard against conceptual errors that lead<br>

to data corruption. Attempting to undermine that barrier in numpy would be a<br>

backward step.<br>

<br>

I apologise if this is misplaced but there seems to be an attitude that<br>

scientific programming isn't really affected by the issues that have led to<br>

the Python 3 text model. I think that's ridiculous; data corruption is a<br>

problem in scientific programming just as it is anywhere else.<br>

<div><br>

> * It would create python unicode text objects, decoded as latin-1.<br>

<br>

</div>Don't try to bless a particular encoding and stop trying to pretend that it's<br>

possible to write a sensible system where end users don't need to worry about<br>

and specify the encoding of their data.<br>

<div><br>

> Could we have a way to specify another encoding? I'm not sure how that<br>

> would fit into the dtype system.<br>

<br>

</div>If the encoding cannot be specified then the whole idea is misguided.<br>

<div><br>

> I've explained the latin-1 thing on other threads, but the short version is:<br>

><br>

>  - It will work perfectly for ascii text<br>

>  - It will work perfectly for latin-1 text (natch)<br>

>  - It will never give you a UnicodeEncodeError regardless of what<br>

> arbitrary bytes you pass in.<br>

>  - It will preserve those arbitrary bytes through an encoding/decoding<br>

> operation.<br>

<br>

</div>So what happens if I do:<br>

<br>

>>> with open('myutf-8-file.txt', 'rb') as fin:<br>

...     text = numpy.fromfile(fin, dtype='s')<br>

>>> text[0] # Decodes as latin-1 leading to mojibake.<br>

<br>

I would propose that it's better to be able to do:<br>

<br>

>>> with open('myutf-8-file.txt', 'rb') as fin:<br>

...     text = numpy.fromfile(fin, dtype='s:utf-8')<br>

<br>

There's really no way to get around the fact that users need to specify the<br>

encoding of their text files.<br>
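The mojibake in question can be reproduced in plain Python, without any new dtype. This sketch uses a made-up sample string and shows both sides of the argument: latin-1 never fails and preserves the bytes, yet the decoded text is wrong when the data was really utf-8:<br>

```python
# Bytes as they would sit in a UTF-8 encoded file.
raw = 'café'.encode('utf-8')

# Decoding with the wrong encoding raises no error but gives garbage text.
wrong = raw.decode('latin-1')
print(wrong)

# Decoding with the encoding the data was written in gives the real text.
right = raw.decode('utf-8')
print(right)

# latin-1 does round-trip the original bytes, as claimed in the quoted text.
print(wrong.encode('latin-1') == raw)
```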

<div><br>

> (it still wouldn't allow you to store arbitrary unicode -- but that's the<br>

> limitation of one-byte per character...)<br>

<br>

</div>You could if you use 'utf-8'. It would be one-byte-per-char for text that only<br>

contains ascii characters. However it would still support every character that<br>

the unicode consortium can dream up. </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

The only possible advantage here is as a memory optimisation (potentially<br>

having a speed impact too although it could equally be a speed regression).<br>

Otherwise it just adds needless complexity to numpy and to the code that uses<br>

the new dtype as well as limiting its ability to handle unicode. </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

How significant are the performance issues? Does anyone really use numpy for<br>

this kind of text handling? If you really are operating on gigantic text<br>

arrays of ascii characters then is it so bad to just use the bytes dtype and<br>

handle decoding/encoding at the boundaries? If you're not operating on<br>

gigantic text arrays is there really a noticeable problem just using the 'U'<br>

dtype?<br></blockquote><div><br></div></div></div><div>I use numpy for giga-row arrays of short text strings, so memory and performance issues are real.</div><div><br></div><div>As discussed in the previous parent thread, using the bytes dtype is really a problem because users of a text array want to do things like filtering (`match_rows = text_array == 'match'`), printing, or other manipulations in a natural way without having to continually use bytestring literals or `.decode('ascii')` everywhere.  I tried converting a few packages while leaving the arrays as bytestrings and it just ended up as a very big mess.</div>
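A minimal sketch of the filtering issue with a bytes array on Python 3. The array contents are invented for illustration, and 'ascii' is assumed as the encoding:<br>

```python
import numpy as np

text_array = np.array([b'match', b'other', b'match'])

# With the bytes dtype, the filter needs a bytes literal.
match_rows = text_array == b'match'
print(match_rows)

# The alternative is to decode the whole array first, which allocates a
# 4-bytes-per-character 'U' copy.
as_text = np.char.decode(text_array, 'ascii')
match_rows_text = as_text == 'match'
print(match_rows_text, as_text.nbytes, text_array.nbytes)
```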


<div><br></div><div>From my perspective the goal here is to provide a pragmatic way to allow numpy-based applications and end users to use python 3.  Something like this proposal seems to be the right direction, maybe not pure and perfect but a sensible step to get us there given the reality of scientific computing.</div>

</div></div></div></blockquote><div><br></div><div>I think that is right. Not having an effective way to handle these common scientific data sets will block acceptance of Python 3. But we do need to figure out the best way to add this functionality.<br>

<br></div><div>Chuck  <br></div></div></div></div>