There have been a few threads discussing the problems of how to do text with numpy arrays in Python 3.

To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3. It is intended as an illustration, since I think that the real solution is a new dtype rather than an array subclass (so that it can be used in e.g. record arrays).

The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings.

I believe it would not be as hard to implement as the proposals for variable length string arrays. The one caveat is that it will strip null characters from the end of any string. I'm not 100% sure that the null-byte stripping in the decode step will always work, but it does for all the encodings I know of and it seems to work with all the encodings that Python has.

The code is inline below and attached (in case there are encoding problems with this message!):

Oscar

#!/usr/bin/env python3

from numpy import ndarray, array


class textarray(ndarray):
    '''ndarray for holding encoded text.

    This is for demonstration purposes only. The real proposal
    is to specify the encoding as a dtype rather than a subclass.

    Only works as a 1-d array.

    >>> a = textarray(['qwert', 'zxcvb'], encoding='ascii')
    >>> a
    textarray(['qwert', 'zxcvb'],
     dtype='|S5:ascii')
    >>> a[0]
    'qwert'
    >>> a.tostring()
    b'qwertzxcvb'

    >>> a[0] = 'qwe'  # shorter string
    >>> a[0]
    'qwe'
    >>> a.tostring()
    b'qwe\\x00\\x00zxcvb'

    >>> a[0] = 'qwertyuiop'  # longer string
    Traceback (most recent call last):
        ...
    ValueError: Encoded bytes don't fit

    >>> b = textarray(['Õscar', 'qwe'], encoding='utf-8')
    >>> b
    textarray(['Õscar', 'qwe'],
     dtype='|S6:utf-8')
    >>> b[0]
    'Õscar'
    >>> b[0].encode('utf-8')
    b'\\xc3\\x95scar'
    >>> b.tostring()
    b'\\xc3\\x95scarqwe\\x00\\x00\\x00'

    >>> c = textarray(['qwe'], encoding='utf-32-le')
    >>> c
    textarray(['qwe'],
     dtype='|S12:utf-32-le')
    '''
    def __new__(cls, strings, encoding='utf-8'):
        # Encode the strings up front and store the result in a plain 'S' array.
        bytestrings = [s.encode(encoding) for s in strings]
        a = array(bytestrings, dtype='S').view(textarray)
        a.encoding = encoding
        return a

    def __repr__(self):
        slist = ', '.join(repr(self[n]) for n in range(len(self)))
        return "textarray([%s], \n dtype='|S%d:%s')"\
                % (slist, self.itemsize, self.encoding)

    def __getitem__(self, index):
        bstring = ndarray.__getitem__(self, index)
        return self._decode(bstring)

    def __setitem__(self, index, string):
        bstring = string.encode(self.encoding)
        if len(bstring) > self.itemsize:
            raise ValueError("Encoded bytes don't fit")
        ndarray.__setitem__(self, index, bstring)

    def _decode(self, b):
        # Pad to a multiple of 4 bytes so that 4-byte encodings still decode,
        # then strip any trailing null characters from the decoded string.
        b = b + b'\0' * (4 - len(b) % 4)
        s = b.decode(self.encoding)
        for n, c in enumerate(reversed(s)):
            if c != '\0':
                return s[:len(s)-n]
        return s


if __name__ == "__main__":
    import doctest
    doctest.testmod()
Oscar,

Cool stuff, thanks!

I'm wondering though what the use-case really is. The P3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. And you might choose different encodings based on different needs.

So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.

Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.

If we want a more efficient and compact unicode implementation then the py3 one is a good place to start -- it's pretty slick! Though maybe harder to do in numpy as text in numpy probably wouldn't be immutable.

To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.
this scares me right there -- is it text or bytes??? We really don't want something that is both.
The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings.
I believe it would not be as hard to implement as the proposals for variable length string arrays.
except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters, which requires careful truncation (a pain) and gives surprising results for users ("why can't I fit 10 characters in a length-10 text object? And why can I if they are different characters?")
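To make the mismatch concrete (an illustration added here, not part of the original message, assuming a UTF-8 encoding), the number of bytes needed for ten characters depends on which characters they are:

>>> len('a' * 10), len(('a' * 10).encode('utf-8'))   # ten ascii chars, ten bytes
(10, 10)
>>> len('é' * 10), len(('é' * 10).encode('utf-8'))   # ten chars again, but twenty bytes
(10, 20)

So a fixed byte width is only a fixed character width when the encoding itself is fixed-width.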
The one caveat is that it will strip null characters from the end of any string.
which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
On Fri, Jan 24, 2014 at 5:43 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Oscar,
Cool stuff, thanks!
I'm wondering though what the use-case really is. The P3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. And you might choose different encodings based on different needs.
So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.
In my opinion something like Oscar's class would be very useful (with some adjustments, especially making it easy to create an 'S' view or put an encoding view on top of an 'S' array).

(Disclaimer: my only experience is in converting some examples in statsmodels to bytes on py3 and playing with some examples.)

My guess is that 'S'/bytes is very convenient for library code, because it doesn't care about encodings (assuming we have enough control that all bytes are in the same encoding), and we don't have any overhead to convert to strings when comparing or working with "byte strings". 'S' is also very flexible because it doesn't tie us down to a minimum size for the encoding nor to any specific encoding.

The problem with 'S'/bytes is in input/output and interactive work, as in the examples of Tom Aldcroft. The textarray dtype would allow us to view any 'S' array so we can have text/string interaction with Python and get the correct encoding on input and output. Whether you live in an ascii, latin1, cp1252, iso8859_5 or any other world, you could get your favourite minimal-memory S/bytes/strings.

I think this is useful as a complement to the current 'S' type, and to make it more useful on Python 3, independent of whatever other small-memory unicode dtype with a predefined encoding numpy might get.
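A small interactive sketch of that pain point (added here, not from the thread, and assuming latin-1 encoded bytes in the array):

>>> import numpy as np
>>> s = np.array([b'abc', b'\xd6scar'], dtype='S5')   # latin-1 encoded bytes
>>> s[1]                      # on py3, indexing gives bytes back, not text
b'\xd6scar'
>>> s[1].decode('latin-1')    # so every interactive use needs a manual decode
'Öscar'

An encoding-aware view on top of the same 'S' buffer would do that decode automatically at access time.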
Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.
If we want a more efficient and compact unicode implementation then the py3 one is a good place to start -- it's pretty slick! Though maybe harder to do in numpy as text in numpy probably wouldn't be immutable.
To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.
this scares me right there -- is it text or bytes??? We really don't want something that is both.
Most users won't care about the internal representation of anything. But when we want to, or find it useful, we can view the memory with any compatible dtype. That is, with numpy we always also have the raw "bytes". And there are lots of ways to shoot yourself in the foot -- why would you want to do that?:
>>> import numpy as np
>>> a = np.arange(5)
>>> b = a.view('S4')
>>> b[1] = 'h'
>>> a
array([  0, 104,   2,   3,   4])
>>> a[1] = 'h'
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    a[1] = 'h'
ValueError: invalid literal for int() with base 10: 'h'
The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings.
I believe it would not be as hard to implement as the proposals for variable length string arrays.
except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters, which requires careful truncation (a pain) and gives surprising results for users ("why can't I fit 10 characters in a length-10 text object? And why can I if they are different characters?")
not really different from other places where you have to pay attention to the underlying dtype, and a question of providing the underlying information (like itemsize).

>>> 1 - 1e-20
1.0

I had code like that when I wasn't thinking properly or wasn't paying enough attention to what I was typing.
The one caveat is that it will strip null characters from the end of any string.
which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.
The only place so far where I found that this really hurts is in the decode examples (with utf-32-le for example). That's why I think numpy needs to have decode/encode functions: besides being vectorized, they could access the bytes before they are null-truncated.

BTW: I wanted to start a new thread "in defence of (null truncated) 'S' string bytes", but I ran into too many other issues to work out the examples.

Josef
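A rough sketch of the utf-32 failure mode being referred to (a reconstruction added here; the exact error text varies by numpy/Python version). The vectorized np.char.decode goes through element access, and element access on an 'S' array has already stripped the trailing null bytes:

>>> import numpy as np
>>> b = np.array(['qwe'.encode('utf-32-le')], dtype='S')
>>> b.dtype               # the itemsize still counts the trailing nulls
dtype('S12')
>>> b[0]                  # ...but item access strips them: 9 bytes instead of 12
b'q\x00\x00\x00w\x00\x00\x00e'
>>> np.char.decode(b, 'utf-32-le')    # so decoding the elements fails
Traceback (most recent call last):
    ...
UnicodeDecodeError: ...

Decode/encode functions that saw the raw, untruncated buffer would not have this problem.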
On 24 January 2014 22:43, Chris Barker <chris.barker@noaa.gov> wrote:
Oscar,
Cool stuff, thanks!
I'm wondering though what the use-case really is.
The use-case is precisely the use-case for dtype='S' on Py2 except that it also works on Py3.
The P3 text model (actually the py2 one, too) is quite clear that you want users to think of, and work with, text as text -- and not care how things are encoded in the underlying implementation. You only want the user to think about encodings on I/O -- transferring stuff between systems where you can't avoid it. And you might choose different encodings based on different needs.
Exactly. But what you're missing is that storing text in a numpy array means putting the text into bytes, and then the encoding needs to be specified. My proposal involves explicitly specifying the encoding.

This is the key point about the Python 3 text model: it is not that encoding isn't automatic (e.g. when you print() or call file.write with a text file); the point is that there must never be ambiguity about the encoding that is used when encode/decode occurs.
So why have a different, the-user-needs-to-think-about-encodings numpy dtype? We already have 'U' for full-on unicode support for text. There is a good argument for a more compact internal representation for text compatible with one-byte-per-char encoding, thus the suggestion for such a dtype. But I don't see the need for quite this. Maybe I'm not being a creative enough thinker.
Because users want to store text in a numpy array and use less than 4 bytes per character. You expressed a desire for this. The only difference between this and your latin-1 suggestion is that this one has an explicit encoding that is visible to the user and that you can choose that encoding to be anything that your Python installation supports.
Also, we may want numpy to interact at a low level with other libs that might have binary encoded text (HDF, etc) -- in which case we need a bytes dtype that can store that data, and perhaps encoding and decoding ufuncs.
Perhaps there is a need for a bytes dtype as well. But note that you can use textarray with encoding='ascii' to satisfy many of these use cases. So h5py and pytables could expose an interface that stores text as bytes but has a clearly labelled (and enforced) encoding.
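For instance, with the textarray class from the first post (an added example; the behaviour follows from its __new__ and __setitem__ as posted, error details elided):

>>> a = textarray(['abc', 'def'], encoding='ascii')
>>> a.tostring()          # the raw buffer that e.g. an HDF5 writer would see
b'abcdef'
>>> a[0] = 'Õscar'        # non-ascii text is rejected at assignment time
Traceback (most recent call last):
    ...
UnicodeEncodeError: ...

so the file format gets plain ascii bytes and the user gets an error instead of mojibake.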
If we want a more efficient and compact unicode implementation then the py3 one is a good place to start -- it's pretty slick! Though maybe harder to do in numpy as text in numpy probably wouldn't be immutable.
It's not a good fit for numpy because numpy arrays expose their memory buffer. More on this below, but if there were to be something as drastic as the FSR (CPython's flexible string representation) then it would be better to think about how to make an ndarray type that is completely different, has an opaque memory buffer and can handle arbitrary length text strings.
To make a slightly more concrete proposal, I've implemented a pure Python ndarray subclass that I believe can consistently handle text/bytes in Python 3.
this scares me right there -- is it text or bytes??? We really don't want something that is both.
I believe that there is a conceptual misunderstanding about what a numpy array is here. A numpy array is a clever view onto a memory buffer. A numpy array always has two interfaces: one that describes a memory buffer and one that delivers Python objects representing the abstract quantities described by each portion of the memory buffer. The dtype specifies three things:

1) How many bytes of the buffer are used.
2) What kind of abstract object this part of the buffer represents.
3) The mapping from the bytes in this segment of the buffer to the abstract object.

As an example:
>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype='<u4')
>>> a
array([1, 2, 3], dtype=uint32)
>>> a.tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
So what is this array? Is it bytes or is it integers? It is both. The array is a view onto a memory buffer and the dtype is the encoding that describes the meaning of the bytes in different segments. In this case the dtype is '<u4'. This tells us that we need 4 bytes per segment, that each segment represents an integer, and that the mapping from byte segments to integers is the unsigned little-endian mapping.

How can we do the same thing with text? We need a way to map text to fixed-width bytes. Mapping text to bytes is done with text encodings. So we need a dtype that incorporates a text encoding in order to define the relationship between the bytes in the array's memory buffer and the abstract entity that is a sequence of unicode characters. Using dtype='U' doesn't get around this:
>>> a = np.array(['qwe'], dtype='U')
>>> a
array(['qwe'], dtype='<U3')
>>> a[0]  # text
'qwe'
>>> a.tostring()  # bytes
b'q\x00\x00\x00w\x00\x00\x00e\x00\x00\x00'
In my proposal you'd get the same by using 'utf-32-le' as the encoding for your text array.
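A quick check of that claim (added here, reusing the textarray class from the first post and assuming a little-endian machine so that '<U' and 'utf-32-le' agree):

>>> u = np.array(['qwe'], dtype='U')
>>> t = textarray(['qwe'], encoding='utf-32-le')
>>> u.tostring() == t.tostring()     # same bytes in the underlying buffer
True
>>> u[0] == t[0] == 'qwe'            # same text at the element level
True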
The idea is that the array has an encoding. It stores strings as bytes. The bytes are encoded/decoded on insertion/access. Methods accessing the binary content of the array will see the encoded bytes. Methods accessing the elements of the array will see unicode strings.
I believe it would not be as hard to implement as the proposals for variable length string arrays.
except that with some encodings, the number of bytes required is a function of what the content of the text is -- so it either has to be variable length, or a fixed number of bytes, which is not a fixed number of characters, which requires careful truncation (a pain) and gives surprising results for users ("why can't I fit 10 characters in a length-10 text object? And why can I if they are different characters?")
It should be a fixed number of bytes. It does mean that 10 characters might not fit into a 10-byte text portion but there's no way around that if it is a fixed length and the encoding is variable-width. I don't really think that this is much of a problem though. Most use cases are probably going to use 'ascii' anyway. The improvement those use-cases get is error detection for non-ascii characters and explicitly labelled encodings, rather than mojibake.
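Using the textarray class from the first post (an added example; the ValueError comes from its __setitem__ as posted), the failure is at least explicit rather than silent:

>>> d = textarray(['0123456789'], encoding='utf-8')   # ten ascii chars -> a 10-byte slot
>>> d[0] = 'é' * 10      # ten characters, but twenty bytes in utf-8
Traceback (most recent call last):
    ...
ValueError: Encoded bytes don't fit

which surfaces the problem up front instead of silently truncating or corrupting the text.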
The one caveat is that it will strip null characters from the end of any string.
which is fatal, but you do want a new dtype after all, which presumably wouldn't do that.
Why is that fatal for text (not arbitrary byte strings)? There are many other reasons (relating to other programming languages and software) why you can't usually put null characters into text anyway.

I don't really see how to get around this if the bytes must go into fixed-width portions without an out-of-band way to specify the length of the string.

Oscar
participants (3)
- Chris Barker
- josef.pktd@gmail.com
- Oscar Benjamin