On Mon, Jan 20, 2014 at 04:12:20PM -0700, Charles R Harris wrote:
On Mon, Jan 20, 2014 at 3:35 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Jan 20, 2014 at 10:28 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
I didn't say we should change the S type, but that we should have something, say 's', that appeared to Python as a string. I think if we want transparent string interoperability with Python together with a compressed representation, and I think we need both, we are going to have to deal with the difficulties of utf-8. That means raising errors if the string doesn't fit in the allotted size, etc. Mind, this is a workaround for the mass of ascii data that is already out there, not a substitute for 'U'.
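A minimal sketch of the error behaviour such an 's' dtype would need, using today's 'S' (bytes) dtype as the fixed-width container; `pack_utf8` and its parameters are hypothetical names for illustration, not an existing NumPy API:

```python
import numpy as np

def pack_utf8(strings, itemsize):
    """Pack Python strings into a fixed-width utf-8 byte array,
    raising if an encoded string does not fit the allotted size --
    the behaviour a hypothetical 's' dtype would have to adopt."""
    out = np.zeros(len(strings), dtype=f'S{itemsize}')
    for i, s in enumerate(strings):
        encoded = s.encode('utf-8')
        if len(encoded) > itemsize:
            raise ValueError(
                f"{s!r} needs {len(encoded)} bytes, itemsize is {itemsize}")
        out[i] = encoded
    return out

packed = pack_utf8(['ascii', 'caf\u00e9'], itemsize=8)
```

Note that because utf-8 is variable-width, a 5-character string like 'café' can need more than 5 bytes, which is exactly why the fit check (and the resulting errors) cannot be avoided.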
If we're going to be taking that much trouble, I'd suggest going ahead and adding a variable-length string type (where the array itself contains a pointer to a lookaside buffer, maybe with an optimization for stashing short strings directly). The fixed-length requirement is pretty onerous for lots of applications (e.g., pandas always uses dtype="O" for strings -- and that might be a good workaround for some people in this thread for now). The use of a lookaside buffer would also make it practical to resize the buffer when the maximum code point changed, for that matter...
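The dtype="O" workaround mentioned above already works today: each cell holds a reference to an ordinary Python str, so lengths can vary freely, at the cost of the per-object overhead discussed later in the thread. A minimal illustration:

```python
import numpy as np

# Variable-length strings via an object array: every element is a
# pointer to a regular Python str living on the heap.
arr = np.array(['a', 'bb', 'a much longer string'], dtype=object)
arr[0] = arr[0] * 100  # grows freely; no fixed itemsize to overflow
```
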
On Mon, Jan 20, 2014 at 3:58 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The more I think about it, the more I think we may need to do that. Note that dynd has ragged arrays and I think they are implemented as pointers to buffers. The easy way for us to do that would be a specialization of object arrays to string types only, as you suggest.
This wouldn't necessarily help for the gigarows of short text strings use case (depending on what "short" means). Also, even if it technically saves memory, you may have a greater overhead from fragmenting your array all over the heap.

On my 64-bit Linux system the size of a Python 3.3 str containing only ASCII characters is 49+N bytes. For the 'U' dtype it's 4N bytes. You get a memory saving over dtype='U' only if the strings are 17 characters or more. To get a 50% saving over dtype='U' you'd need strings of at least 49 characters.

If the Numpy array would manage the buffers itself then that per-string memory overhead would be eliminated in exchange for an 8-byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough - null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save memory if the strings are more than 3 characters long, and you get at least a 50% saving for strings longer than 9 characters.

Using utf-8 in the buffers eliminates the need to go around checking maximum code points etc., so I would guess that would be simpler to implement (CPython has now had to triple all of its code paths that actually access the string buffer).

Oscar
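The break-even lengths quoted above can be checked mechanically. A small sketch, assuming only the per-string overheads stated in the message (49 bytes for a 64-bit CPython ASCII str object, 9 bytes for a pointer-plus-length scheme) against dtype='U''s 4 bytes per character; `breakeven` is an illustrative helper, not a NumPy function:

```python
def breakeven(overhead, per_char=1, u_per_char=4):
    """Smallest length N at which overhead + per_char*N is no more
    than the u_per_char*N bytes that dtype='U' spends per string."""
    n = 1
    while overhead + per_char * n > u_per_char * n:
        n += 1
    return n

# CPython str objects: 49 + N bytes each (64-bit, ASCII-only).
assert breakeven(49) == 17                 # saving starts at 17 chars
assert breakeven(49, u_per_char=2) == 49   # 50% saving needs 49 chars

# Array-managed buffers: 8-byte pointer + 1 length byte = 9 bytes.
assert breakeven(9) == 3                   # ties at 3; strict saving from 4
assert breakeven(9, u_per_char=2) == 9     # 50% saving from 9 chars
```
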