On Tue, Jan 21, 2014 at 11:41:30AM +0000, Nathaniel Smith wrote:
On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
If the NumPy array managed the buffers itself, then that per-string memory overhead would be eliminated in exchange for an 8-byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough, since null bytes cannot be used as terminators). This gives an overhead of 9 bytes per string (or 5 on 32-bit). In this case you save memory if the strings are more than 3 characters long, and you get at least a 50% saving for strings longer than 9 characters.
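The break-even figures can be checked with a little arithmetic. This is only a sketch under assumptions that are not stated in the message: the baseline is taken to be a fixed-width representation using 4 bytes per character, and the managed buffer is taken to store 1 byte per character plus the pointer and a 1-byte length. The function names are made up for illustration.

```python
# Assumption: baseline is a fixed-width layout at 4 bytes per character.
# Assumption: the managed layout stores 1 byte per character, plus a
# pointer (8 bytes on 64-bit, 4 on 32-bit) and a 1-byte length prefix.
LENGTH_BYTES = 1


def fixed_width_bytes(n_chars, bytes_per_char=4):
    """Bytes used by one string in the assumed fixed-width baseline."""
    return n_chars * bytes_per_char


def managed_bytes(n_chars, pointer_bytes=8):
    """Bytes used by one string in the assumed pointer-plus-length layout."""
    return n_chars + pointer_bytes + LENGTH_BYTES


for n in (3, 4, 9, 20):
    print(n, fixed_width_bytes(n), managed_bytes(n))
```

Under these assumptions the two layouts tie at 3 characters, the managed layout wins from 4 characters on, and it uses exactly half the memory at 9 characters.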
There are various optimisations possible as well.
For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely. (Alignment rules mean that pointers to allocated buffers always have the low bits zero; so you can make a rule that if the low bit is set to one, then this means the "pointer" itself should be interpreted as containing the string data; use the spare bit in the other bytes to encode the length.)
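A rough sketch of the tagged-pointer idea, modelling the 64-bit word as a Python integer. The bit layout here is an assumption chosen for simplicity, not the only possible one: bit 0 is the tag, bits 1-4 hold the length, and the eight 7-bit ASCII characters are packed into the upper 56 bits (rather than keeping one character per byte and collecting the spare high bits). All names are hypothetical.

```python
TAG_BIT = 1          # low bit set => the word holds string data, not an address
MAX_INLINE = 8       # eight 7-bit ASCII characters fit in the upper 56 bits
DATA_SHIFT = 8       # assumed layout: character data starts at bit 8


def pack_inline(s):
    """Pack an ASCII string of up to 8 characters into one 64-bit word."""
    data = s.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    if len(data) > MAX_INLINE:
        raise ValueError("string too long to store inline")
    word = TAG_BIT | (len(data) << 1)  # tag in bit 0, length in bits 1-4
    for i, byte in enumerate(data):
        word |= byte << (DATA_SHIFT + 7 * i)  # 7 bits per character
    return word


def is_inline(word):
    """True if the word is tagged as inline data rather than a real pointer."""
    return bool(word & TAG_BIT)


def unpack_inline(word):
    """Recover the string from a word produced by pack_inline."""
    length = (word >> 1) & 0xF
    chars = bytes((word >> (DATA_SHIFT + 7 * i)) & 0x7F for i in range(length))
    return chars.decode("ascii")
```

Because the length is stored explicitly, embedded null characters survive the round trip, and an aligned address (low bit zero) is never mistaken for inline data.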
In some cases it may also make sense to let identical strings share buffers, though this adds some overhead for reference counting and interning.
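As an illustration of where the reference-counting overhead comes from, here is a minimal sketch of an interning pool. The `InternPool` class and its methods are hypothetical; a real implementation would live in C inside the array machinery.

```python
class InternPool:
    """Toy pool that lets identical strings share a single buffer."""

    def __init__(self):
        # Maps buffer contents to a [shared_buffer, reference_count] pair.
        self._entries = {}

    def acquire(self, data):
        """Return the shared buffer for data, creating it on first use."""
        entry = self._entries.get(data)
        if entry is None:
            entry = [data, 0]
            self._entries[data] = entry
        entry[1] += 1
        return entry[0]

    def release(self, data):
        """Drop one reference; free the buffer when none remain."""
        entry = self._entries[data]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[data]

    def refcount(self, data):
        """Number of live references to data (0 if not interned)."""
        entry = self._entries.get(data)
        return entry[1] if entry is not None else 0
```

Every element assignment then costs a hash lookup plus a counter update, which is the overhead being traded against the memory saved on repeated values.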
Would this new dtype have an opaque memory representation? What would happen in the following:
    >>> a = numpy.array(['CGA', 'GAT'], dtype='s')

    >>> memoryview(a)

    >>> with open('file', 'wb') as fout:
    ...     a.tofile(fout)

    >>> with open('file', 'rb') as fin:
    ...     a = numpy.fromfile(fin, dtype='s')
Should there be a different function for creating such an array from reading a text file? Or would you just need to use fromiter:
    >>> with open('file', encoding='utf-8') as fin:
    ...     a = numpy.fromiter(fin, dtype='s')

    >>> with open('file', 'w', encoding='utf-8') as fout:
    ...     fout.writelines(line + '\n' for line in a)
(Note that the above would not be reversible if the strings contain newlines.)

I think it would be less confusing to use dtype='u' than dtype='U' in order to signify that it is an optimised form of the 'U' dtype as far as access from Python code is concerned. Calling it 's' only really makes sense if there is a plan to deprecate dtype='S'.

How would it behave in Python 2? Would it return unicode strings there as well?

Oscar