
On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
If the Numpy array would manage the buffers itself then that per string memory overhead would be eliminated in exchange for an 8 byte pointer and at least 1 byte to represent the length of the string (assuming you can somehow use Pascal strings when short enough - null bytes cannot be used). This gives an overhead of 9 bytes per string (or 5 on 32 bit). In this case you save memory if the strings are more than 3 characters long and you get at least a 50% saving for strings longer than 9 characters.
There are various optimisations possible as well. For ASCII strings of up to length 8, one could also use tagged pointers to eliminate the lookaside buffer entirely. (Alignment rules mean that pointers to allocated buffers always have the low bits zero; so you can make a rule that if the low bit is set to one, then this means the "pointer" itself should be interpreted as containing the string data; use the spare bit in the other bytes to encode the length.) In some cases it may also make sense to let identical strings share buffers, though this adds some overhead for reference counting and interning. -n