Re: [Numpy-discussion] A one-byte string dtype?

Jan. 21, 2014

On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin@gmail.com> wrote:
...
If the Numpy array would manage the buffers itself then that per string memory
overhead would be eliminated in exchange for an 8 byte pointer and at least 1
byte to represent the length of the string (assuming you can somehow use Pascal
strings when short enough - null bytes cannot be used). This gives an overhead
of 9 bytes per string (or 5 on 32 bit). In this case you save memory if the
strings are more than 3 characters long and you get at least a 50% saving for
strings longer than 9 characters.
There are various optimisations possible as well.
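
The break-even figures quoted above can be checked with a few lines of Python. This is only a sketch: the quoted text does not state the baseline, so the code assumes it is a fixed-width layout at 4 bytes per character, and that the array-managed buffer stores 1 byte per character. The names below are made up for illustration.

```python
# Assumption (not stated in the quoted text): the baseline is a fixed-width
# layout that spends 4 bytes on every character.
BYTES_PER_CHAR_FIXED = 4
# From the quoted text: 8-byte pointer + 1 length byte on a 64-bit build.
OVERHEAD_64 = 8 + 1


def fixed_width_bytes(n: int) -> int:
    """Bytes used by the assumed fixed-width baseline for n characters."""
    return BYTES_PER_CHAR_FIXED * n


def managed_bytes(n: int, overhead: int = OVERHEAD_64) -> int:
    """Bytes used when the array manages a 1-byte-per-character buffer."""
    return n + overhead


# Shortest string for which the managed layout is strictly smaller.
break_even = min(n for n in range(1, 100)
                 if managed_bytes(n) < fixed_width_bytes(n))

# Shortest string for which the managed layout uses at most half the memory.
half_saving = min(n for n in range(1, 100)
                  if 2 * managed_bytes(n) <= fixed_width_bytes(n))

print(break_even, half_saving)  # prints: 4 9
```

Under those assumptions the managed layout wins from 4 characters upward ("more than 3") and reaches a 50% saving at 9 characters.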

For ASCII strings of up to length 8, one could also use tagged pointers to
eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have the low bits zero; so you can
make a rule that if the low bit is set to one, then this means the
"pointer" itself should be interpreted as containing the string data; use
the spare bit in the other bytes to encode the length.)
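
A minimal sketch of that tagged-pointer idea, modelling the 8-byte slot as a Python integer. The exact bit layout is an assumption made for illustration: byte 0 holds the first character shifted left by one with the tag in bit 0, bytes 1 to 7 hold one 7-bit character each, and the spare high bits of bytes 1 to 7 together hold the length. The function names are hypothetical, not part of NumPy.

```python
TAG_BIT = 1      # low bit set: the word is inline string data, not an address
MAX_INLINE = 8   # eight 7-bit ASCII characters fit in one 64-bit word


def pack_inline(s: str) -> int:
    """Pack an ASCII string of up to 8 characters into a tagged 64-bit word."""
    data = s.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    if len(data) > MAX_INLINE:
        raise ValueError("string too long to store inline")
    word = TAG_BIT
    for i, c in enumerate(data):
        if i == 0:
            word |= c << 1        # byte 0: bits 1-7 hold the char, bit 0 the tag
        else:
            word |= c << (8 * i)  # byte i: bits 0-6 hold the char, bit 7 is spare
    # Store the length (0..8) in the spare high bits of bytes 1..7.
    for k in range(7):
        if (len(data) >> k) & 1:
            word |= 1 << (8 * (k + 1) + 7)
    return word


def is_inline(word: int) -> bool:
    """True if the word carries string data rather than a buffer address."""
    return bool(word & TAG_BIT)


def unpack_inline(word: int) -> str:
    """Recover the string from a word produced by pack_inline."""
    if not is_inline(word):
        raise ValueError("word is an ordinary pointer, not inline data")
    length = 0
    for k in range(7):
        length |= ((word >> (8 * (k + 1) + 7)) & 1) << k
    chars = []
    for i in range(length):
        shift = 1 if i == 0 else 8 * i
        chars.append((word >> shift) & 0x7F)
    return bytes(chars).decode("ascii")


word = pack_inline("numpy")
print(is_inline(word), unpack_inline(word))  # prints: True numpy
```

Because the length is stored explicitly, this layout does not need a terminator byte. A real implementation would do the same bit operations in C on a pointer-sized integer.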

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.
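
A rough sketch of what buffer sharing with reference counting could look like. `InternPool` and its methods are hypothetical names for illustration, not an existing NumPy API, and a real implementation would live in C next to the array's buffer management.

```python
class InternPool:
    """Toy interning table: identical strings share one buffer."""

    def __init__(self):
        # Maps string contents to a [shared buffer, reference count] pair.
        self._entries = {}

    def acquire(self, data: bytes) -> bytes:
        """Return the shared buffer for data, creating it on first use."""
        entry = self._entries.get(data)
        if entry is None:
            entry = [data, 0]
            self._entries[data] = entry
        entry[1] += 1
        return entry[0]

    def release(self, data: bytes) -> None:
        """Drop one reference; free the buffer when none remain."""
        entry = self._entries[data]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[data]

    def refcount(self, data: bytes) -> int:
        """Number of live references to data (0 if not interned)."""
        entry = self._entries.get(data)
        return entry[1] if entry is not None else 0


pool = InternPool()
a = pool.acquire(b"spam")
b = pool.acquire(bytes(bytearray(b"spam")))  # equal contents, separate object
print(a is b, pool.refcount(b"spam"))  # prints: True 2
```

The dictionary lookup on every store and the counter update on every release are the overhead mentioned above; whether that pays off depends on how often values repeat.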

-n

Nathaniel Smith