[Numpy-discussion] A one-byte string dtype?

Nathaniel Smith njs at pobox.com
Tue Jan 21 06:41:30 EST 2014


On 21 Jan 2014 11:13, "Oscar Benjamin" <oscar.j.benjamin at gmail.com> wrote:
> If the Numpy array would manage the buffers itself then that per string
> memory overhead would be eliminated in exchange for an 8 byte pointer and
> at least 1 byte to represent the length of the string (assuming you can
> somehow use Pascal strings when short enough - null bytes cannot be used).
> This gives an overhead of 9 bytes per string (or 5 on 32 bit). In this case
> you save memory if the strings are more than 3 characters long and you get
> at least a 50% saving for strings longer than 9 characters.

There are various optimisations possible as well.

For ASCII strings of up to 8 characters, one could also use tagged pointers
to eliminate the lookaside buffer entirely. (Alignment rules mean that
pointers to allocated buffers always have their low bits set to zero, so you
can make a rule that if the low bit is set to one, then the "pointer" itself
should be interpreted as containing the string data; since ASCII needs only
7 bits per character, the spare bits in the other bytes can encode the
length.)
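
A rough sketch of the tagged-pointer idea, modelling the 8-byte pointer slot
as a Python integer. The names (pack_inline, unpack_inline, is_inline) and
the exact bit layout are made up for illustration: bit 0 is the tag, bits
1-4 hold the length, and each character takes 7 bits after that. This is a
simplification of the per-byte layout described above, not anything NumPy
implements.

```python
TAG_BIT = 1        # low bit set: the word holds string data, not an address
LEN_SHIFT = 1      # bits 1-4 hold the length (0..8), an assumed layout
LEN_MASK = 0xF
CHAR_SHIFT = 5     # characters start at bit 5, 7 bits each
MAX_INLINE = 8     # 5 + 8 * 7 = 61 bits, which fits in a 64-bit word


def pack_inline(s: str) -> int:
    """Pack an ASCII string of up to 8 characters into one 64-bit word."""
    data = s.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    if len(data) > MAX_INLINE:
        raise ValueError("string too long to store inline")
    word = TAG_BIT | (len(data) << LEN_SHIFT)
    for i, byte in enumerate(data):
        word |= byte << (CHAR_SHIFT + 7 * i)
    return word


def is_inline(word: int) -> bool:
    """Aligned buffer addresses have a zero low bit, so the tag is unambiguous."""
    return bool(word & TAG_BIT)


def unpack_inline(word: int) -> str:
    """Recover the string from a word produced by pack_inline."""
    if not is_inline(word):
        raise ValueError("word is a real pointer, not inline string data")
    n = (word >> LEN_SHIFT) & LEN_MASK
    return "".join(
        chr((word >> (CHAR_SHIFT + 7 * i)) & 0x7F) for i in range(n)
    )
```

Because the length is stored explicitly, embedded null characters survive
the round trip, which a null-terminated encoding could not offer.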

In some cases it may also make sense to let identical strings share
buffers, though this adds some overhead for reference counting and
interning.
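
To make the bookkeeping cost concrete, here is a minimal sketch of a
reference-counted intern pool. StringPool and its methods are hypothetical
names for illustration only; an array would store the returned slot index
in place of a per-element buffer pointer.

```python
class StringPool:
    """Toy intern pool: identical byte strings share one slot."""

    def __init__(self):
        self._slot_of = {}   # bytes -> slot index (the interning table)
        self._data = []      # slot index -> bytes, or None if the slot is free
        self._refs = []      # slot index -> reference count
        self._free = []      # slot indices available for reuse

    def acquire(self, value: bytes) -> int:
        """Return the slot holding `value`, creating it if necessary."""
        slot = self._slot_of.get(value)
        if slot is None:
            if self._free:
                slot = self._free.pop()
                self._data[slot] = value
                self._refs[slot] = 0
            else:
                slot = len(self._data)
                self._data.append(value)
                self._refs.append(0)
            self._slot_of[value] = slot
        self._refs[slot] += 1
        return slot

    def release(self, slot: int) -> None:
        """Drop one reference; free the buffer when no references remain."""
        self._refs[slot] -= 1
        if self._refs[slot] == 0:
            del self._slot_of[self._data[slot]]
            self._data[slot] = None
            self._free.append(slot)

    def get(self, slot: int) -> bytes:
        return self._data[slot]

    def unique_count(self) -> int:
        return len(self._slot_of)
```

Every store into the array then costs a hash lookup plus a refcount update,
which is the overhead mentioned above; whether it pays off depends on how
often values repeat.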

-n