[Python-Dev] Internal representation of strings and Micropython

Chris Angelico rosuav at gmail.com
Wed Jun 4 09:06:25 CEST 2014


On Wed, Jun 4, 2014 at 5:02 PM,  <martin at v.loewis.de> wrote:
> There are more things to consider for the internal implementation,
> in particular how the string length is implemented. Several alternatives
> exist:
> 1. store the UTF-8 length (i.e. memory size)
> 2. store the number of code points (i.e. Python len())
> 3. store both
> 4. store neither, but use null termination instead
>
> Variant 3 is most run-time efficient, but could easily use 8 bytes
> just for the length, which could outweigh the storage of the actual
> data. Variants 1 and 2 lose on some operations (1 loses on computing
> len(), 2 loses on string concatenation). 3 would add the restriction
> of not allowing U+0000 in a string (which would be reasonable IMO),
> and make all length computations inefficient. However, it wouldn't
> be worse than standard C.

The current implementation stores a 16-bit length, which is both the
memory size and the len(). As far as I can see, the memory size is
never needed, so I'd just go for option 2; string concatenation is
already known to be one of those operations that can be slow if you do
it badly, and an optimized str.join() would cover the recommended
use-case.

ChrisA


More information about the Python-Dev mailing list