[Python-3000] string C API

Sat Sep 16 08:32:37 CEST 2006

Nick Coghlan wrote:
> That way the internal representation of a string would only need to grow
> one extra field (the one saying how many bytes there are per character),
> and the internal state would remain immutable.

You could play tricks with ob_size to save this field:

- ob_size < 0: 8-bit data; length is abs(ob_size)
- ob_size > 0, (ob_size & 1)==0: 16-bit data, length is ob_size/2
- ob_size > 0, (ob_size & 1)==1: 32-bit data, length is ob_size/2 (rounded down)
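
To make the encoding above concrete, here is a minimal sketch in Python
(not C, just to keep it short). The names pack_ob_size and unpack_ob_size
are made up for illustration, and widths are given in bytes per character
(1, 2 or 4). The list does not say what ob_size == 0 means, so the sketch
simply rejects empty strings; that is an assumption, not part of the
proposal.

```python
def pack_ob_size(width, length):
    """Fold character width (bytes per character) and length into one signed int."""
    if length <= 0:
        # Assumption: the scheme as listed assigns no meaning to ob_size == 0.
        raise ValueError("empty strings are not covered by this sketch")
    if width == 1:
        return -length            # negative: 8-bit data
    if width == 2:
        return 2 * length         # positive and even: 16-bit data
    if width == 4:
        return 2 * length + 1     # positive and odd: 32-bit data
    raise ValueError("width must be 1, 2 or 4")


def unpack_ob_size(ob_size):
    """Recover (width, length) from the combined field."""
    if ob_size < 0:
        return 1, -ob_size
    if ob_size > 0:
        width = 4 if ob_size & 1 else 2
        return width, ob_size // 2   # integer division drops the tag bit
    raise ValueError("ob_size == 0 is not assigned a meaning here")
```

For example, a 5-character string would be stored as -5, 10 or 11
depending on whether it holds 8-bit, 16-bit or 32-bit data.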

The first representation constrains the length of an 8-bit
string to max_ssize_t, which is also the limit today.
For 16-bit strings, the limit is max_ssize_t/2 characters, which
means max_ssize_t bytes; this is technically more constraining, but
such a string would still consume half of the address space,
and is unlikely to get created (*). For 32-bit strings, the
limit is also max_ssize_t/2 characters, yet the maximum string
would require about 2*max_ssize_t (roughly max_size_t) bytes,
i.e. essentially the whole address space, so this isn't a real
limitation.
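
The arithmetic behind these limits can be checked with a few lines of
Python. This is only a sketch: max_lengths is a made-up helper, and it
assumes ssize_t and size_t have the same width as a pointer.

```python
def max_lengths(pointer_bits=32):
    """Return (max_ssize_t, max_size_t, limits) for the ob_size scheme.

    limits maps bytes-per-character to (max characters, bytes needed).
    """
    max_ssize_t = 2 ** (pointer_bits - 1) - 1
    max_size_t = 2 ** pointer_bits - 1
    limits = {}
    for width in (1, 2, 4):
        # 8-bit data uses the full magnitude of ob_size; the other two
        # widths give up one bit to the 16-bit/32-bit tag.
        max_chars = max_ssize_t if width == 1 else max_ssize_t // 2
        limits[width] = (max_chars, max_chars * width)
    return max_ssize_t, max_size_t, limits
```

With pointer_bits=32, the longest 16-bit string needs 2147483646 bytes
(about half the address space), and the longest 32-bit string needs
4294967292 bytes, which is within a few bytes of max_size_t.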

> For 8-bit source data, 'latin-1' would then be the most efficient
> encoding, in that it would be a simple memcpy from the bytes object's
> internal buffer to the string object's internal buffer. Other encodings
> like 'koi8-r' would be decoded to either latin-1, UCS-2 or UCS-4
> depending on the largest code point in the source data.

This might slow down codecs somewhat: they would have to scan the input
string first to find out what the maximum code point is, whereas they
can currently decode in a single pass. Of course, for multi-byte codecs,
such scanning is a good idea anyway (some currently overallocate just
to avoid the second pass).
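
A rough Python sketch of what the extra pass amounts to, under some
assumptions: choose_width and decode_two_pass are made-up names, the
built-in bytes.decode stands in for the codec's own decoding step, and
the output buffer is little-endian purely for illustration. A real codec
in C would scan the raw input rather than build an intermediate list.

```python
def choose_width(code_points):
    """Pick the narrowest representation (bytes per character) that fits."""
    largest = max(code_points, default=0)
    if largest < 0x100:
        return 1          # fits in latin-1
    if largest < 0x10000:
        return 2          # fits in UCS-2
    return 4              # needs UCS-4


def decode_two_pass(data, encoding):
    """Decode bytes, then store the result at the narrowest fitting width."""
    # Pass 1: decode and find the largest code point.
    code_points = [ord(ch) for ch in data.decode(encoding)]
    width = choose_width(code_points)
    # Pass 2: write every character at the chosen width.
    buf = b"".join(cp.to_bytes(width, "little") for cp in code_points)
    return width, buf
```

For pure latin-1 input the second pass degenerates to a copy of the
input bytes, which matches the memcpy case in the quoted text; a koi8-r
input containing Cyrillic letters ends up at 2 bytes per character.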

Regards,
Martin

(*) Many systems don't allow such large memory blocks anyway.
E.g. on 32-bit Windows, in the standard configuration, the
address space is "only" 2GB.