[Python-3000] string C API

Fri Sep 15 19:46:52 CEST 2006

"Jim Jewett" <jimjjewett at gmail.com> wrote:
> Interning may get awkward if multiple encodings are allowed within a
> program, regardless of whether they're allowed for single strings.  It
> might make sense to intern only strings that are in the same encoding
> as the source code.  (Or whose values are limited to ASCII?)

Why?  If the text hash function is defined on *code points*, then
interning, or really any arbitrary dictionary lookup, works the same as
it always has.
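
To make that concrete, here is a rough sketch in Python rather than C.
Everything in it is an assumption for illustration: the Text class, the
intern_text helper, the use of array typecodes to stand in for storage
widths, and the FNV-1a-style hash are hypothetical, not a proposed
implementation.

```python
from array import array

class Text:
    """Hypothetical text object: code points held in a fixed-width array."""

    def __init__(self, s, typecode):
        # typecode picks the storage width: "B" is 1 byte per code point,
        # "L" is at least 4 bytes per code point.
        self.units = array(typecode, map(ord, s))

    def __hash__(self):
        # FNV-1a-style mixing over code point values, never over raw bytes,
        # so the storage width cannot influence the result.
        h = 0xCBF29CE484222325
        for cp in self.units:
            h = ((h ^ cp) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
        return h

    def __eq__(self, other):
        # Equality is also defined on code points, not on the buffer bytes.
        return isinstance(other, Text) and list(self.units) == list(other.units)

interned = {}

def intern_text(t):
    # Ordinary dictionary lookup; returns the first equal object seen.
    return interned.setdefault(t, t)

narrow = Text("spam", "B")
wide = Text("spam", "L")
first = intern_text(narrow)
second = intern_text(wide)
```

Both calls hand back the same object even though the two buffers differ
in width, because neither the hash nor the comparison ever looks at the
underlying bytes.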

> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
method I previously described is that each of the encodings offers a
minimal representation of the code points that the text object contains.
Certain operations would require a bit of work to handle the comparison
of code points stored in an x-bit-wide representation with code points
stored in a y-bit-wide representation.
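
A rough sketch of the idea, again in Python rather than C.  The function
names minimal_store and compare, and the mapping of the three widths onto
array typecodes, are my own assumptions for illustration only.

```python
from array import array

def minimal_store(text):
    """Pick the narrowest of the three widths that holds every code point."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        kind, typecode = "latin-1", "B"   # 1 byte per code point
    elif top < 0x10000:
        kind, typecode = "ucs-2", "H"     # 2 bytes per code point
    else:
        kind, typecode = "ucs-4", "L"     # at least 4 bytes per code point
    return kind, array(typecode, map(ord, text))

def compare(a, b):
    """Order two buffers by code point value, whatever their widths.

    Returns -1, 0 or 1.  Array elements come back as plain integers, so
    the width difference needs no special handling here; in C this loop
    is where the "bit of work" would go.
    """
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))

kind_a, buf_a = minimal_store("abc")
kind_b, buf_b = minimal_store("ab\u20ac")
kind_c, buf_c = minimal_store("ab\U0001F600")
order = compare(buf_a, buf_b)
```

Nothing is ever recoded in place: the width is chosen once, when the
buffer is built, and mixed-width operations read both buffers as
sequences of code points.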

> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

Presumably there is going to be a mechanism to open files as bytes
(reads return bytes), and for things like web servers, file servers, etc.,
serving the content up as just a bunch of bytes is really the only thing
that makes sense.
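
As a sketch of what I mean, assuming that opening a file in binary mode
("rb") makes reads return bytes, which is how Python 3 eventually
behaved.  The read_chunks helper and the temporary file are just
scaffolding for the example.

```python
import os
import tempfile

def read_chunks(path, size=8192):
    """Yield the file's content as raw bytes, with no decoding step."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            yield chunk

# Write some non-ASCII content, then serve it back untouched.
payload = "caf\u00e9\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    served = b"".join(read_chunks(tmp.name, size=2))
finally:
    os.remove(tmp.name)
```

The server never needs to know the encoding of what it is sending, so no
text-as-bytes boilerplate is involved; the data simply never becomes
text.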

 - Josiah