[Python-3000] string C API

Fri Sep 15 19:48:06 CEST 2006

"Jason Orendorff" <jason.orendorff at gmail.com> wrote:
> 
> On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> > There should be only one reference to a string until it is constructed,
> > and after that, its data should be immutable.  Recoding that results
> > in different bytes should not be in-place.  Either it returns a new
> > string (no problem) or it doesn't change the databuffer-and-encoding
> > pointer until the new databuffer is fully constructed.
> 
> Yes, but then having, say, a Latin-1 string, and repeatedly using it
> in places where UTF-16 is needed, causes you to repeat the decoding
> operation.  The optimization becomes a pessimization.
> 
> Here I'm imagining things like taking len(s) of a UTF-8 string, or
> s==u where u happens to be UTF-16.  You only have to do this once or
> twice per string to start losing.

This is one of the reasons why I was talking about Latin-1, UCS-2, and UCS-4:

If I have a text object X whose internal representation is in UCS-2, and
I have another text object Y whose internal representation is in UCS-4,
then I know X != Y.  Why?  Because X and Y were created with the minimal
width necessary to support the code points they contain.  Y must contain
at least one code point that doesn't fit in UCS-2 (otherwise it would
have been stored as UCS-2), and X can't contain that code point, so
X != Y.

When one wants to do things like Y.startswith(X), then you actually
compare the code points.
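
To make that concrete, here is a rough sketch in Python rather than C.
The names Text and min_width are made up for illustration and are not
part of any proposed API; the only assumption is that every text object
is stored at the narrowest of the three widths that fits its contents.

```python
def min_width(text):
    """Bytes per code point of the narrowest representation that fits."""
    top = max(map(ord, text), default=0)
    if top < 0x100:
        return 1  # Latin-1
    if top < 0x10000:
        return 2  # UCS-2
    return 4      # UCS-4


class Text:
    """Hypothetical text object that always uses the minimal width."""

    def __init__(self, text):
        self.points = [ord(c) for c in text]
        self.width = min_width(text)

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        # Different minimal widths imply different contents, so the
        # answer is known without touching the data buffers.
        if self.width != other.width:
            return False
        return self.points == other.points

    def __hash__(self):
        return hash(tuple(self.points))

    def startswith(self, prefix):
        # No width shortcut here: compare the actual code points.
        n = len(prefix.points)
        return self.points[:n] == prefix.points


x = Text("\u0394x")        # widest code point fits in UCS-2
y = Text("\U0001F600x")    # needs UCS-4
print(x.width, y.width, x == y)
print(y.startswith(Text("\U0001F600")))
```

The width check in __eq__ is only valid as long as the minimal-width
invariant holds for every object; if a string could ever be stored
wider than necessary, the shortcut would give wrong answers.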

 - Josiah