[Python-3000] string C API

Fri Sep 15 18:22:30 CEST 2006

On 9/15/06, Jim Jewett <jimjjewett at gmail.com> wrote:
> There should be only one reference to a string until it is constructed,
> and after that, its data should be immutable.  Recoding that results
> in different bytes should not be in-place.  Either it returns a new
> string (no problem) or it doesn't change the databuffer-and-encoding
> pointer until the new databuffer is fully constructed.

Yes, but then if you have, say, a Latin-1 string and repeatedly use it
in places where UTF-16 is needed, you repeat the decoding operation
every time.  The optimization becomes a pessimization.
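
To make the cost concrete, here is a minimal sketch of what I mean.
PolyStr, as_encoding and recode_count are hypothetical names made up
for this illustration; they are not part of any proposed C API.  The
class keeps its bytes in one fixed encoding and, because the data is
immutable, has to recode on every request for a different one.

```python
class PolyStr:
    """Hypothetical immutable string that keeps its bytes in one encoding."""

    recode_count = 0  # class-wide counter, for illustration only

    def __init__(self, data: bytes, encoding: str):
        self._data = data
        self._encoding = encoding

    def as_encoding(self, target: str) -> bytes:
        """Return the bytes in `target`, recoding if the encodings differ."""
        if target == self._encoding:
            return self._data  # cheap path: no conversion needed
        # The stored buffer is never replaced, so this work is redone
        # on every call.
        PolyStr.recode_count += 1
        return self._data.decode(self._encoding).encode(target)


s = PolyStr("café".encode("latin-1"), "latin-1")
for _ in range(3):
    utf16 = s.as_encoding("utf-16-le")  # three uses, three conversions
```

After the loop, PolyStr.recode_count is 3: one full conversion per use.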

Here I'm imagining things like taking len(s) of a UTF-8 string, or
s==u where u happens to be UTF-16.  You only have to do this once or
twice per string to start losing.
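
A rough sketch of those two cases, using plain bytes objects as
stand-ins for the internal buffers (utf8_len and mixed_eq are
hypothetical helpers, not real API).  Both have to walk the whole
buffer, where a single fixed-width representation would not.

```python
def utf8_len(data: bytes) -> int:
    """Count code points in UTF-8 data by scanning every byte."""
    # Continuation bytes look like 0b10xxxxxx; every other byte starts
    # a new code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def mixed_eq(utf8_data: bytes, utf16_data: bytes) -> bool:
    """Compare a UTF-8 buffer with a UTF-16-LE buffer for text equality."""
    # The raw bytes differ even when the text matches, so at least one
    # side must be decoded before comparing.
    return utf8_data.decode("utf-8") == utf16_data.decode("utf-16-le")


s = "naïve".encode("utf-8")      # 6 bytes, 5 code points
u = "naïve".encode("utf-16-le")  # 10 bytes, 5 code points

n = utf8_len(s)
same = mixed_eq(s, u)
```

Here n is 5 even though len(s) on the raw bytes is 6, and same is True
only after a decoding pass over both buffers.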

Also, having two different classes of strings means fewer felicitous
cases where x==y is True and the check reduces to a pointer
comparison.  This might matter in dictionaries: imagine a dictionary
created as a literal and then used to look up key strings read from a
file.
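
The dictionary case can be approximated in today's Python with
sys.intern, which is only a stand-in here: it shows the "same object"
fast path for strings of one class, and says nothing about how a
two-class design would actually behave.  The names table, raw and key
are made up for the example.

```python
import sys

table = {"alpha": 1, "beta": 2}  # keys created from literals

# Simulate a key read from a file: built at run time from raw bytes.
raw = b"alpha\n"
key = raw.decode("latin-1").strip()

# The lookup succeeds by value equality whether or not the objects match.
found = table[key]

# Interning maps equal strings to one shared object, so an equality
# check between them can stop at the pointer comparison.
literal_key = next(k for k in table if k == key)
same_object = sys.intern(key) is sys.intern(literal_key)
```

If the literal and the file data lived in different string classes,
they could never be the same object, and every such hit would need a
full comparison.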

> [Nick Coghlan wrote:]
> > [...] the
> > application is free to decouple the "reading" and "decoding" steps, and just
> > transfer raw bytes between the streams.
>
> So adding boilerplate to treat text as bytes "for efficiency" may
> become a standard recipe?  Not so good.

I'm sure this will happen to the same degree that it's become a
standard recipe in Java and C# (both of which lack polymorphic
whatzits).  Which is to say, not at all.

-j