[Python-3000] string C API

Jim Jewett jimjjewett at gmail.com
Fri Sep 15 16:25:08 CEST 2006


On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Martin v. Löwis wrote:
> > Nick Coghlan schrieb:
> >> Only the first such call on a given string, though - the idea is to use
> >> lazy decoding, not to avoid decoding altogether. Most manipulations
> >> (len, indexing, slicing, concatenation, etc) would require decoding to
> >> at least UCS-2 (or perhaps UCS-4).

Or other workarounds.

> > Ok. Then my objection is this: What about errors that occur in decoding?
> > What happens if the bytes are not meaningful in the presumed encoding?

> > ISTM that raising the exception lazily (which seems to be necessary)
> > would be very confusing.

> Yeah, it appears it would be necessary to at least *scan* the string when it
> was first created in order to ensure it can be decoded without errors later on.

What happens today with strings?  I think the answer is:
     "Nothing.
      They print something odd when printed.
      They may raise errors when explicitly recoded to unicode."
Why is this a problem?
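
Roughly, today's (2.x) behaviour amounts to:

    s = 'caf\xff'         # not valid UTF-8, but no complaint here
    print repr(s)         # prints something odd, but harmless
    s.decode('utf-8')     # only this explicit recode raises UnicodeDecodeError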

I see nothing wrong with an explicit .validate() method.

I see nothing wrong with a program choosing to recode everything into
a known encoding, which would validate as a side-effect.  This would
be the moral equivalent of today's unicode() call.

I'm not so happy about the efficiency implication of the idea that
*all* strings *must* be validated (let alone recoded).
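
To be concrete, the sort of thing I have in mind is a toy like the
following (all names made up; a sketch, not a proposal for the real
C-level layout):

    class LazyStr(object):
        """Toy lazily-decoded string: the raw bytes and their encoding
        are kept together; decoding happens (and can fail) only on
        demand."""

        def __init__(self, raw_bytes, encoding):
            self._state = (raw_bytes, encoding)  # databuffer-and-encoding pair
            self._chars = None                   # decoded text, filled in lazily

        def _decode(self):
            if self._chars is None:
                raw, enc = self._state
                self._chars = raw.decode(enc)    # may raise, but only now
            return self._chars

        def validate(self):
            """Explicit, eager check for programs that want errors up front."""
            self._decode()

        def __len__(self):
            # Most manipulations (len, indexing, slicing, ...) force the decode.
            return len(self._decode())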

> I also realised there is another issue with an internal representation that
> can change over the life of a string, which is that of thread-safety.

> Since strings don't currently have any mutable internal state, it's possible
> to freely share them between threads (without this property, the interning
> behaviour would be doomed).

Interning may get awkward if multiple encodings are allowed within a
program, regardless of whether a single string is allowed to change
encodings.  It might make sense to intern only strings that are in the
same encoding as the source code.  (Or only strings whose values are
limited to ASCII?)
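
On top of the toy above, that might be no more than (again, just a
sketch):

    _interned = {}

    def maybe_intern(s):
        # Only intern strings whose databuffer is known-safe ASCII;
        # anything in another encoding is left alone.
        raw, enc = s._state
        if enc == 'ascii':
            return _interned.setdefault(raw, s)
        return s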

> If strings could change the encoding of their internal buffers then they'd
> have to use a read/write lock internally on all operations that might be
> affected when the internal representation changes. Blech.

Why?

There should be only one reference to a string until it is constructed,
and after that, its data should be immutable.  Recoding that results
in different bytes should not happen in place.  Either it returns a new
string (no problem), or it doesn't change the databuffer-and-encoding
pointer until the new databuffer is fully constructed.

Anything keeping its own reference to the old databuffer (and old
encoding) will continue to work; immutability means the two buffers
really are equivalent.
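
In terms of the toy sketch above, recoding "in place" would be roughly:

    def recode(s, new_encoding):
        # Build the replacement databuffer completely first ...
        new_raw = s._decode().encode(new_encoding)
        # ... then rebind the databuffer-and-encoding pair as a single
        # reference (at the C level, one pointer swap).  Anything that
        # already picked up the old pair keeps a valid, equivalent
        # buffer, so no read/write lock is needed.
        s._state = (new_raw, new_encoding)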

> Sure certain applications that are just copying from one data stream to
> another (both in the same encoding) may needlessly decode and then re-encode
> the data,

Other than text editors, "certain" includes almost any application I
have ever used, let alone written.

> but if the application *knows* that this might happen (and has
> reason to care about optimising the performance of this case), then the
> application is free to decouple the "reading" and "decoding" steps, and just
> transfer raw bytes between the streams.

So adding boilerplate to treat text as bytes "for efficiency" may
become a standard recipe?  Not so good.
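
The recipe would presumably be something along these lines (made-up
stream objects, opened in binary mode):

    def copy_stream(src, dst, blocksize=64 * 1024):
        # Shuffle raw bytes between two streams known to share an
        # encoding, so nothing is ever decoded or re-encoded on the way.
        while True:
            block = src.read(blocksize)
            if not block:
                break
            dst.write(block)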

-jJ

