[Python-3000] string C API

Fri Sep 15 19:04:08 CEST 2006

On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Jim Jewett wrote:

> >> ... would be necessary to at least *scan* the string when it
> >> was first created in order to ensure it can be decoded without errors

> > What happens today with strings?  I think the answer is:
> >     "Nothing.
> >      They print something odd when printed.
> >      They may raise errors when explicitly recoded to unicode."
> > Why is this a problem?

> We don't have 8-bit strings lying around in Py3k.

Right.  But we do in Py 2.x, and the equivalent delayed errors have
not been a serious problem.  I suppose that might change if everyone
were actually using unicode, so that more stuff got converted
eventually.  On the other hand, I'm not sure how many strings will
*ever* need recoding, if we don't do it on construction.

> To convert bytes to
> characters, they *must* be converted to unicode code points.

A "code point" doesn't exist in actual code; it has to be represented
by some concrete encoding.  The most common encodings are UTF-8 and
the various forms of UTF-16 and UTF-32, but they are still concrete
encodings, rather than the "real" code point.  A bytestream in latin-1
(with meta-knowledge that it is in latin-1) represents those abstract
code points just as much as a bytestream in UTF-8 would.  For some
purposes (including error detection) it is less efficient, but it is
just as valid.
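
A minimal sketch of that point, assuming the str/bytes split of
present-day Python 3 (which postdates this message): one sequence of
code points is written out in four concrete encodings, and each byte
string decodes back to the same text once the encoding is known.  The
sample word and the variable names are chosen only for illustration.

```python
# One abstract code point sequence, several concrete byte representations.
text = "caf\u00e9"  # U+00E9 is LATIN SMALL LETTER E WITH ACUTE

encodings = ["latin-1", "utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # The byte strings differ in length and content from one encoding to the next.
    print(f"{name:10} {len(raw):2d} bytes  {raw!r}")

# Knowing the encoding (the "meta-knowledge") is what makes the bytes meaningful:
# every representation decodes back to the same code points.
decoded = {name: raw.decode(name) for name, raw in encoded.items()}
```

The latin-1 form is the most compact here, but it can only represent
code points below 256.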

> > I'm not so happy about the efficiency implication of the idea that
> > *all* strings *must* be validated (let alone recoded).

> Then always define latin-1 as the source encoding for your files - it will
> just pass the bytes straight through.

That would work for skipping validation.  It won't work if Python
insists on recoding everything to an internally privileged encoding.
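
The validation half of this can be sketched in present-day Python 3
(an assumption; that behaviour postdates this message).  Decoding as
latin-1 accepts every byte value and round-trips exactly, while
decoding as UTF-8 has to validate the input and can fail.  Note that
the latin-1 decode still builds a new str object, so it avoids
validation errors rather than avoiding the conversion step itself.

```python
# Every possible byte value, including ones that are invalid as UTF-8.
raw = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding cannot fail,
# and encoding the result gives back the original bytes.
as_latin1 = raw.decode("latin-1")
round_trip = as_latin1.encode("latin-1")

# UTF-8 has to validate the byte sequence and rejects this input.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```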

> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Unicode strings don't have an encoding - they only store code points.

But these code points are stored somehow.  In Py 2.x, the decision was
to always use a specific privileged encoding, and to choose that
encoding at compile time.  This decision was not required by unicode;
it was chosen for implementation reasons.

> I admit that by using a separate Python object for the data buffer instead of
> a pointer to raw memory, the incref/decref in the processing code becomes the
> moral equivalent of a read lock, but consider the case where Thread A performs
> an operation and decides "I need to recode the buffer to UCS-4" at the same
> time that Thread B performs an operation and decides "I need to recode the
> buffer to UCS-4".

Then you end up doing it twice, and wasting even more space.  I
expect "never need to change the encoding" will be far more common
than
        (1)  Application is multithreaded
and     (2)  Multiple threads happen to be using the same string
and     (3)  Multiple threads need to recode it to the same new
             encoding at the same time
and     (4)  This recoding need was in some way conditional, so the
             programmer felt it was sensible to request it in both
             places, instead of just recoding once on creation.
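
A rough sketch of why the duplicated work is harmless, written in
present-day Python 3.  The classes TaggedBuffer and LazyString are
hypothetical and exist only for this illustration; they are not part
of any proposed or actual API.  Each thread reads the current buffer,
builds a recoded copy if needed, and swaps in a reference to it.  If
both threads race, both produce equivalent buffers and the last swap
wins.

```python
import threading


class TaggedBuffer:
    """Hypothetical: immutable bytes plus the name of their encoding."""

    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding


class LazyString:
    """Hypothetical string that keeps its original encoding until recoded."""

    def __init__(self, data, encoding):
        self.buffer = TaggedBuffer(data, encoding)

    def recode(self, encoding):
        current = self.buffer  # holding this reference acts as the "read lock"
        if current.encoding == encoding:
            return current  # already in the requested encoding; nothing to do
        text = current.data.decode(current.encoding)
        fresh = TaggedBuffer(text.encode(encoding), encoding)
        self.buffer = fresh  # single reference swap; last writer wins
        return fresh


def worker(shared, results, index):
    # Both threads ask for the same target encoding.
    results[index] = shared.recode("utf-32-le")


shared = LazyString("na\u00efve".encode("latin-1"), "latin-1")
results = [None, None]
threads = [
    threading.Thread(target=worker, args=(shared, results, i)) for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whether one thread or both actually performed the recoding depends on
scheduling; either way the two results hold identical bytes, and the
only cost of a race is the extra temporary copy.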

> And this style has some very serious overhead implications, as each string
> would now require:
>    The string object, with a 32 or 64 bit pointer to the data buffer object
>    The data buffer object

> String memory overhead would double, with an additional 32 or 64 bits
> depending on platform. This is a pretty significant increase when it comes to
> identifier-length strings.

Dicts already have to deal with this.  The workaround there was to
have a smalltable fastened to the dict, and to waste that smalltable
if the dictionary grows too large.  Strings could do something
similar.  (Either all strings, keeping the original encoding, or just
small strings, so that not too much will ever be wasted.)

> >> Sure certain applications that are just copying from one data stream to
> >> another (both in the same encoding) may needlessly decode and then
> >> re-encode the data,

> > Other than text editors, "certain" includes almost any application I
> > have ever used, let alone written.

> If you're reading text and you *know* it is ASCII data, then you can just set
> the encoding to latin-1

Only if latin-1 is a valid encoding for the internal implementation.
If it is, then Python does have to allow multiple internal
implementations, and some way of marking which was used.  (Obviously,
I think this is the right answer, but this is a change from 2.x, and
would require some changes to the C API.)

-jJ