[Python-3000] string C API

Sat Sep 16 05:14:49 CEST 2006

Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> If you're reading text and you *know* it is ASCII data, then you can
>> just set the encoding to latin-1
> 
> Only if latin-1 is a valid encoding for the internal implementation.

I think the possible internal encodings should be latin-1, UCS-2 and UCS-4, 
with the size for a given string dictated by the largest code point in the 
string at creation time.

That way the internal representation of a string would only need to grow one 
extra field (the one saying how many bytes there are per character), and the 
internal state would remain immutable.
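
As a rough illustration of the width-selection rule (a sketch only; the name 
bytes_per_char is hypothetical and is not part of any existing or proposed 
API), the extra field could be computed once at creation time like this:

```python
def bytes_per_char(text: str) -> int:
    """Return the narrowest fixed width (1, 2 or 4 bytes) that can hold
    every code point in text. Hypothetical helper, for illustration only."""
    # An empty string has no code points; treat it as fitting in 1 byte.
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1  # latin-1: code points 0..255
    if largest <= 0xFFFF:
        return 2  # UCS-2: code points up to U+FFFF
    return 4      # UCS-4: everything else


if __name__ == "__main__":
    for sample in ("abc", "caf\u00e9", "\u20ac", "\U0001F600"):
        print(repr(sample), bytes_per_char(sample))
```

Because the width is fixed when the string is created and never revised, 
the object stays immutable.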

For 8-bit source data, 'latin-1' would then be the most efficient encoding, in 
that it would be a simple memcpy from the bytes object's internal buffer to 
the string object's internal buffer. Other encodings like 'koi8-r' would be 
decoded to either latin-1, UCS-2 or UCS-4 depending on the largest code point 
in the source data.
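
To make the difference concrete, here is a minimal sketch in today's Python. 
The function decode_to_fixed_width is hypothetical; it uses the existing 
'latin-1', 'utf-16-le' and 'utf-32-le' codecs as stand-ins for the three 
proposed internal layouts, and it assumes the decoded text contains no lone 
surrogates.

```python
from typing import Tuple


def decode_to_fixed_width(data: bytes, encoding: str) -> Tuple[int, bytes]:
    """Decode data and return (bytes per character, fixed-width buffer).

    Hypothetical helper: it models the proposed internal layouts with
    existing codecs and is not how any interpreter stores strings.
    """
    text = data.decode(encoding)
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        # latin-1 maps code point N to byte N, so for latin-1 input the
        # buffer is byte-for-byte identical to the source data.
        return 1, text.encode("latin-1")
    if largest <= 0xFFFF:
        # Every code point fits in one 16-bit unit, so no surrogate pairs.
        return 2, text.encode("utf-16-le")
    return 4, text.encode("utf-32-le")


if __name__ == "__main__":
    raw = b"plain ASCII"
    print(decode_to_fixed_width(raw, "latin-1") == (1, raw))

    cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
    width, buf = decode_to_fixed_width(cyrillic, "koi8-r")
    print(width, len(buf))
```

For latin-1 input the result is the original bytes unchanged, which is the 
memcpy case; the koi8-r input needs a real translation step and ends up two 
bytes wide because Cyrillic code points lie above U+00FF.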

[Jim]
> If it is, then Python does have to allow multiple internal
> implementations, and some way of marking which was used.  (Obviously,
> I think this is the right answer, but this is a change from 2.x, and
> would require some changes to the C API.)

One of the paragraphs you cut when replying to my message:

[Nick]
>> Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as 
>> internal representations, and choosing which one to use when the string is 
>> created.

I think we might be violently agreeing :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org