[Python-3000] string C API
Nick Coghlan
ncoghlan at gmail.com
Sat Sep 16 05:14:49 CEST 2006
Jim Jewett wrote:
> On 9/15/06, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> If you're reading text and you *know* it is ASCII data, then you can just
>> set the encoding to latin-1
>
> Only if latin-1 is a valid encoding for the internal implementation.
I think the possible internal encodings should be latin-1, UCS-2 and UCS-4,
with the size for a given string dictated by the largest code point in the
string at creation time.
That way the internal representation of a string would only need to grow one
extra field (the one saying how many bytes there are per character), and the
internal state would remain immutable.
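To make the selection rule concrete, here is a rough sketch in Python rather
than C. The name bytes_per_char is purely illustrative and not part of any
existing or proposed API; it only shows how the extra field could be computed
once, at creation time.

```python
def bytes_per_char(text):
    """Return 1, 2 or 4: the narrowest fixed width that fits every code point.

    Hypothetical helper for illustration only.
    """
    # An empty string has no code points, so it trivially fits in one byte.
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        return 1  # every character fits in latin-1
    if largest <= 0xFFFF:
        return 2  # every character fits in UCS-2
    return 4      # at least one character needs UCS-4
```

Since the width is derived from the contents when the string is built and the
contents never change, the field never needs updating afterwards.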
For 8-bit source data, 'latin-1' would then be the most efficient encoding, in
that it would be a simple memcpy from the bytes object's internal buffer to
the string object's internal buffer. Other encodings like 'koi8-r' would be
decoded to either latin-1, UCS-2 or UCS-4 depending on the largest code point
in the source data.
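As a sketch of what that decoding step might look like (again in Python, with
a made-up function name, and with little-endian UTF-16/UTF-32 standing in for
the UCS-2/UCS-4 buffers as an assumption of mine):

```python
def decode_to_internal(data, encoding):
    """Decode bytes and return (bytes_per_char, buffer).

    Hypothetical illustration only; assumes the decoded text contains no
    lone surrogates.
    """
    text = data.decode(encoding)
    largest = max(map(ord, text), default=0)
    if largest <= 0xFF:
        # For latin-1 input this buffer is byte-for-byte the source data.
        return 1, text.encode("latin-1")
    if largest <= 0xFFFF:
        # No code point is above 0xFFFF, so each one takes exactly 2 bytes.
        return 2, text.encode("utf-16-le")
    return 4, text.encode("utf-32-le")
```

With latin-1 input the returned buffer equals the input bytes, which is the
memcpy case. koi8-r input containing Cyrillic letters comes back two bytes
wide, while pure-ASCII koi8-r input stays one byte wide.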
[Jim]
> If it is, then python does have to allow multiple internal
> implementations, and some way of marking which was used. (Obviously,
> I think this is the right answer, but this is a change from 2.x, and
> would require some changes to the C API.)
One of the paragraphs you cut when replying to my message:
[Nick]
>> Far, far simpler is the idea of supporting only latin-1, UCS-2 and UCS-4 as
>> internal representations, and choosing which one to use when the string is
>> created.
I think we might be violently agreeing :)
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org