[Python-3000] string C API

Thu Sep 14 18:46:06 CEST 2006

"Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> Nick Coghlan <ncoghlan at gmail.com> writes:
> 
> > Only the first such call on a given string, though - the idea
> > is to use lazy decoding, not to avoid decoding altogether.
> > Most manipulations (len, indexing, slicing, concatenation, etc)
> > would require decoding to at least UCS-2 (or perhaps UCS-4).
> 
> Silently optimizing string recoding might change the way recoding
> errors are reported. i.e. they might not be reported at all even
> if the string is malformed. Optimizations which change the semantics
> are bad.

This is not a problem.  When the string is constructed, you would
either recode the input into the standard 'compressed' format or, if
the input is already in that format, attempt to decode it anyway.  If
that fails, you report that the input wasn't in the encoding originally
specified.
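
To make the point about error reporting concrete, here is a minimal
sketch of validating at construction time.  The name make_string and
the choice of utf-8 as the stored format are assumptions for
illustration only, not a proposed API.  The point is that the strict
decode always runs, even when the input already claims to be in the
storage encoding, so malformed input is rejected up front instead of
slipping through a lazy path.

```python
def make_string(raw: bytes, encoding: str) -> bytes:
    """Hypothetical constructor: validate eagerly, store as utf-8."""
    try:
        # Always decode strictly, even if `encoding` is already utf-8,
        # so malformed input cannot be stored unchecked.
        text = raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"input is not valid {encoding}: {exc.reason}"
        ) from exc
    # Recode into the (assumed) standard storage format.
    return text.encode("utf-8")
```

With this shape, make_string(b"\xff", "utf-8") fails immediately, and
any later lazy decoding only ever sees data that is already known to
be well-formed.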

Personally though, I'm not terribly inclined to believe that using a
'compressed' representation of utf-8 is desirable.  Why not use latin-1
when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
isn't enough?  You get a fixed-width character encoding, and aside from
the (annoying) need to write variants of each string function for each
width (macros would help here), or generic versions of each, you never
need to recode the initial string after it has been created.
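
A rough sketch of the width selection described above, in Python
rather than C for brevity.  The helper names (narrowest_width, pack,
char_at) and the little-endian layout are assumptions made for
illustration.  Each character is stored in 1, 2, or 4 bytes depending
on the largest code point in the string, so indexing is a plain offset
calculation and the buffer is never recoded after creation.

```python
def narrowest_width(text: str) -> int:
    """Bytes per character: 1 (latin-1), 2 (ucs-2), or 4 (ucs-4)."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def pack(text: str) -> tuple[int, bytes]:
    """Store every character at the same fixed width (little-endian)."""
    width = narrowest_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data


def char_at(width: int, data: bytes, index: int) -> int:
    """O(1) indexing: the code point lives at a fixed offset."""
    start = index * width
    return int.from_bytes(data[start:start + width], "little")
```

For example, pack("abc") keeps one byte per character, while a string
containing a single non-BMP character is stored at four bytes per
character throughout.  That is the space cost of this scheme, paid in
exchange for never having to recode.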

Even better, with a slightly modified buffer interface, these characters
can be exposed to C extensions in a somewhat transparent manner (if
desired).

 - Josiah