[Python-3000] string C API

Bob Ippolito bob at redivi.com
Thu Sep 14 18:47:17 CEST 2006


On 9/14/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Marcin 'Qrczak' Kowalczyk" <qrczak at knm.org.pl> wrote:
> > Nick Coghlan <ncoghlan at gmail.com> writes:
> >
> > > Only the first such call on a given string, though - the idea
> > > is to use lazy decoding, not to avoid decoding altogether.
> > > Most manipulations (len, indexing, slicing, concatenation, etc)
> > > would require decoding to at least UCS-2 (or perhaps UCS-4).
> >
> > Silently optimizing string recoding might change the way recoding
> > errors are reported. i.e. they might not be reported at all even
> > if the string is malformed. Optimizations which change the semantics
> > are bad.
>
> This is not a problem.  During construction of the string, you would
> either be recoding the original string to the standard 'compressed'
> format, or if they had the same format, you would attempt a decoding,
> and on failure, claim that the input wasn't in the encoding originally
> specified.
>
>
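A rough sketch of that creation-time rule (the names are made up, and
ASCII stands in for whatever encoding the caller named): the decode
happens exactly once, when the object is built, so switching internal
representations later can never hide a "this wasn't really ASCII" error.

    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        size_t    len;
        uint32_t *chars;            /* decoded code points */
    } str_object;

    /* Decode once at construction; report malformed input immediately. */
    str_object *str_from_ascii(const unsigned char *bytes, size_t n)
    {
        uint32_t *chars = malloc((n ? n : 1) * sizeof(uint32_t));
        if (chars == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++) {
            if (bytes[i] > 0x7F) {  /* not ASCII: fail now, not later */
                free(chars);
                return NULL;
            }
            chars[i] = bytes[i];
        }
        str_object *s = malloc(sizeof(*s));
        if (s == NULL) {
            free(chars);
            return NULL;
        }
        s->len = n;
        s->chars = chars;
        return s;
    }
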
> Personally though, I'm not terribly inclined to believe that using a
> 'compressed' representation of utf-8 is desirable.  Why not use latin-1
> when possible, ucs-2 when latin-1 isn't enough, and ucs-4 when ucs-2
> isn't enough?  You get a fixed-width character encoding, and aside from
> the (annoying) need to write variants of each string function for each
> width (macros would help here), or generic versions of each, you never
> need to recode the initial string after it has been created.
>
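If I follow, picking the width is a single scan over the decoded
characters at creation time. A minimal sketch of the idea (not any
existing API):

    #include <stddef.h>
    #include <stdint.h>

    /* Pick the narrowest fixed-width storage that holds every character:
     * 1 byte (latin-1), 2 bytes (UCS-2) or 4 bytes (UCS-4).  Done once
     * when the string is created, so it never needs recoding later. */
    static int char_width(const uint32_t *chars, size_t n)
    {
        int width = 1;
        for (size_t i = 0; i < n; i++) {
            if (chars[i] > 0xFFFF)
                return 4;
            if (chars[i] > 0xFF)
                width = 2;
        }
        return width;
    }
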
> Even better, with a slightly modified buffer interface, these characters
> can be exposed to C extensions in a somewhat transparent manner (if
> desired).
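
Something along these lines, perhaps: a hypothetical char-buffer view
(names invented for illustration, not an existing interface) where the
exporter reports the item size, so an extension can index characters
directly whatever width the string happens to use internally.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical char-buffer view; the names are invented. */
    typedef struct {
        const void *buf;       /* latin-1, UCS-2 or UCS-4 data */
        size_t      len;       /* length in characters, not bytes */
        int         itemsize;  /* 1, 2 or 4 */
    } charbuffer;

    /* Read character i as a UCS-4 code point, whatever the width. */
    static uint32_t charbuffer_get(const charbuffer *view, size_t i)
    {
        switch (view->itemsize) {
        case 1:  return ((const uint8_t  *)view->buf)[i];
        case 2:  return ((const uint16_t *)view->buf)[i];
        default: return ((const uint32_t *)view->buf)[i];
        }
    }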

The argument for UTF-8 is probably interop efficiency. Lots of C
libraries, file formats, and wire protocols use UTF-8 for interchange.
Verifying the validity of UTF-8 during string creation isn't that big
of a deal.
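
For what it's worth, checking well-formedness is one linear pass over
the bytes. A sketch of such a check (just the idea, not code from any
particular library):

    #include <stddef.h>

    /* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.
     * Rejects bad continuation bytes, overlong forms, surrogates
     * and code points above U+10FFFF. */
    static int utf8_valid(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = buf[i];
            if (b < 0x80) {                       /* ASCII */
                i += 1;
            } else if (b >= 0xC2 && b <= 0xDF) {  /* 2-byte sequence */
                if (i + 1 >= len || (buf[i+1] & 0xC0) != 0x80)
                    return 0;
                i += 2;
            } else if (b >= 0xE0 && b <= 0xEF) {  /* 3-byte sequence */
                unsigned char lo = 0x80, hi = 0xBF;
                if (b == 0xE0) lo = 0xA0;         /* no overlong forms */
                if (b == 0xED) hi = 0x9F;         /* no surrogates */
                if (i + 2 >= len ||
                    buf[i+1] < lo || buf[i+1] > hi ||
                    (buf[i+2] & 0xC0) != 0x80)
                    return 0;
                i += 3;
            } else if (b >= 0xF0 && b <= 0xF4) {  /* 4-byte sequence */
                unsigned char lo = 0x80, hi = 0xBF;
                if (b == 0xF0) lo = 0x90;         /* no overlong forms */
                if (b == 0xF4) hi = 0x8F;         /* stay <= U+10FFFF */
                if (i + 3 >= len ||
                    buf[i+1] < lo || buf[i+1] > hi ||
                    (buf[i+2] & 0xC0) != 0x80 ||
                    (buf[i+3] & 0xC0) != 0x80)
                    return 0;
                i += 4;
            } else {                              /* 0x80-0xC1, 0xF5-0xFF */
                return 0;
            }
        }
        return 1;
    }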

-bob

