[Python-3000] string C API

Fri Sep 15 23:37:41 CEST 2006

On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
>
> "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > Interning may get awkward if multiple encodings are allowed within a
> > program, regardless of whether they're allowed for single strings.  It
> > might make sense to intern only strings that are in the same encoding
> > as the source code.  (Or whose values are limited to ASCII?)

> Why?  If the text hash function is defined on *code points*, then
> interning, or really any arbitrary dictionary lookup is the same as it
> has always been.

The problem isn't the hash; it is the equality.  Which encoding do you
keep interned?
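
To make the equality point concrete, here is a minimal sketch. It assumes a
hypothetical `Text` class that stores its data in a per-object encoding, and a
hypothetical `intern_text` helper; neither is a proposed API. Hash and
equality are both defined on code points, so lookup works, but the intern
table simply keeps whichever representation arrived first.

```python
class Text:
    """Hypothetical text object: raw bytes plus the encoding they are stored in."""

    def __init__(self, value, encoding):
        self.encoding = encoding
        self.raw = value.encode(encoding)

    def code_points(self):
        # Decode the stored bytes back into a tuple of code points.
        return tuple(ord(ch) for ch in self.raw.decode(self.encoding))

    def __hash__(self):
        # Hash is defined on code points, not on the stored bytes.
        return hash(self.code_points())

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        return self.code_points() == other.code_points()


_interned = {}  # hypothetical intern table


def intern_text(t):
    """Return the canonical object equal to t, registering t if it is new."""
    return _interned.setdefault(t, t)


a = Text("caf\u00e9", "latin-1")
b = Text("caf\u00e9", "utf-16-le")

kept = intern_text(a)   # a is the first arrival, so it becomes canonical
same = intern_text(b)   # b is equal to a, so the table hands back a
```

After this runs, `same` is `a`, stored as latin-1, even though the caller
handed in a utf-16-le object. Whether that is acceptable is the open question.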

> What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> method I previously described is that each of the encodings offer a
> minimal representation of the code points that the text object contains.

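There may be some thrashing between representations with code such as

    s += larger_char    # forces a recode to a wider representation
    s[:6]               # the slice fits the narrow representation again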
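
A rough sketch of that thrashing, assuming the implementation always picks
the narrowest of the three representations (1, 2 or 4 bytes per code point).
The helper name `minimal_width` is made up for illustration.

```python
def minimal_width(s):
    """Bytes per code point for the narrowest of latin-1 / UCS-2 / UCS-4
    that can hold every code point in s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1    # latin-1
    if top < 0x10000:
        return 2    # UCS-2
    return 4        # UCS-4


s = "abcdef"
before = minimal_width(s)      # all ASCII: one byte per code point

s += "\u20ac"                  # EURO SIGN does not fit in latin-1
widened = minimal_width(s)     # whole string must be stored two bytes wide

head = s[:6]
after = minimal_width(head)    # slice drops the wide character again
```

Each step that changes the width implies copying and recoding the whole
string if the minimal representation is strictly enforced.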

The three options might well be a sensible choice, but I think they
would already carry most of the disadvantages of multiple internal
encodings, and we might eventually regret any specific limits.  (Why
not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
It is easy enough to answer why not for each specific case, but I'm
not *certain* that it is the right answer -- so why not leave it up to
implementors if they want to do more than the basic three?

> Presumably there is going to be a mechanism to open files as bytes
> (reads return bytes), and for things like web servers, file servers, etc.,
> serving the content up as just a bunch of bytes is really the only thing
> that makes sense.

If someone has to recognize that their document is "text" when they
edit it, but "bytes" when they serve it over the web, and then "text"
again when they view it in the browser ... that is a recipe for
misunderstandings.
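
A small sketch of that round trip, written with the str/bytes split as
Python 3 eventually shipped it (an assumption; the API was still undecided
when this was posted). The sample document and the choice of encodings are
made up for illustration.

```python
document = "na\u00efve caf\u00e9"      # what the author edits: text

payload = document.encode("utf-8")    # what the server sends: bytes
viewed = payload.decode("utf-8")      # what the browser shows: text again

# If the viewer guesses a different encoding, the bytes still decode,
# but to different text.
misread = payload.decode("latin-1")
```

The same document changes type twice on its way to the reader, and every
hop has to agree on the encoding for `viewed` to match `document`.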

-jJ