[Python-3000] string C API

Sat Sep 16 02:13:33 CEST 2006

"Jim Jewett" <jimjjewett at gmail.com> wrote:
> On 9/15/06, Josiah Carlson <jcarlson at uci.edu> wrote:
> > "Jim Jewett" <jimjjewett at gmail.com> wrote:
> > > Interning may get awkward if multiple encodings are allowed within a
> > > program, regardless of whether they're allowed for single strings.  It
> > > might make sense to intern only strings that are in the same encoding
> > > as the source code.  (Or whose values are limited to ASCII?)
> 
> > Why?  If the text hash function is defined on *code points*, then
> > interning, or really any arbitrary dictionary lookup is the same as it
> > has always been.
> 
> The problem isn't the hash; it is the equality.  Which encoding do you
> keep interned?

Any unicode string has exactly one minimal 'encoding' (one of latin-1,
ucs-2, or ucs-4), which is really just an array of minimal-width
char/short/int code points.  Because every text object is internally
represented in its minimal 'encoding', equal text objects will always be
in the same encoding.
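
To make the minimal-width rule concrete, here is a rough sketch in
Python.  The names minimal_width and pack are made up for illustration,
and the little-endian byte layout is an assumption of the sketch, not a
proposed C API.

```python
def minimal_width(text):
    """Smallest per-code-point width (1, 2, or 4 bytes) that can hold text."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1    # fits in latin-1 (char)
    if highest < 0x10000:
        return 2    # fits in ucs-2 (short)
    return 4        # needs ucs-4 (int)


def pack(text):
    """Return (width, bytes): fixed-width, little-endian code points."""
    width = minimal_width(text)
    data = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, data
```

The width depends only on the code points, so two equal texts always
pack to the same width and the same bytes, and equality never has to
compare across encodings.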

> > What about never recoding?  The benefit of the latin-1/ucs-2/ucs-4
> > method I previously described is that each of the encodings offers a
> > minimal representation of the code points that the text object contains.
> 
> There may be some thrashing as
> 
>     s+= (larger char)
>     s[:6]

So there may be thrashing.  I don't see this as a problem.  String
addition and slicing are already known to be linear in the length of the
string being produced for all nontrivial cases.  With a change of width
thrown in, they are still linear.  What's the problem?
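
A rough Python sketch of the two operations from your example, modelling
a text object as a plain list of code points.  The helpers width_of,
concat, and take are hypothetical names; the only point is that each
result is built in a single pass over its own length.

```python
def width_of(code_points):
    """Minimal width (1, 2, or 4 bytes) for a sequence of code points."""
    highest = max(code_points, default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


def concat(a, b):
    """s + t: copy both inputs once, then pick the width of the result."""
    out = list(a) + list(b)
    return width_of(out), out


def take(points, start, stop):
    """s[start:stop]: copy the slice once, then pick the width of the result."""
    out = list(points[start:stop])
    return width_of(out), out
```

Appending a wide character to a latin-1 string widens the result to
ucs-2, and slicing the wide character back off narrows it again.  Both
steps copy the result exactly once, which is the same cost as without
any width change.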

> The three options might well be a sensible choice, but I think it
> would already have much of the disadvantage of multiple internal
> encodings, and we might eventually regret any specific limits.  (Why
> not the local 8-bit?  Why not UTF-8, if that is the system encoding?)
> It is easy enough to answer why not for each specific case, but I'm
> not *certain* that it is the right answer -- so why not leave it up to
> implementors if they want to do more than the basic three?

By "basic three" I presume you mean latin-1, ucs-2, and ucs-4.  I'm not
advocating for anything beyond those; in fact, I'm specifically
discouraging using anything other than those three, and I'm specifically
discouraging the idea of recoding internal representations.  Once a
text object is created, its internal state is fixed until it is
destroyed.

> > Presumably there is going to be a mechanism to open files as bytes
> > (reads return bytes), and for things like web servers, file servers, etc.,
> > serving the content up as just a bunch of bytes is really the only thing
> > that makes sense.
> 
> If someone has to recognize that their document is "text" when they
> edit it, but "bytes" when they serve it over the web, and then "text"
> again when they view it in the browser ... that is a recipe for
> misunderstandings.

They don't need to recognize anything when it is served onto the web. 
Just like they don't need to recognize anything right now.  The file is
served verbatim off of disk, which is then understood by the browser
because of encoding information built into the format.  If the format
doesn't have encoding information built into it, then the user isn't
going to be able to edit it.
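
As a sketch of what "served verbatim" means, assuming a binary-mode
source (something like open(path, "rb")) and any object with a write()
method as the sink.  The function name serve_verbatim and the chunk size
are made up for the example.

```python
import io


def serve_verbatim(source, sink, chunk_size=8192):
    """Copy raw bytes from source to sink without decoding anything."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:       # empty bytes object means end of input
            break
        sink.write(chunk)


# Usage: the bytes come out exactly as they went in, whatever encoding
# the document happens to be written in.
src = io.BytesIO(b"caf\xe9 \xe2\x82\xac")
dst = io.BytesIO()
serve_verbatim(src, dst, chunk_size=3)
```

At no point does the server need to know, or guess, how the bytes map to
text.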

 - Josiah