[Python-3000] characters data type

Tue May 2 23:47:07 CEST 2006

On 5/2/06, Talin <talin at acm.org> wrote:
> It appears that my question has been misunderstood by everyone; I'll try
> to phrase it better:
>
> The short version is: will there be a mutable character array type? (which
> I am calling "characters"?)

There are no plans for this AFAIK.

> First, I do use array, not a lot but I do use it occasionally. One common
> use case is equivalent to the Java StringBuffer class - that is, a means
> for building up strings a character at a time, which otherwise would be
> expensive to do with immutable strings.

Better ways to do this might be [c]StringIO (in theory -- I don't know
if it's fast enough in practice, but this should be easy to test) or
the standard "".join(<list of strings>) approach (which underlies
StringIO's implementation as well -- though not cStringIO's IIRC).

> Now, from the discussion of "bytes" I get the impression that it, too, is
> a mutable type (someone said 'like list').

Correct. (You have no excuse to guess -- the implementation is in the
p3yk (sic) branch.)

> So given that 'characters' (i.e.
> unicode characters) are now distinct from 'bytes', it makes sense to me to
> declare a mutable character array. And to me, the most natural name for
> such a type is 'characters', although I suppose you could also call it
> "stringbuffer" or something.

I'm not sure that we need this. I'm not 100% sure we don't either, but
I believe that the existing approach works pretty well -- it's hard to
beat the "".join(<list>) approach.

> BTW, is the internal encoding of unicode strings UTF-8, UTF-16, UCS-2, or
> UTF-32? Just wondering...

Again, please look at the implementation. It's an array of shortish
ints; whether these are 2 or 3 or 4 bytes long depends on config
constants. (Just kidding about the 3 bytes version. :-) I think that
makes it UTF-16 or UTF-32. There is some support for surrogates when 2
bytes are used so I believe that excludes UCS-2.

Note that UTF-8 would make the implementation of Python's typical
string API painful; we currently assume (because it's true ;-) that
random access to elements and slices (__getitem__ and __getslice__) is
O(1). With UTF-8 these operations would be slow -- the simplest
implementation would require counting characters from the start; one
can speed this up with some kind of cache or tree but IMO the
array-of-fixed-width-characters approach is much simpler. (I had a bad
experience in my youth with strings implemented as trees, so I'm
biased against complicated string implementations. This also explains
why I'm no fan of the oft-proposed idea that slices should avoid
making physical copies even if they make logical copies -- the
complexity of that approach horrifies me.)

--
--Guido van Rossum (home page: http://www.python.org/~guido/)