Python speed and `pcre'

Tim Peters tim_one at email.msn.com
Mon Sep 6 03:44:14 EDT 1999


[François Pinard]
> ...
> If strings are immutable, there is really no need to copy a slice
> contents.  The only reason I really see to copy would be to
> simplify the garbage collector, which is more related to a particular
> implementation than the language design itself.

The design issues are more complicated than that, alas.  Here's one you
weren't thinking of <wink>:  for easy chatting with C, under the covers a
Python string is always terminated with a null byte.  You can't see that
from Python code (it's not counted in the string length, and you can't index
to it), but your C code can rely on it.

> Somewhere, it is said that some extra-space is allocated at end
> of sequences, to allow for some peaceful growth.

True of the mutable list sequence type, not true of the immutable string or
tuple sequence types.  Objects of immutable sequence types look like this
internally:

    pointer to type object
    refcount
    number of elements
    ... possibly other type-dependent fields ...
    vector with that many elements

That's one contiguous block of memory.  Objects of mutable sequence types
look the same, except the last field is a pointer *to* the vector of
elements, which latter is allocated separately.

So the immutable types are more storage-efficient, and faster to allocate
and deallocate, but cannot grow (Python never moves an object once
allocated, again making life a relative pleasure for non-Python code).

> Does `str=str+str1' at least attempt to use of this extra-space?

There's no extra space in a string.  A mutable sequence of characters can,
however, be gotten by using the std "array" module; like

>>> import array
>>> str1 = array.array("c", "This is") # array of "c"haracters
>>> str1.fromstring(" an example.")
>>> print str1.tostring()
This is an example.
>>> str1[8:8] = array.array("c", "not ")
>>> print str1
array('c', 'This is not an example.')
>>>

> It should, in my opinion.  Of course, you might tell me that the
> implementation does not notice the form `str=str+X' as really meaning
> `str+=X', but then, again, the inefficiency might be related to the
> implementation.

Both apply here, but the latter dominates.

> The trend, for a lot of years now, is to improve the implementation
> for speed, rather than adapt the programming style of all users to
> a given implementation.  If Python is meant to be a clean, clear
> language, one should not have to resort to convolutions to give it
> reasonable speed.  Or else, things look a bit artificial, and the
> gain in clarity is a bit lost.

You lose one way or another.  Use some descriptor-based mechanism instead
and half the world is upset at the extra storage that consumes when you've
got oodles of small strings (my friends at Sun tell me that's a common
bitter complaint from database slingers about Java's "shared" strings).

Python's internals tend to be straightforward and easy to explain, although
that's less true today than 5 years ago.  No implementation of anything is
maximally efficient for all purposes, and Python picked implementations that
are maximally efficient for the things Guido likes to do most <wink>.

> Oh, I'm sure we get used to idioms after a while, then simplicity
> is lost.

One of Python's predecessors was ABC, which represented most everything
internally as AVL trees.  Inserting, deleting, appending, indexing, sorting,
... none were ever expensive.  On the other hand, none were ever cheap
either, and overall ABC was plain sluggish despite that it saved you from
bad cases.

Is that simpler, or more complex?

> I'm not saying that `str+=X' should make it into the language, I
> quite guess this has been debated to death already, and I agree that
> the writing `str=str+X' has the virtue of clarity.

Augmented assignments (+= and friends) will probably show up some day.
That's a different issue, though; string immutability isn't a consequence of
the syntax.

Interesting:  Unlke Python, the string-processing Icon language supports
augmented assignment, and assignment to string slices, as in (pretending,
for clarity, that it uses "+" for string catenation and "=" for assignment):

    somestring += someotherstring
    somestring[i:j] = replacethemiddle

But strings in Icon are immutable too, and those are really syntactic sugar
for:

    somestring = somestring + someotherstring
    somestring = somestring[:i] + replacethemiddle + somestring[j:]

There is one special case in Icon.  If s2 is the string most recently
allocated, and s1 is the string second-most recently allocated, then

    s1 +:= s2

appends in place, and s1 retains its status as 2nd-most recently allocated
(so this works as hoped for in a loop without other cats).  Entire programs
are designed around that trick <0.5 wink>.

a-language-free-of-tradeoffs-doesn't-exist-ly y'rs  - tim






More information about the Python-list mailing list